matthewhirschey / ddh.org

datadrivenhypothesis.org is a resource to query 100+ GB of raw biological science data to develop data-driven hypotheses

Need better way to rank search output #79

Closed matthewhirschey closed 4 years ago

matthewhirschey commented 4 years ago

While it seemed OK to have to scroll to find the gene of interest (in a small pond of genes), in the case of "TP53" you never get to see the actual gene: the head=10 limit means that several other genes ranked ahead of it alphabetically push TP53 off the bottom of the list.

Need to think about a better way to return gene of interest.

One idea: sequential search. Instead of str_detect... | str_detect, can we

  1. Search gene_name %>% (most specific)
  2. Search aka %>% (most likely alternative)
  3. Search approved_name (most generic)

And then bind the rows (e.g. with bind_rows), but never re-sort? And then present up to 20 results (10 genes, 10 pathways, max), but probably fewer choices?
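A minimal sketch of the sequential idea above, assuming a table with approved_symbol, aka, and approved_name columns. The toy rows, the sequential_search name, and the limit argument are made up for illustration and are not the code from ddh.org:

```r
library(dplyr)
library(stringr)

# Toy table with made-up rows; only the column names follow the thread.
gene_summary <- tibble(
  approved_symbol = c("AAA1", "BBB2", "TP53", "TP53BP1"),
  aka             = c("none", "TP53L", "P53", "53BP1"),
  approved_name   = c("regulator of TP53 (made up)", "made-up gene",
                      "tumor protein p53", "tumor protein p53 binding protein 1")
)

# Hypothetical helper: search the columns from most to least specific,
# stack the hits in that order, and never re-sort afterwards.
sequential_search <- function(data, query, limit = 10) {
  pattern <- fixed(query, ignore_case = TRUE)  # literal, case-insensitive match
  by_symbol <- filter(data, str_detect(approved_symbol, pattern))
  by_aka    <- filter(data, str_detect(aka, pattern))
  by_name   <- filter(data, str_detect(approved_name, pattern))
  bind_rows(by_symbol, by_aka, by_name) %>%
    distinct() %>%   # a row matched by several steps keeps its earliest position
    head(limit)
}

result <- sequential_search(gene_summary, "tp53")
print(result$approved_symbol)
```

In this toy table the symbol matches come first, then the alias-only match, then the name-only match, even though the table itself is in alphabetical order.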

johnbradley commented 4 years ago

I created a PR to implement the suggestion above, but I don't know that it 100% solves the issue. Pathways now end up really far down the list and seem unlikely to be noticed. Should we remove approved_name searching?

Perhaps if we had several example searches, along with what users would most likely be looking for (or thinking) when performing them, it would be more obvious what to change.

matthewhirschey commented 4 years ago

The PR looks OK to me. The most popular genes by gene_id are on the methods page of ddh.org (TP53, etc.). But the normal search behavior should be: I'm looking for a gene, and therefore genes will be at the top of the list. If I'm looking for a pathway, then few genes should come up (?) and the pathways should float to the top. If "approved name" is causing too many spurious results to come up, then perhaps

  1. Search gene_name %>% (most specific)
  2. Search aka %>% (most likely alternative)
  3. Search pathway name %>%
  4. Search approved_name from gene list?

Might be a good alternative.
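One way to keep that ordering easy to rearrange is to describe each search step as data and run the steps in sequence. This is only a sketch under assumptions: the genes and pathways tables, their columns, the steps list, and run_steps are all hypothetical names with made-up rows, not the app's actual structures:

```r
library(dplyr)
library(stringr)

# Toy tables (made-up rows); the real tables and column names may differ.
genes <- tibble(
  approved_symbol = c("SIRT1", "ABC9"),
  aka             = c("SIR2L1", "SIRT-like"),
  approved_name   = c("sirtuin 1", "made-up sirtuin interactor")
)
pathways <- tibble(pathway = c("sirtuin signaling (made up)"))

# Each step names the table, the column to search, the column to report,
# and the kind of result. Reordering this list reorders the output.
steps <- list(
  list(data = genes,    column = "approved_symbol", label = "approved_symbol", type = "gene"),
  list(data = genes,    column = "aka",             label = "approved_symbol", type = "gene"),
  list(data = pathways, column = "pathway",         label = "pathway",         type = "pathway"),
  list(data = genes,    column = "approved_name",   label = "approved_symbol", type = "gene")
)

run_steps <- function(steps, query) {
  pattern <- fixed(query, ignore_case = TRUE)
  hits <- lapply(steps, function(step) {
    keep <- str_detect(step$data[[step$column]], pattern)
    tibble(type  = rep(step$type, sum(keep)),
           label = step$data[[step$label]][keep])
  })
  # Stack in step order; distinct() keeps the first occurrence of each hit.
  bind_rows(hits) %>% distinct()
}

results <- run_steps(steps, "sirt")
print(results)
```

With this layout, pathway hits land ahead of genes that only match on approved_name, which is the behavior the four-step list is aiming for.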

matthewhirschey commented 4 years ago

I also added (and committed) just now some code that will arrange each sub-table by the length of the returned query. For example:

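genes_data_symbol <- gene_summary %>%
    # keep rows whose symbol matches the search regex
    filter(str_detect(approved_symbol, find_word_start_regex)) %>%
    # character count of the first column (str_count() with its default pattern counts characters)
    mutate(length = str_count(.[[1]])) %>%
    # shortest matches first
    arrange(length) %>%
    head(limit_genes) %>%
    # drop the helper column
    select(-length)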
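For a self-contained illustration of the same pattern on toy data, here is a sketch. The regex is an assumed stand-in for find_word_start_regex (its definition is not shown in this thread), the limit of 2 is arbitrary, and the helper column is named symbol_length rather than length only to avoid shadowing the base function:

```r
library(dplyr)
library(stringr)

# Toy symbols, listed alphabetically so the exact match is not first.
gene_summary <- tibble(
  approved_symbol = c("TP53AIP1", "TP53BP1", "TP53", "TP53I3")
)

# Assumed stand-in for find_word_start_regex: match at the start of a word.
find_word_start_regex <- regex("\\bTP53", ignore_case = TRUE)

ranked <- gene_summary %>%
  filter(str_detect(approved_symbol, find_word_start_regex)) %>%
  mutate(symbol_length = str_length(approved_symbol)) %>%
  arrange(symbol_length) %>%   # shortest (closest) match first
  head(2) %>%
  select(-symbol_length)

print(ranked$approved_symbol)
```

Here the exact match TP53 (4 characters) sorts ahead of the longer symbols, so it survives the head() cut.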

By doing this, the shorter terms (and therefore the better-matched terms) are returned first. However, I now see a seemingly random error. Could you check whether you can recreate it in your branch, or whether I introduced it just now? My guess is that it is in your branch too.

Warning: Error in writeImpl: Text to be written must be a length-one character vector

[Screenshot 2020-04-22 08 52 00]

Perform these searches to recreate/test (case does not matter)

Also errors: MDM2, MDM4 (after a quick search of some of the test genes)

Can you look at this @johnbradley ?

johnbradley commented 4 years ago

Will do, @matthewhirschey. I think this error means that we are trying to display a vector of multiple items where Shiny is expecting a single item for the content of an HTML tag.
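If that reading is right, one defensive pattern is to collapse a value to a single string before it is used as tag content. The sketch below is base R only; as_single_string is a hypothetical helper and the alias values are placeholders. The thread does not show the actual fix, so this is not it:

```r
# Hypothetical helper: turn a vector of any length into one string.
as_single_string <- function(x, sep = ", ") {
  x <- as.character(x)
  x <- x[!is.na(x)]                 # drop missing values
  if (length(x) == 0) return("")    # nothing to show
  paste(x, collapse = sep)          # always a length-one character vector
}

aliases <- c("alias_one", "alias_two")  # placeholder values
label <- as_single_string(aliases)
print(label)
print(length(label))
```

The result always has length one, so it can be passed wherever a single piece of text is expected, whether the lookup returned zero, one, or several values.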

matthewhirschey commented 4 years ago

Issue fixed by @johnbradley