reimandlab / ActiveDriverDB

ActiveDriverDB
GNU Lesser General Public License v2.1

Search by protein and gene name and by Uniprot/Swissprot ID + advanced search #136

Closed krassowski closed 5 years ago

krassowski commented 6 years ago

Currently the search bar uses only gene symbol (HGNC) and mRNA refseqs to look up a gene of interest.

I caught myself trying to find a protein by its full name (and, when that failed, intuitively looking for its SwissProt accession), even though I know that neither is implemented.

The advanced protein search could have an optional "full text search", i.e. searching in protein descriptions from NCBI. To make it clearer to users why they see particular results, we could provide:

  • a small line indicating where the query string was found ("matched in protein description" / "matched in gene name")
  • checkboxes allowing the user to decide what to include in the search (with "protein descriptions", i.e. full text search, disabled by default and "protein name", "gene name", "gene symbol", "Uniprot id", "refseq" enabled)

@reimand0 I propose to add this functionality. I will wait some time for feedback (such as other things to search by) or suggestions.

reimand0 commented 6 years ago

This is a good idea. How would you address large result sets caused by too unspecific queries, and how would you rank results by relevance?

krassowski commented 6 years ago

> This is a good idea. How would you address large result sets caused by too unspecific queries, and how would you rank results by relevance?

Please find some ideas addressing these concerns below.

  1. Indeed, there may be many protein isoforms mapped to a single Uniprot ID.

For the search bar: I will require a >50% match to start autocompletion (most users will just paste the Uniprot ID; at least it sounds sane to assume that), so for P04637 we can show one or two matches (sorted by edit distance) once we detect a string longer than three characters (P046). And if the user is typing a refseq ID or gene name, there is little chance that a Uniprot ID shows up.

For the advanced search: we can show everything, sorted by edit distance. The user will still be able to change the scope of the search. We can use pagination if there are too many results.

  2. For gene names there may be many matches too.

For the search bar: we may just use prefix search, so that we get results for simple cases like "titin" and do not annoy the end user ("you cannot just find titin? how come!"). We would show one or two autocompletion results, sorted by edit distance, if the user entered more than three characters. HGNC codes are short, so we will show suggestions for these immediately; with a longer string, there is less chance that we will accidentally match both an HGNC name and a full gene name.

For the advanced search: these can use full text search for gene names. Again, sorting by edit distance and pagination should suffice.
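
A minimal sketch of the search-bar rule proposed above, under stated assumptions: the ">50% match" is read here as "the typed prefix covers more than half of the identifier's length", and `edit_distance` and `suggest` are hypothetical helper names for illustration, not functions from the ActiveDriverDB code base. The edit distance is a plain Levenshtein distance; the real implementation may use a different metric or a database-side query.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def suggest(query, identifiers, min_fraction=0.5, limit=2):
    """Hypothetical helper: prefix-match identifiers (case-insensitive),
    keep those for which the query covers more than `min_fraction` of the
    identifier, and return at most `limit` of them, closest first."""
    query = query.upper()
    candidates = [
        identifier for identifier in identifiers
        if identifier.upper().startswith(query)
        and len(query) > min_fraction * len(identifier)
    ]
    # Ties in edit distance are broken alphabetically to keep output stable.
    candidates.sort(
        key=lambda identifier: (edit_distance(query, identifier.upper()), identifier)
    )
    return candidates[:limit]


# Illustrative identifiers only; not taken from the database.
accessions = ["P04637", "P04626", "P04049"]
print(suggest("P04", accessions))   # three of six characters: no suggestions yet
print(suggest("P046", accessions))  # four of six characters: suggestions start
```

With this reading, a six-character accession such as P04637 produces suggestions from the fourth character onwards, which matches the "P046" example; the `limit=2` default mirrors the "one or two matches" mentioned above.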

krassowski commented 6 years ago

The feature is now implemented.

Please see: https://beta.activedriverdb.org/search/proteins/ and let me know if there is anything to change before deploying on the production server. Edit: there is no pagination.

reimand0 commented 6 years ago

Looks good, although it may need some systematic testing. Currently we can also search by just one letter and it returns a few top results. Would it make sense to search only three or more letters?


krassowski commented 6 years ago

Tests for the new feature-based search are in test_gene_search.py; tests for autocompletion as a whole are in test_search.py.

> Currently we can also search by just one letter and it returns a few top results. Would it make sense to search only three or more letters?

No, I do not think so. This would make the interface and the code base less consistent and more complicated.

For example, there is a gene "T", and it is perfectly reasonable to show it immediately to a user interested in analysing that gene. Also, it does not seem to be a problem in terms of server load, at least none that I am aware of; compared to other features where we query the database heavily, this is rather lightweight.

To reduce the number of unnecessary queries, I have already implemented a 200 ms delay for searching (in 410e5e9528a30612086140391d313201b00fda86), so we wait for the user to finish typing before sending the query. This is of course an arbitrary value and won't be of much help for slowly typing users, but it should prevent the results from jittering for some users.
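
To illustrate why the 200 ms wait helps fast typists but not slow ones, here is a small simulation of the debounce logic on timestamped keystrokes. It is only a sketch: `debounced_queries` is a hypothetical name, the keystroke timings are made up, and the actual change in the commit above is not reproduced here (it presumably lives in the client-side code rather than in Python).

```python
def debounced_queries(keystrokes, delay_ms=200):
    """Return the (send_time_ms, text) pairs a debounced search box would emit.

    `keystrokes` is a list of (timestamp_ms, text_in_box) pairs sorted by time.
    A query is sent only when no further keystroke arrives within `delay_ms`
    of the previous one.
    """
    sent = []
    for index, (timestamp, text) in enumerate(keystrokes):
        is_last = index == len(keystrokes) - 1
        # The pending timer fires unless the next keystroke resets it first.
        if is_last or keystrokes[index + 1][0] - timestamp >= delay_ms:
            sent.append((timestamp + delay_ms, text))
    return sent


# Made-up timings: a fast typist and a slow typist.
fast = [(0, "T"), (80, "TP"), (150, "TP5"), (230, "TP53")]
slow = [(0, "T"), (300, "TP"), (650, "TP5")]

print(debounced_queries(fast))  # a single query, for the final text
print(debounced_queries(slow))  # one query per keystroke
```

For the fast typist only the final text is queried, whereas the slow typist still triggers a query after every keystroke, which is the limitation noted above.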

krassowski commented 6 years ago

I merged the current version into master, but I am not closing the issue, as this feature can be improved further.

krassowski commented 5 years ago

Closing for now, please feel free to reopen if any improvement is needed.