Closed krassowski closed 5 years ago
This is a good idea. How would you address large result sets caused by too unspecific queries, and how would you rank results by relevance?
On Fri, Sep 15, 2017 at 7:57 AM krassowski notifications@github.com wrote:
Currently the search bar uses only gene symbol (HGNC) and mRNA refseqs to look up a gene of interest.
I caught myself trying to find a protein using its full name (and, when that failed, intuitively looking for its SwissProt accession) even though I know that this is not implemented.
The advanced protein search could have an optional "full text search", i.e. searching in protein descriptions from NCBI. To make users more confident about why they see particular results, we could provide:
- a small line indicating where the query string was found ("matched in protein description" / "matched in gene name")
- checkboxes allowing user to decide what to include in the search (with "protein descriptions" - i.e. full text search - disabled by default and "protein name", "gene name", "gene symbol", "Uniprot id", "refseq" enabled)
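The two points above could be sketched roughly as follows. This is only an illustration of the idea: the `ProteinRecord` class, its field names, and the `search` function are hypothetical and do not reflect the actual ActiveDriverDB models.

```python
from dataclasses import dataclass


# Hypothetical record; the fields mirror the proposed checkboxes,
# not the real database schema.
@dataclass
class ProteinRecord:
    gene_symbol: str
    gene_name: str
    protein_name: str
    uniprot_id: str
    refseq: str
    description: str = ""


# Fields enabled by default; "description" (full text search) is opt-in.
DEFAULT_FIELDS = ("protein_name", "gene_name", "gene_symbol", "uniprot_id", "refseq")

# Short line shown next to each result to explain why it matched.
LABELS = {
    "protein_name": "matched in protein name",
    "gene_name": "matched in gene name",
    "gene_symbol": "matched in gene symbol",
    "uniprot_id": "matched in Uniprot id",
    "refseq": "matched in refseq",
    "description": "matched in protein description",
}


def search(records, query, fields=DEFAULT_FIELDS):
    """Return (record, label) pairs; the label names the first enabled field that matched."""
    needle = query.strip().lower()
    if not needle:
        return []
    hits = []
    for record in records:
        for field in fields:
            if needle in getattr(record, field).lower():
                hits.append((record, LABELS[field]))
                break  # report only the first matching field per record
    return hits
```

Passing `fields=DEFAULT_FIELDS + ("description",)` would correspond to ticking the "protein descriptions" checkbox.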
@reimand0 I propose to add this functionality. I will wait some time for feedback (such as other things to search by) or suggestions.
Please find some ideas addressing these concerns below.
For search bar
I will require a >50% match to start autocompletion (most users will just paste the Uniprot id; at least that seems a sane assumption), so for P04637 we can show one or two matches (sorted by edit distance) once we detect a string longer than three characters (P046). And if the user is typing a refseq id or gene name, there is little chance that a Uniprot ID shows up.
For advanced search
We can show everything, sorted by edit distance. The user will still be able to change the scope of the search. We can use pagination if there are too many results.
For search bar
We may just use prefix search, so that we get results for simple cases like "titin" and do not annoy the end user ("you cannot just find titin? how come!"). We would show one or two autocompletion results, sorted by edit distance, if the user entered more than 3 characters (HGNC codes are short, so we will show suggestions for these immediately; with a longer string, there is less chance that we will accidentally match both an HGNC name and a full gene name).
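A minimal sketch of this prefix-based autocompletion, assuming two plain lists of strings (`symbols` for short HGNC-style symbols and `full_names` for full gene names); the function name and parameters are hypothetical. For a prefix match, the edit distance to the candidate equals the number of characters still left to type, so sorting by length difference is enough here.

```python
def autocomplete(query, symbols, full_names, limit=2):
    """Suggest up to `limit` prefix matches, closest first."""
    needle = query.strip().lower()
    if not needle:
        return []
    # Short symbols are suggested immediately, from the first character on.
    candidates = [s for s in symbols if s.lower().startswith(needle)]
    # Full gene names are only considered once more than 3 characters were typed.
    if len(needle) > 3:
        candidates += [n for n in full_names if n.lower().startswith(needle)]
    # For prefix matches: edit distance == remaining characters; ties broken alphabetically.
    candidates.sort(key=lambda c: (len(c) - len(needle), c))
    return candidates[:limit]
```

With `symbols = ["T", "TTN", "TP53", "TP63"]` and `full_names = ["titin"]`, typing `T` suggests `T` and `TTN`, while `titi` suggests `titin`.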
For advanced search
These can use full text search for gene names. Again, sorting by edit distance and pagination should suffice.
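As a rough illustration of "sorting by edit distance and pagination", here is a sketch using a textbook Levenshtein distance and a simple substring match standing in for full text search. The `search_page` helper and its parameters are assumptions made for this example, not the code that was later merged.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def search_page(query, names, page=1, per_page=2):
    """Return (results on the requested page, total number of matches)."""
    needle = query.strip().lower()
    if not needle:
        return [], 0
    # Substring match as a stand-in for a real full text search.
    matches = [n for n in names if needle in n.lower()]
    # Closest names first; ties broken alphabetically for stable paging.
    matches.sort(key=lambda n: (levenshtein(needle, n.lower()), n))
    start = (page - 1) * per_page
    return matches[start:start + per_page], len(matches)
```

Returning the total alongside the page lets the interface show how many further pages exist without running the query again.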
The feature is now implemented.
Please see: https://beta.activedriverdb.org/search/proteins/ and let me know if there is anything to change before deploying on the production server. Edit: there is no pagination.
Looks good, although it may need some systematic testing. Currently we can also search by just one letter and it returns a few top results. Would it make sense to search only three or more letters?
Tests for the new feature-based search are in test_gene_search.py, while those for autocompletion as a whole are in test_search.py.
Currently we can also search by just one letter and it returns a few top results. Would it make sense to search only three or more letters?
No, I do not think so. This would make the interface and code base less consistent and more complicated.
For example, there is a gene "T", and it is perfectly reasonable to show it immediately to a user interested in analysing such a gene. Also, it does not seem to be a problem in terms of server load, at least none that I am aware of; compared to other features where we query the database heavily, this is rather lightweight.
To reduce the number of unnecessary queries, I have already implemented a 200 ms delay for searching (in 410e5e9528a30612086140391d313201b00fda86), so we wait for the user to finish typing before sending the query. This is of course an arbitrary value and won't be of much help for slowly typing users, but it should prevent jittering results for some users.
I merged the current version into master, but I am not closing the issue, as this feature can be further improved.
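To illustrate the effect of such a delay, here is a small simulation of which queries would actually be sent for a given sequence of keystrokes. It is a sketch only: the function name and the event format are hypothetical, and it does not reproduce the code from the referenced commit.

```python
def debounced_queries(keystrokes, delay_ms=200):
    """Simulate a debounce timer.

    keystrokes: list of (timestamp_ms, text_in_search_box), sorted by time.
    Returns a list of (send_time_ms, text) for the queries that get sent.
    """
    sent = []
    for index, (timestamp, text) in enumerate(keystrokes):
        is_last = index == len(keystrokes) - 1
        # Each keystroke restarts the timer; the query is sent only if no
        # newer keystroke arrives before the delay has elapsed.
        if is_last or keystrokes[index + 1][0] - timestamp >= delay_ms:
            sent.append((timestamp + delay_ms, text))
    return sent
```

A fast typist entering "titin" with under 200 ms between keys triggers a single query for the final text, whereas a slow typist pausing longer than the delay after each key triggers one query per keystroke.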
Closing for now, please feel free to reopen if any improvement is needed.