Open sujeetvkulkarni opened 10 months ago
I think both (quoted/unquoted) should yield the same result. Dmel\Indy should in both cases be treated as "one word" to search for. If the user adds quotes to it, it's just redundant and should be ignored.
However, if the search term contains a space ("Dmel Indy"), that is different. Unquoted, it's two search words, and I think Robel's search treats them as OR. Quoted, it's "one word" and should only show entries that contain that exact phrase.
Question for @rykahsay: what happens if two words are entered? Are they searched using OR? Do we support quotes?
Also, simple and global search should show the same results (at least for the molecule class that the simple search belongs to).
A few more examples: the following DOI IDs in global search
10.3390/v13040551 10.1016/j.celrep.2021.109179 10.1074/mcp.RA120.002295
also return Status Code: 502 Proxy Error
Again, I am assuming Sean will experiment with this following our face to face discussion
@seankim658 @rykahsay what is the status on this?
Right now every search term is passed through this helper function (to my knowledge) before the corresponding mongodb query is created.
In mongodb there are three different search behaviors around multi-word text searches:
1. If the whole search term is enclosed in one pair of double quotes ("word1 word2"), then mongodb will only search for exact matches to that phrase.
2. If the words are not quoted (word1 word2), then mongodb will do an AND search looking for records that contain both of those words.
3. If each word is individually quoted ("word1" "word2"), then mongodb will do an OR search.
Based on the helper function that pre-processes the search terms, the second case will never happen. If the search term is already fully enclosed in double quotes, it is returned as is (first case). If the search term is not fully enclosed in double quotes, parentheses and brackets are escaped, hyphens are removed completely, and then each word is individually enclosed in quotes (third case).
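The pre-processing described above can be sketched roughly as follows. This is a minimal illustration written from the description only, not the actual GlyGen helper: the function name preprocess_search_term is hypothetical, and it assumes "brackets" means square brackets and that escaping means prefixing a backslash.

```python
import re


def preprocess_search_term(term: str) -> str:
    """Hypothetical sketch of the described pre-processing, not the real helper."""
    term = term.strip()

    # First case: a term fully enclosed in double quotes is returned as is.
    if len(term) >= 2 and term.startswith('"') and term.endswith('"'):
        return term

    # Escape parentheses and square brackets by prefixing a backslash (assumption).
    term = re.sub(r"([()\[\]])", r"\\\1", term)

    # Remove hyphens completely.
    term = term.replace("-", "")

    # Third case: enclose each whitespace-separated word in its own quotes.
    return " ".join(f'"{word}"' for word in term.split())


print(preprocess_search_term('"Dmel Indy"'))  # "Dmel Indy"
print(preprocess_search_term("Dmel Indy"))    # "Dmel" "Indy"
```

Under these assumptions an unquoted multi-word term always ends up in the individually quoted form, so the plain unquoted form never reaches mongodb.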
In mongodb both the AND and OR style text searches can have decently significant overhead. For the OR search, each search term is searched for independently, meaning that both searches have to do an index scan. For example, if the first search for word1 finds that record 500 has word1, the second index scan searching for word2 will still have to check that record for the existence of word2 (even though it's already been identified by the first search that it should be a part of the result set). The two result sets are then merged and duplicate records are filtered out, adding more overhead. The AND search is even less efficient, requiring a second pass to perform the result set intersection during merging.
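As a toy model of the merging described above, the sketch below builds a tiny in-memory inverted index and combines per-word result sets by union (OR) or intersection (AND). It is only an illustration of the idea, not a description of mongodb internals; the names build_index, or_search, and and_search and the sample records are made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of record ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def or_search(index, words):
    """One lookup per word; the union merges results and drops duplicates."""
    result = set()
    for word in words:
        result |= index.get(word, set())
    return result


def and_search(index, words):
    """One lookup per word, then an extra intersection step over all result sets."""
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings) if postings else set()


docs = {500: "word1 word2", 501: "word1 only", 502: "only word2"}
index = build_index(docs)
print(sorted(or_search(index, ["word1", "word2"])))   # [500, 501, 502]
print(sorted(and_search(index, ["word1", "word2"])))  # [500]
```

Record 500 is found by both lookups, so it appears in both per-word result sets and is only deduplicated when the sets are combined.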
Regarding scenarios 1 & 2 from Sujeet's original comment:
Since the search term is already fully enclosed by double quotes, it goes through no pre-processing and is used as is. I don't have GlyGen server access so I can't check the logs, but I'm pretty sure the cause is the difference in query type between simple search and search. The /search endpoints use a mongodb text search, and the \I is being interpreted as an invalid escape sequence, which is why the 502 error is happening. The /search_simple endpoints use a regex search. I think the regex engine is treating the \ as a literal since it determines \I is not a part of a valid escape sequence.
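One place where \I is certainly an invalid escape is JSON itself: a JSON string only allows a backslash before ", \, /, b, f, n, r, t, or u. The sketch below shows that a query body containing a single backslash fails to decode, while the doubled form decodes to the intended term. Whether this is where the 502 actually originates on the server is an assumption, not something confirmed from logs; parse_query is a hypothetical helper for the illustration.

```python
import json
import re


def parse_query(raw):
    """Return the decoded query dict, or None if raw is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Query body with a single backslash before "I", as typed in the issue.
raw_single = '{"term":"Dmel\\Indy"}'
# Same query with the backslash doubled, as JSON requires.
raw_double = '{"term":"Dmel\\\\Indy"}'

print(parse_query(raw_single))  # None
print(parse_query(raw_double))  # {'term': 'Dmel\\Indy'}

# Building the body with json.dumps doubles the backslash automatically.
term = "Dmel\\Indy"
encoded = json.dumps({"term": term})

# For a regex-based search, re.escape makes the backslash match literally.
pattern = re.compile(re.escape(term))
print(bool(pattern.search("gene symbol Dmel\\Indy")))  # True
```

If the client builds the query with a JSON encoder and the server escapes the term before using it in a regex, the backslash in Dmel\Indy is handled as a plain character on both sides.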
Searching for the gene symbol "Dmel\Indy" in global search and protein simple search ->
Production behavior ->
https://api.glygen.org/protein/search_simple?query= {"operation":"AND","query_type":"protein_search_simple","term":"\"Dmel\Indy\"","term_category":"any"}
result count from list api : 16853 Results
https://api.glygen.org/globalsearch/search?query= {"term":"\"Dmel\Indy\""}
result :
Status Code: 502 Proxy Error
beta behavior ->
https://beta-api.glygen.org/protein/search_simple?query= {"operation":"AND","query_type":"protein_search_simple","term":"\"Dmel\Indy\"","term_category":"any"}
The search_simple API returns a response with a result count of 0.
https://beta-api.glygen.org/globalsearch/search?query= {"term":"\"Dmel\Indy\""}
result :
Status Code: 502 Proxy Error
Searching for the gene symbol Dmel\Indy in global search and protein simple search ->
Production behavior ->
https://api.glygen.org/protein/search_simple?query= {"operation":"AND","query_type":"protein_search_simple","term":"Dmel\Indy","term_category":"any"}
result: {"list_id": ""}
https://api.glygen.org/globalsearch/search?query= {"term":"Dmel\Indy"}
result:
beta behavior ->
https://beta-api.glygen.org/protein/search_simple?query= {"operation":"AND","query_type":"protein_search_simple","term":"Dmel\Indy","term_category":"any"}
result:
https://beta-api.glygen.org/globalsearch/search?query= {"term":"Dmel\Indy"}
result: Status Code: 502 Proxy Error
@ReneRanzinger Please comment on the expected behavior for the quoted term ("Dmel\Indy") and the unquoted term (Dmel\Indy) in simple and global search. Apart from the result count, the time taken to perform both the quoted ("Dmel\Indy") and unquoted (Dmel\Indy) searches needs to be checked.
@rykahsay At least the current behavior of 502 Proxy Error needs to be resolved. But we would need @ReneRanzinger's comments on the actual search result behavior (count and time performance) for simple and global search.
related : #629