glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

special characters in global and simple search #946

Open sujeetvkulkarni opened 10 months ago

sujeetvkulkarni commented 10 months ago

Searching for the gene symbol "Dmel\Indy" in global search and protein simple search ->

Production behavior ->

  1. Simple Search:

https://api.glygen.org/protein/search_simple?query= {"operation":"AND","query_type":"protein_search_simple","term":"\"Dmel\Indy\"","term_category":"any"}

result count from list api : 16853 Results

  2. Global Search:

https://api.glygen.org/globalsearch/search?query= {"term":"\"Dmel\Indy\""}

result :

Status Code: 502 Proxy Error

beta behavior ->

  1. Simple Search:

https://beta-api.glygen.org/protein/search_simple?query= {"operation":"AND","query_type":"protein_search_simple","term":"\"Dmel\Indy\"","term_category":"any"}

The search_simple API returns the response below, with a result count of 0.

{
    "list_id": "",
    "query": {
        "operation": "AND",
        "query_type": "protein_search_simple",
        "term": "\"Dmel\\Indy\"",
        "term_category": "any"
    },
    "resultcount": 0
}
  2. Global Search:

https://beta-api.glygen.org/globalsearch/search?query= {"term":"\"Dmel\Indy\""}

result :

Status Code: 502 Proxy Error


Searching for the gene symbol Dmel\Indy in global search and protein simple search ->

Production behavior ->

  1. Simple Search:

https://api.glygen.org/protein/search_simple?query= {"operation":"AND","query_type":"protein_search_simple","term":"Dmel\Indy","term_category":"any"}

result: {"list_id": ""}

  2. Global Search:

https://api.glygen.org/globalsearch/search?query= {"term":"Dmel\Indy"}

result:

{
    "exact_match": [],
    "other_matches": {
        "total_match_count": 9602,
        "protein": {
            "all": {
                "list_id": "91037b4dfa0335709ce19654d9c6c2e0",
                "count": 4654
            }
        },
        "glycoprotein": {
            "all": {
                "list_id": "2f83dab09bafd85691a5c13c3b947cfc",
                "count": 147
            }
        },
        "glycan": {
            "all": {
                "list_id": "",
                "count": 0
            }
        }
    }
}

beta behavior ->

  1. Simple Search:

https://beta-api.glygen.org/protein/search_simple?query= {"operation":"AND","query_type":"protein_search_simple","term":"Dmel\Indy","term_category":"any"}

result:

{
    "list_id": "",
    "query": {
        "operation": "AND",
        "query_type": "protein_search_simple",
        "term": "\"Dmel\\Indy\"",
        "term_category": "any"
    },
    "resultcount": 0
}
  2. Global Search:

https://beta-api.glygen.org/globalsearch/search?query= {"term":"Dmel\Indy"}

result: Status Code: 502 Proxy Error

@ReneRanzinger Please comment on the expected behavior for the quoted term ("Dmel\Indy") and the un-quoted term (Dmel\Indy) in simple and global search. Apart from the result count, the time taken to perform both the quoted ("Dmel\Indy") and un-quoted (Dmel\Indy) searches needs to be checked.

@rykahsay At the very least, the current 502 Proxy Error needs to be resolved. We still need @ReneRanzinger's comments on the expected search behavior (result count and time performance) for simple and global search.

related : #629

ReneRanzinger commented 10 months ago

I think both (quoted/unquoted) should yield the same result. Dmel\Indy should in both cases be treated as "one word" to search for. If the user adds quotes to it, it's just redundant and should be ignored.

However, if the search term contains a space ("Dmel Indy"), that is different. Unquoted, it's two search words, and I think Robel's search sees them as OR. Quoted, it's "one word" and should only show entries that contain that exact phrase.

Question for @rykahsay: what happens if two words are entered? Are they searched using OR? Do we support quotes?

Also, simple and global search should show the same results (at least for the molecule class that the simple search belongs to).

sujeetvkulkarni commented 10 months ago

A few more examples:

DOI IDs in global search: 10.3390/v13040551 10.1016/j.celrep.2021.109179 10.1074/mcp.RA120.002295

These also return Status Code: 502 Proxy Error

rykahsay commented 2 months ago

Again, I am assuming Sean will experiment with this following our face-to-face discussion.

ReneRanzinger commented 2 months ago

@seankim658 @rykahsay what is the status on this?

seankim658 commented 2 months ago

To my knowledge, every search term is currently passed through this helper function before the corresponding MongoDB query is created.

In MongoDB there are three different search behaviors around multi-word text searches:

  1. If the words are enclosed in quotes (i.e. "word1 word2") then MongoDB will only search for exact matches to that phrase.
  2. If the words are not enclosed in quotes (i.e. word1 word2) then MongoDB will do an OR search looking for records that contain either of those words.
  3. If each word is individually quoted (i.e. "word1" "word2") then MongoDB will treat each one as a phrase and do an AND search.

Based on the helper function that pre-processes the search terms, the second case will never happen. If the search term is already fully enclosed in double quotes, it is returned as is (first case). If the search term is not fully enclosed in double quotes, parentheses and brackets are escaped, hyphens are removed completely, and then each word is individually enclosed in quotes (third case).
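The pre-processing steps described above can be sketched roughly as follows. This is only an illustration of the description in this comment, not the actual GlyGen helper: the function name `preprocess_term` is hypothetical, and treating "brackets" as square brackets is an assumption.

```python
import re


def preprocess_term(term: str) -> str:
    """Hypothetical sketch of the described pre-processing, not the real helper."""
    term = term.strip()

    # First case: a term already fully enclosed in double quotes is returned as is.
    if len(term) >= 2 and term.startswith('"') and term.endswith('"'):
        return term

    # Escape parentheses and (assumed: square) brackets with a backslash.
    term = re.sub(r"([()\[\]])", r"\\\1", term)

    # Remove hyphens completely.
    term = term.replace("-", "")

    # Third case: enclose each whitespace-separated word in its own quotes.
    return " ".join(f'"{word}"' for word in term.split())
```

Under this sketch, an un-quoted `Dmel\Indy` comes out as `"Dmel\Indy"`, which would be consistent with the quoted `term` echoed back in the beta simple search response for the un-quoted query.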

In MongoDB both the AND and OR style text searches can have significant overhead. For the OR search, each search term is searched for independently, meaning that each search has to do its own index scan. For example, if the first search for word1 finds that record 500 has word1, the second index scan searching for word2 will still have to check that record for the existence of word2 (even though it's already been identified by the first search as a part of the result set). The two result sets are then merged and duplicate records are filtered out, adding more overhead. The AND search is even less efficient, requiring a second pass to perform the result set intersection during merging.
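As a toy illustration of the merge steps described above (one lookup per word, then a union with de-duplication for OR, or an intersection for AND), here is a small sketch over a made-up in-memory index. It is not how MongoDB is implemented; the index contents and function names are invented for the example.

```python
# Toy inverted index: word -> set of record ids (made-up data).
TOY_INDEX = {
    "word1": {500, 501, 502},
    "word2": {500, 600},
}


def or_search(index, words):
    """One lookup per word, then merge; the set union drops duplicates."""
    result = set()
    for word in words:
        result |= index.get(word, set())
    return result


def and_search(index, words):
    """One lookup per word, then keep only records present in every result set."""
    result_sets = [index.get(word, set()) for word in words]
    if not result_sets:
        return set()
    return set.intersection(*result_sets)
```

Record 500 appears in both per-word result sets, so the OR merge has to de-duplicate it, while the AND merge keeps only that record.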

Regarding scenarios 1 & 2 from Sujeet's original comment:

Since the search term is already fully enclosed in double quotes, it goes through no pre-processing and is used as is. I don't have GlyGen server access, so I can't check the logs, but I'm fairly sure the difference comes from the query type used by simple search vs. search. The /search endpoints use a MongoDB text search; the \I is being interpreted as an invalid escape sequence, and that is why the 502 error is happening. The /search_simple endpoints use a regex search. I think the regex engine is treating the \ as a literal, since it determines that \I is not part of a valid escape sequence.
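One thing that can be checked without server access is the JSON payload itself: a lone backslash followed by `I` is not a valid JSON escape, so the query text exactly as written in the URLs above does not parse as JSON. The sketch below shows that with Python's standard library and builds a URL with the backslash escaped and the payload percent-encoded. Whether this is what the GlyGen endpoints expect, or whether it resolves the 502, is an assumption; `build_search_url` and `is_valid_json` are hypothetical helper names.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def build_search_url(base_url: str, term: str) -> str:
    """Serialize the term with json.dumps (escaping the backslash) and URL-encode it."""
    payload = json.dumps({"term": term})
    return f"{base_url}?{urlencode({'query': payload})}"


# The payload exactly as typed in the issue: {"term":"Dmel\Indy"}
raw_payload = '{"term":"Dmel\\Indy"}'

# The same term sent with the backslash escaped by json.dumps.
url = build_search_url("https://api.glygen.org/globalsearch/search", "Dmel\\Indy")
decoded = json.loads(parse_qs(urlsplit(url).query)["query"][0])
```

Here `is_valid_json(raw_payload)` is false, while the payload recovered from `url` decodes back to the original term with a single backslash.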