oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

Search API returns different results from Web UI #3239

Open ghost opened 3 years ago

ghost commented 3 years ago

Describe the bug: The REST API (api/v1/search) returns different results from the web UI for the same query.

Environments:

To Reproduce: Searching in the web UI reports "Searched +full:google +refs:google (Results 25801 – 25802 of 25802) sorted by relevance". The same search through the REST API reports:

curl 'https://<grok-server>/api/v1/search?full=google&defs=&refs=google&path=&hist=&type=&searchall=true&start=0&maxresults=1' | python -m json.tool
{
    "time": 1170,
    "resultCount": 38082,
    "startDocument": 0,
    "endDocument": 0,
    "results": {
...

Expected behavior: The web UI and the API should return the same results for the same search condition.
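For anyone reproducing the API side of this comparison, here is a minimal Python sketch that builds the same query as the curl command and reads the count from the response. The helper names (`build_search_url`, `result_count`) and the host are hypothetical; only the `api/v1/search` path, the parameter names, and the `resultCount` field come from the report above.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url, **params):
    """Build an api/v1/search URL from keyword arguments (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/api/v1/search?{urlencode(params)}"


def result_count(response_text):
    """Return the resultCount field of a search API JSON response."""
    return json.loads(response_text)["resultCount"]


# Same parameters as the curl command; the host is a placeholder.
url = build_search_url(
    "https://grok.example.com",
    full="google", defs="", refs="google", path="", hist="", type="",
    searchall="true", start=0, maxresults=1,
)
print(url)
```

The count returned by `result_count` can then be compared with the total shown in the "Results … of …" line of the web UI.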

vladak commented 3 years ago

How do you run the indexer? Do you have projects enabled?

ghost commented 3 years ago

Both the web UI and the API requests went to the same OpenGrok instance, using the same account. All projects were included in the search, so this issue should have nothing to do with the indexer.

vladak commented 3 years ago

There is #3170; that's why I am asking about projects and the indexer.

ghost commented 3 years ago

2020-09-25 08:38:15.698+0000 INFO t1 Indexer.parseOptions: Indexer options: [
-v, --displayRepositories, off, --optimize, on, -r, uionly, -H, -S, --depth, 99, --progress, -c, /usr/bin/ctags, -o, /var/opengrok/conf/ctags/config, -m, 256, --leadingWildCards, on, -R, configuration.ro.xml, -W, configuration.xml, -P, -U, http://localhost:9080/vanilla_android, -s, /var/opengrok/stage1/src, -d, /var/opengrok/stage1/data
]

ghost commented 3 years ago

I got more results from the API than from the web UI.

vladak commented 8 months ago

Tried to replicate this with 1.12.28 using the AOSP source code and a full-text search for 'google' (http://localhost:8080/source/api/v1/search?projects=AOSP&full=google&maxresults=200000). Using the API I got "resultCount":41556, while the web UI reported far fewer: several thousand results.

Interestingly, when I refreshed the first result page, the result count was almost always different; it seems to cycle through a small set of numbers. Even more surprising, when clicking through result pages 1, 2, 3, and so on, the total number of results reported grew with each higher page number. The last page of the results, page 3810, reported 95241 total results, and that total did not change when the page was refreshed. Based on this experience, I ran the API call multiple times to see whether its count would change; it remained the same.
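The repeated-call check described above can be scripted. This is a rough Python sketch, not what was actually run: `fetch_result_count` and `count_distribution` are hypothetical helper names, and the only thing taken from the thread is that the API response is JSON with a `resultCount` field.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_result_count(url):
    """Fetch one search API response and return its resultCount."""
    with urlopen(url) as response:
        return json.load(response)["resultCount"]


def count_distribution(fetch, attempts=10):
    """Call fetch() repeatedly and tally how often each count is seen."""
    return Counter(fetch() for _ in range(attempts))
```

Calling `count_distribution(lambda: fetch_result_count(url))` with the API URL above would yield a single key if the count is stable, as observed for the API, and several keys if it fluctuates between calls.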

vladak commented 8 months ago

There is quite a difference in how the search is done between the web UI and the API. In the API, the SearchController ultimately uses the SearchEngine class (via the SearchEngineWrapper subclass of the SearchController class). This class grabs the IndexSearcher (Lucene) using https://github.com/oracle/opengrok/blob/b4a9940090f2c2cb8e8db97f6e7ca901455c6ed1/opengrok-indexer/src/main/java/org/opengrok/indexer/search/SearchEngine.java#L181 (where SuperIndexSearcher is a super class wrapping IndexSearcher for the purpose of "bumping" the related IndexReader after reindex so that newly indexed data can be displayed in search results) or https://github.com/oracle/opengrok/blob/b4a9940090f2c2cb8e8db97f6e7ca901455c6ed1/opengrok-indexer/src/main/java/org/opengrok/indexer/search/SearchEngine.java#L202-L203 for project-less and project searches, respectively. The difference is that in project-less mode the IndexSearcher is reused, while with projects it is created from scratch.

The query is created from the API arguments using https://github.com/oracle/opengrok/blob/b4a9940090f2c2cb8e8db97f6e7ca901455c6ed1/opengrok-indexer/src/main/java/org/opengrok/indexer/search/SearchEngine.java#L154-L160. The search results are collected using TopScoreDocCollector (Lucene). The results are then processed by SearchEngine#results(), which can actually perform a re-query, i.e. run the search once again. This is also where any context is fetched from the index and source and added to the Hit objects that are then returned in a list. The result count comes from the hits length. The hits object is acquired here: https://github.com/oracle/opengrok/blob/b4a9940090f2c2cb8e8db97f6e7ca901455c6ed1/opengrok-indexer/src/main/java/org/opengrok/indexer/search/SearchEngine.java#L219

vladak commented 8 months ago

The web UI uses the SearchHelper class like so: https://github.com/oracle/opengrok/blob/b4a9940090f2c2cb8e8db97f6e7ca901455c6ed1/opengrok-web/src/main/webapp/search.jsp#L86. The IndexSearcher is acquired in SearchHelper#prepareExec(): https://github.com/oracle/opengrok/blob/b4a9940090f2c2cb8e8db97f6e7ca901455c6ed1/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java#L400-L402 and then used in executeQuery(): https://github.com/oracle/opengrok/blob/b4a9940090f2c2cb8e8db97f6e7ca901455c6ed1/opengrok-indexer/src/main/java/org/opengrok/indexer/web/SearchHelper.java#L478-L479. The collected and summarized results are then embedded in the page: https://github.com/oracle/opengrok/blob/b4a9940090f2c2cb8e8db97f6e7ca901455c6ed1/opengrok-web/src/main/webapp/search.jsp#L227-L228 and aggregated by directory: https://github.com/oracle/opengrok/blob/b4a9940090f2c2cb8e8db97f6e7ca901455c6ed1/opengrok-indexer/src/main/java/org/opengrok/indexer/search/Results.java#L109-L110. The number of results reported near the top of the page comes from the totalHits field, as visible above. Unlike the way hits are extracted for the API in the SearchEngine, there is no collector involved.

The API uses Lucene's public void search(Query query, Collector results) while the web UI uses public TopFieldDocs search(Query query, int n, Sort sort).