oracle / opengrok

OpenGrok is a fast and usable source code search and cross reference engine, written in Java
http://oracle.github.io/opengrok/

UI and API /search return vastly different result sizes in project-less mode #3170

Open mkboudreau opened 4 years ago

mkboudreau commented 4 years ago

Describe the bug

The API and the UI do not seem to return the same results. Maybe I'm missing something, but when I run the exact same search from the UI and from the API, I get 125 results from the API and 9,708 from the UI. Something does not seem right.

To Reproduce

Try /search?... and /api/v1/search?... with all the same query parameters. In my test only full= was set.

Expected behavior

I expect the UI and the REST API to return the same results.

Screenshots

URL: http://local-opengrok/search?full=test&defs=&refs=&path=&type=

image

URL: http://local-opengrok/api/v1/search?full=test&defs=&refs=&path=&type=

curl -s 'http://local-opengrok/api/v1/search?full=test&defs=&refs=&path=&type=' | jq .resultCount
125
curl -s 'http://local-opengrok/api/v1/search?full=test&defs=&refs=&path=&type=' | jq 
{
  "time": 167,
  "resultCount": 125,
  "startDocument": 0,
  "endDocument": 124,
  "results": {
...
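
For anyone repeating the comparison, the two URLs can be generated from a single parameter set so that the queries cannot drift apart. This is only a small sketch: the host and parameter names are copied from the report, and the helper name `search_urls` is made up for illustration.

```python
from urllib.parse import urlencode

# Host exactly as it appears in the report; adjust for your own deployment.
BASE = "http://local-opengrok"


def search_urls(params, base=BASE):
    """Return (ui_url, api_url) built from the same query parameters."""
    query = urlencode(params)  # empty values are kept as "name="
    return f"{base}/search?{query}", f"{base}/api/v1/search?{query}"


# Same parameters as in the report: only full= carries a value.
ui_url, api_url = search_urls(
    {"full": "test", "defs": "", "refs": "", "path": "", "type": ""}
)
print(ui_url)
print(api_url)
```

The API URL can then be passed to curl and jq as shown above, and the UI URL opened in a browser.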
vladak commented 4 years ago

When you do the search in the UI, do you have any projects selected ? (assuming you are running with projects enabled)

mkboudreau commented 4 years ago

> When you do the search in the UI, do you have any projects selected ? (assuming you are running with projects enabled)

The screenshot shows the entire UI; there is nothing else on the screen. I do not see anything relating to "projects".

vladak commented 4 years ago

How do you run the indexer ?

mkboudreau commented 4 years ago

The indexer runs in a container and, upon completion, bounces the web app container. Both containers point to the same data and source dirs.

opengrok-indexer \
    -j /usr/bin/java \
    -J=-Djava.util.logging.config.file=/var/opengrok/logging.properties \
    -J=-Xms6g -J=-Xmx6g \
    ${JMX_OPTIONS} ${HEAPDUMP_OPTIONS} \
    -a /opt/opengrok/lib/opengrok.jar \
    -- \
    --verbose \
    --progress \
    --assignTags \
    --source /opengrok/sources \
    --dataRoot /opengrok/data \
    --renamedHistory on \
    --memory 256 \
    -i node_modules -i vendor -i *.dll -i *.so -i *.exe -i *.jar -i *.gz

Why do you think the indexer would cause the REST API and the web UI to return different results?

vladak commented 4 years ago

I was asking because of the projects. You're running the indexer with projects disabled and it might be relevant for root causing this issue.

mkboudreau commented 4 years ago

ok, thank you for the clarification. please let me know if there is anything else you need from me.

mkboudreau commented 4 years ago

@vladak any luck duplicating this issue?

vladak commented 4 years ago

Tried with simple project-less setup, could not reproduce it. Is there something special about those 125 search hits ?

mkboudreau commented 4 years ago

I just reran the queries and I cannot see anything special. It appears to be a subset, but I see a variety of file types, directory structures, etc. They all seem valid, except for there being 125 instead of ~140,000. The indexer indexes a local directory of all our organization's git repos in the format /org/repo1, /org/repo2, and so on.

Other searches I've run against the REST endpoint (i.e. /api/v1/search?full=somesearch) are also capped at 125.

idodeclare commented 4 years ago

> I was asking because of the projects. You're running the indexer with projects disabled and it might be relevant for root causing this issue.

@vladak, you are correct: when projects are disabled there seems to be an erroneous "double paging" going on. SearchEngine in a project-less configuration filters records early, even though /api/v1/search will also try to do so later. The number of results in the SearchEngine paging is by default numHitsPerPage * cachePages, or 25 * 5 = 125.

The projects-enabled search by SearchEngine however also seems undesirably expensive in that it manifests every document found even though /api/v1/search will later filter to a page of (by default) 1000 results.
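
To make the two behaviours concrete, here is a toy model of the paging described in this comment. It is not OpenGrok code: the function names are hypothetical, and only the default values (25, 5, 1000) come from the thread.

```python
# Defaults quoted in the thread; everything else here is a hypothetical model.
HITS_PER_PAGE = 25       # numHitsPerPage
CACHE_PAGES = 5          # cachePages
API_MAX_RESULTS = 1000   # default page size of /api/v1/search


def engine_page(hits, projects_enabled):
    """First stage: what the search engine hands to the API layer."""
    if projects_enabled:
        # Every matching document is materialized, however many there are.
        return list(hits)
    # Project-less: results are already cut to one cached window.
    return list(hits[:HITS_PER_PAGE * CACHE_PAGES])


def api_page(engine_hits, start=0, max_results=API_MAX_RESULTS):
    """Second stage: the API slices out its own page."""
    return engine_hits[start:start + max_results]


def api_result_count(total_hits, projects_enabled):
    """Number of results the API reports after both stages."""
    hits = range(total_hits)
    return len(api_page(engine_page(hits, projects_enabled)))


print(api_result_count(9708, projects_enabled=False))  # 125
print(api_result_count(9708, projects_enabled=True))   # 1000
```

In this model the project-less path can never return more than 125 documents no matter what page size the API asks for, while the projects-enabled path builds the full hit list only to discard most of it.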

mkboudreau commented 4 years ago

@vladak given the investigation @idodeclare has done, this feels like a legitimate issue. Do you agree? What are the next steps?

mkboudreau commented 3 years ago

@vladak any update on this issue?

vladak commented 3 years ago

sorry, no bandwidth to work on this currently.

vladak commented 3 years ago

> The projects-enabled search by SearchEngine however also seems undesirably expensive in that it manifests every document found even though /api/v1/search will later filter to a page of (by default) 1000 results.

This could lead to #1806 I think.