sonatype / nexus-public

Sonatype Nexus Repository Open-source codebase mirror
https://www.sonatype.com/products/repository-oss-download
Eclipse Public License 1.0

Search query limited to 10000 records #357

Open tanganellilore opened 4 months ago

tanganellilore commented 4 months ago

Hi team,

I noticed a bug in the search query; it is probably connected to this issue and to this old, never-migrated one: https://issues.sonatype.org/browse/NEXUS-16917.

I noticed that if I perform a paginated search query against a large repository with more than 10,000 elements, I receive an error from the API like this:

RemoteTransportException[[159FCCBA-DE3F55B4-695C3AB7-3D759962-AA738D59][local[1]][indices:data/read/search[phase/query]]]; nested: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [10050]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.]; }

The log above is just one example; the full log is very long and the error is repeated many times.

Any suggestions on how to solve it?

Thanks

P.S. The repo has multiple folders with a lot of Docker images, and I need to extract all of them.
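A minimal sketch of the usual workaround for the 10,000-record search window: instead of offset-style paging through the search API, walk the `/service/rest/v1/assets` endpoint with its `continuationToken` cursor, which streams results without a deep-paging offset. The endpoint and token field come from the Nexus REST API; the base URL, credentials, and repository name are placeholders.

```python
def iter_assets(fetch_page, repository):
    """Yield every asset in `repository`, following continuation tokens.

    `fetch_page(repository, token)` must return the decoded JSON page,
    shaped like the assets API response:
    {"items": [...], "continuationToken": <str or None>}.
    """
    token = None
    while True:
        page = fetch_page(repository, token)
        yield from page["items"]
        token = page.get("continuationToken")
        if token is None:
            return


# Example fetch_page using requests (URL/credentials are placeholders):
#
# import requests
#
# def fetch_page(repository, token):
#     params = {"repository": repository}
#     if token:
#         params["continuationToken"] = token
#     r = requests.get("https://nexus.example.com/service/rest/v1/assets",
#                      params=params, auth=("user", "pass"), timeout=30)
#     r.raise_for_status()
#     return r.json()
```

Because the cursor is opaque and server-driven, this pattern never asks Elasticsearch for a result window deeper than one page, so it sidesteps the `max_result_window` error entirely.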

tanganellilore commented 3 months ago

Probably related to this: https://github.com/sonatype/nexus-public/blame/26b9f7155c65c503129f6c6fdf2610d21b8e80be/components/nexus-repository-services/src/main/java/org/sonatype/nexus/repository/search/query/ElasticSearchQueryServiceImpl.java#L100

nblair commented 3 months ago

Thanks for opening an issue @tanganellilore. The limit applied to search responses is intentional - such large datasets don't scale well for a system with an embedded database and embedded search engine. Without that in place, it's a recipe for OOM, which can cause the application to fail unexpectedly and result in database/index corruption and/or data loss.

What is your use case for queries that have such large result sets?

elmbrain commented 3 months ago

We have the same problem. The repository contains many artifacts and we need to search across all of them. Users should be able to decide for themselves how to limit the output. Previously, the index.max_result_window parameter in the Elasticsearch configuration file worked, and it was a revelation to us that it is now broken. It is unclear why the parameter was hardcoded directly in the code. Please expose the parameter at the configuration-file level so that it can be changed.
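For reference, the setting elmbrain describes is a standard Elasticsearch index-level parameter; on the older Elasticsearch versions Nexus embeds, its default could historically be raised in the node configuration like this (illustrative only; as noted in the comment, the value is now hardcoded in Nexus, so this no longer takes effect):

```yaml
# elasticsearch.yml (embedded instance) -- illustrative sketch only.
# Raises the from + size ceiling that produces the
# "Result window is too large" error above.
index.max_result_window: 50000
```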

tanganellilore commented 3 months ago

> Thanks for opening an issue @tanganellilore. The limit applied to search responses is intentional - such large datasets don't scale well for a system with an embedded database and embedded search engine. Without that in place, it's a recipe for OOM, which can cause the application to fail unexpectedly and result in database/index corruption and/or data loss.
>
> What is your use case for queries that have such large result sets?

Hi @nblair, I only noticed the answer now, sorry for the delay. I simply need to export all the metadata of all assets in all repos and subfolders (checksum, last download, etc.) and save it in an external DB, to track changes and deletions for an internal process (without using the audit webhook).

In my case I have a big repo with a lot of subfolders, each holding around 30 to 100 assets, so the repo contains more than 10k elements. Via the API we can't simply get a list of subpaths in the repo to iterate over (which would reduce the number of assets per query), and this is why I receive this error on the call.

I understand that this limit exists to avoid OOM, but with the API we have no way to work around it.

For my use case I used a Groovy script that can be called and returns this type of object per repository, but I noticed that we get a warning there as well.
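The change/deletion tracking described in this thread can be sketched as a diff between two metadata snapshots taken on successive export runs. The helper below is hypothetical (not part of Nexus or its API); the `{asset_id: checksum}` shape loosely mirrors the fields the assets API returns.

```python
def diff_snapshots(previous, current):
    """Compare two {asset_id: checksum} snapshots from successive exports.

    Returns (added, changed, deleted) sets of asset ids, which is enough
    to track changes and deletions without the audit webhook.
    """
    prev_ids, curr_ids = set(previous), set(current)
    added = curr_ids - prev_ids
    deleted = prev_ids - curr_ids
    # An asset present in both snapshots changed if its checksum differs.
    changed = {a for a in prev_ids & curr_ids if previous[a] != current[a]}
    return added, changed, deleted
```

Feeding each run's export into this diff and persisting only the deltas keeps the external DB small, regardless of how many assets the repository holds.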