Closed m453h closed 1 week ago
Some thoughts/questions:
- Would it be possible/practical to take a `query_string` instead of DSL? It might make it (too?) easy to use?!
- Regarding the example, there are a few mediacloud.org pages in the index (for whatever reason), so also querying for `original_url` of `http://mediacloud.org/need_canonical_domain` would eliminate them
- The scroll API page you reference says to avoid using it for more than 10K hits; is this something we can safely ignore??
Hey @philbudne, interesting thoughts and questions:
On adding query string: I believe I can modify the script so that it supports both DSL and `query_string`. The user would pass only the query string, as:
original_url:"http://mediacloud.org/need_canonical_url" AND canonical_domain:"blogdomago.com"
But we would construct the query as:
    "query": {
        "query_string": {
            "query": "original_url:\"http://mediacloud.org/need_canonical_url\" AND canonical_domain:\"mediacloud.org\""
        }
    }
With this validation, execution would remain the same without needing much modification. Let me work on this and push the changes.
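A minimal sketch of that wrapping step, to make the idea concrete. The helper name `build_query` and the `is_query_string` flag are hypothetical, and the assumption that DSL input arrives as a JSON string with a top-level `query` key is mine; the actual script may handle its arguments differently.

```python
import json
from typing import Any, Dict


def build_query(raw: str, is_query_string: bool = False) -> Dict[str, Any]:
    """Return a search body for either input style (hypothetical helper)."""
    if is_query_string:
        # Wrap the user's Lucene-style string in a query_string query.
        return {"query": {"query_string": {"query": raw}}}
    # Assumption: DSL input is a JSON object that already has a "query" key.
    body = json.loads(raw)
    if not isinstance(body, dict) or "query" not in body:
        raise ValueError('DSL input must be a JSON object with a top-level "query" key')
    return body


if __name__ == "__main__":
    qs = (
        'original_url:"http://mediacloud.org/need_canonical_url" '
        'AND canonical_domain:"mediacloud.org"'
    )
    # json.dumps takes care of escaping the inner double quotes.
    print(json.dumps(build_query(qs, is_query_string=True), indent=2))
```

Either way the rest of the script receives the same kind of dictionary, which is why execution would not need to change.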
Regarding the example: The script accepts the query as an argument, so we can always change it as we see fit to match the documents that need updating at the time the script is executed.
Regarding the use of the Scroll API: I did see the warning, and I did a bit more research before the implementation. In most cases, the issue with the Scroll API boils down to maintaining the search context on the server, which is costly. I went with the Scroll API since:
I'm open to making changes and using search_after instead.
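For comparison, a rough sketch of what paging with search_after could look like. The `search` argument is a hypothetical callable standing in for a thin wrapper around the client's search call, and `iter_hits` is an illustrative name, not something in this PR. It assumes the sort includes a unique tiebreaker field so that pages do not overlap.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def iter_hits(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    query: Dict[str, Any],
    sort: List[Dict[str, str]],
    page_size: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Yield every hit by paging with search_after (no server-side scroll context)."""
    search_after: Optional[List[Any]] = None
    while True:
        body: Dict[str, Any] = {"size": page_size, "query": query, "sort": sort}
        if search_after is not None:
            # Resume after the sort values of the last hit on the previous page.
            body["search_after"] = search_after
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]
```

Each request is stateless, which is the main trade-off against scroll: nothing is held open on the server, but the result set is not a frozen snapshot unless it is combined with something like a point in time.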
This PR introduces a script for updating incorrect canonical domains for documents stored in Elasticsearch.
The implementation of this script uses:
- The scroll API to retrieve documents in batches with a configurable batch size.
- The Bulk Helpers to perform batch updates across multiple documents simultaneously, significantly reducing the number of requests sent to Elasticsearch.
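The two pieces above fit together roughly as in the sketch below. This is not the script itself: `update_actions` and the `domain_for` callable are hypothetical names, and how the corrected domain is derived from a document is left as an assumption. The commented usage relies on `helpers.scan` and `helpers.bulk` from the `elasticsearch` Python package.

```python
from typing import Any, Callable, Dict, Iterable, Iterator


def update_actions(
    hits: Iterable[Dict[str, Any]],
    domain_for: Callable[[Dict[str, Any]], str],
) -> Iterator[Dict[str, Any]]:
    """Turn search hits into partial-update actions for the bulk helper."""
    for hit in hits:
        yield {
            "_op_type": "update",  # partial update rather than a full reindex
            "_index": hit["_index"],
            "_id": hit["_id"],
            # domain_for is a hypothetical callable that computes the fix.
            "doc": {"canonical_domain": domain_for(hit.get("_source", {}))},
        }


# Possible usage against a live cluster (not executed here):
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("http://localhost:9200")  # assumed address
#   hits = helpers.scan(es, index="my-index", query=body, scroll="5m", size=1000)
#   helpers.bulk(es, update_actions(hits, domain_for))
```

Because both sides are generators, documents stream from the scroll into the bulk requests without being held in memory all at once.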
Example usage of the script:
Addresses #345
====
Update
The following improvements have been made following observations from @philbudne:
`query_string` support has been added, so the script can also be called as follows: