Implement canonical domain update script

m453h commented 3 weeks ago

This PR introduces a script for updating incorrect canonical domains for documents stored in Elasticsearch.

The implementation of this script uses:

The scroll API for the retrieval of documents in batches with configurable batch size.
The Bulk Helpers to perform batch updates across multiple documents simultaneously, significantly reducing the number of requests sent to Elasticsearch.

Example of the usage of the script :

./bin/run-elastic-update-canonical-domain.sh --elasticsearch-hosts  http://localhost:9200 --index=mc_search-000001 --buffer=2000 --batch=1000 --query='{                                                                                                                        
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "canonical_domain": "mediacloud.org"
                    }
                }
            ]
        }
    }
}'

Addresses #345

====

Update

The following improvements have been done following observations from @philbudne :

To address concerns regarding the usage of scroll API the implementation has been updated to use search_after to retrieve documents that should be updated.

Support for query_string has been added thus the script can also be called as follows:

./bin/run-elastic-update-canonical-domain.sh --elasticsearch-hosts  http://localhost:9200 --index=mc_search-000001 --buffer=2000 --batch=1000 --query='canonical_domain:mediacloud.org AND original_url:"http\://mediacloud.org/need_canonical_url"' --type=query_string

philbudne commented 2 weeks ago

Some thoughts/questions:

Would it be possible/practical to take a query_string instead of DSL? It might make it (too?) easy to use?!
Regarding the example, there are a few mediacloud.org pages in the index (for whatever reason), so also querying for original_url of http://mediacloud.org/need_canonical_domain would eliminate them
The scroll API page you reference says to avoid using it for more than 10K hits; is this something we can safely ignore??

m453h commented 2 weeks ago

Some thoughts/questions:

Would it be possible/practical to take a query_string instead of DSL? It might make it (too?) easy to use?!

Regarding the example, there are a few mediacloud.org pages in the index (for whatever reason), so also querying for original_url of http://mediacloud.org/need_canonical_domain would eliminate them

The scroll API page you reference says to avoid using it for more than 10K hits; is this something we can safely ignore??

Hey @philbudne interesting thoughts and questions:

On adding query string: I believe I can modify the script so that we could support both DSL and query_string, the user would only pass the query_string as: original_url:"http://mediacloud.org/need_canonical_url" AND canonical_domain:"blogdomago.com"

But we would construct the query as:
```
"query": {
       "query_string": {
            "query": "original_url:\"http://mediacloud.org/need_canonical_url\" AND canonical_domain:\"mediacloud.org\""
          }
 }
```
With this validation, execution would just remain the same without needing a lot of modification. Let me work on this and push the changes.
Regarding the example: The script accepts query as an argument so we can always change the query as we see fit to match the documents that will need to be updated at the time of executing the script.
Regarding the use of Scroll API: I did see the warning, regarding scroll API, I did a bit of more research before the implementation in most cases I see the issue with scroll API boils down to maintaining the search context on the server which is costly.I went with Scroll API since:
- The script would be executed once as a maintenance task, meaning we avoid the risk of having too many search contexts open. Based on the documentation, it states: "Scrolling is not intended for real-time user requests, but rather for processing large amounts of data, such as reindexing the contents of one data stream or index into a new data stream or index with a different configuration."
- Scroll API is provided by the scan helper out of the box which seems to best fit this scenario.
I'm open to making changes and utilize search_after.

mediacloud / story-indexer

Implement canonical domain update script #348

Update