mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Implement canonical domain update script #348

Closed: m453h closed this 1 week ago

m453h commented 3 weeks ago

This PR introduces a script for updating incorrect canonical domains for documents stored in Elasticsearch.

The implementation of this script uses:

Example usage of the script:

./bin/run-elastic-update-canonical-domain.sh --elasticsearch-hosts http://localhost:9200 --index=mc_search-000001 --buffer=2000 --batch=1000 --query='{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "canonical_domain": "mediacloud.org"
                    }
                }
            ]
        }
    }
}'

Addresses #345
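
For illustration only, here is a minimal sketch of how a scan-then-bulk-update loop of this kind could look. It is not the code in this PR. The `url` source field, the `guess_canonical_domain` stand-in, and the mapping of `--buffer` to the scan page size and `--batch` to the bulk chunk size are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, Iterator, Optional
from urllib.parse import urlparse


def guess_canonical_domain(url: str) -> Optional[str]:
    """Hypothetical stand-in for the real canonical-domain logic:
    the lower-cased hostname with a leading 'www.' removed."""
    host = urlparse(url).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def build_update_actions(hits: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Turn search hits into partial-update actions for the bulk helper."""
    for hit in hits:
        source = hit["_source"]
        # Assumption: each document carries a usable `url` field.
        domain = guess_canonical_domain(source.get("url", ""))
        if domain is None or domain == source.get("canonical_domain"):
            continue  # nothing to fix for this document
        yield {
            "_op_type": "update",
            "_index": hit["_index"],
            "_id": hit["_id"],
            "doc": {"canonical_domain": domain},
        }


def update_canonical_domains(client, index, query, buffer=2000, batch=1000):
    """Scan matching documents and send the updates in bulk."""
    # Imported lazily so the pure helpers above run without the client library.
    from elasticsearch import helpers

    hits = helpers.scan(client, index=index, query=query, size=buffer)
    return helpers.bulk(client, build_update_actions(hits), chunk_size=batch)
```

Keeping the action-building step separate from the Elasticsearch calls means it can be checked against hand-written hits before anything is run on a real index.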

====

Update

The following improvements have been made following observations from @philbudne:

./bin/run-elastic-update-canonical-domain.sh --elasticsearch-hosts http://localhost:9200 --index=mc_search-000001 --buffer=2000 --batch=1000 --query='canonical_domain:mediacloud.org AND original_url:"http\://mediacloud.org/need_canonical_url"' --type=query_string

philbudne commented 2 weeks ago

Some thoughts/questions:

  1. Would it be possible/practical to take a query_string instead of DSL? It might make it (too?) easy to use?!
  2. Regarding the example, there are a few mediacloud.org pages in the index (for whatever reason), so also querying for original_url of http://mediacloud.org/need_canonical_domain would eliminate them
  3. The scroll API page you reference says to avoid using it for more than 10K hits; is this something we can safely ignore??

m453h commented 2 weeks ago

Hey @philbudne interesting thoughts and questions:

  1. On adding a query string: I believe I can modify the script to support both DSL and query_string. The user would pass only the query string, for example: original_url:"http://mediacloud.org/need_canonical_url" AND canonical_domain:"blogdomago.com"

    But we would construct the query as:

    "query": {
        "query_string": {
            "query": "original_url:\"http://mediacloud.org/need_canonical_url\" AND canonical_domain:\"mediacloud.org\""
        }
    }

    With this, validation and execution would remain the same without needing much modification. Let me work on this and push the changes.

  2. Regarding the example: The script accepts the query as an argument, so we can always change it to match whichever documents need updating at the time the script is run.

  3. Regarding the use of the Scroll API: I did see the warning and did a bit more research before implementing. In most cases, the issue with the Scroll API seems to boil down to the cost of maintaining the search context on the server. I went with the Scroll API because:

    • The script would be executed once as a maintenance task, so we avoid the risk of having too many search contexts open. The documentation states: "Scrolling is not intended for real-time user requests, but rather for processing large amounts of data, such as reindexing the contents of one data stream or index into a new data stream or index with a different configuration."
    • The scan helper uses the Scroll API out of the box, which seems to best fit this scenario.

    I'm open to making changes and using search_after instead.
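
For comparison, a rough sketch of what search_after pagination could look like if we went that way. This is only an illustration, not something in the PR: `iter_hits_search_after` and its `search` callable are hypothetical names, and the sort key is left to the caller because it needs a unique tiebreaker (the Elasticsearch documentation pairs search_after with a point in time for a consistent view).

```python
from typing import Any, Callable, Dict, Iterator, List


def iter_hits_search_after(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    query: Dict[str, Any],
    sort: List[Any],
    page_size: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query`, one page at a time.

    `search` takes a request body dict and returns the search response dict.
    """
    body: Dict[str, Any] = {"query": query, "sort": sort, "size": page_size}
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return  # an empty page means we are past the last result
        yield from hits
        # Resume after the sort values of the last hit on this page.
        body = {**body, "search_after": hits[-1]["sort"]}
```

With a recent Python client, `search` could be something like `lambda body: client.search(index=index, **body)`. Unlike scroll, no search context is held on the server between pages, but the script would have to manage paging itself instead of relying on the scan helper.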