mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

chore: ES reindex implementation #343

Closed thepsalmist closed 2 weeks ago

thepsalmist commented 1 month ago

This PR addresses Issue #340

kilemensi commented 1 month ago

My (very) quick thoughts:

  1. Wouldn't the script be better in Python? (It could just be me and my sh allergies.)
  2. How do we plan on using this script?
     i. Would we want to use it to reindex documents matching a certain query? In that case, the script should support a query.
     ii. Do we care about any impact running this script could have on the cluster? Then the script should support throttling, slicing, etc.
thepsalmist commented 1 month ago

On slicing: the slices parameter defaults to 1, meaning the task isn't sliced into subtasks. I think it's better to set this to auto and let ES choose a reasonable number for most indices.
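
To make the query and slicing options concrete, here is a minimal sketch of how a Python version of the script might assemble the Reindex API request. It only builds the request and does not send it. The function name, the index names, and the `language` field in the query are hypothetical placeholders, not taken from this PR.

```python
import json
from urllib.parse import urlencode


def build_reindex_request(source_index, dest_index, query=None, slices="auto"):
    """Return (method, path, body) for an Elasticsearch _reindex call.

    Hypothetical helper: it assembles the request but does not send it.
    """
    source = {"index": source_index}
    if query is not None:
        # Restrict the reindex to documents matching this query.
        source["query"] = query
    body = {"source": source, "dest": {"index": dest_index}}
    # slices=auto lets Elasticsearch pick the number of subtasks;
    # wait_for_completion=false runs the reindex as a background task.
    params = {"slices": slices, "wait_for_completion": "false"}
    return "POST", "/_reindex?" + urlencode(params), json.dumps(body)


method, path, body = build_reindex_request(
    "old_index",  # hypothetical source index
    "new_index",  # hypothetical destination index
    query={"term": {"language": "en"}},  # hypothetical field and value
)
print(method, path)
print(body)
```

With wait_for_completion=false, the response carries a task ID that can be used later to check on or rethrottle the task.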

thepsalmist commented 3 weeks ago

On the support for throttling, I'm not sure we want to set requests_per_second on the initial reindex request without knowing the impact on the cluster. Since we've settled on Kibana being part of the cluster, I believe we can adjust this afterwards using the Rethrottle API, based on the monitoring stats.
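
As a rough illustration of that approach, the sketch below builds a Rethrottle API request for a reindex task that is already running. It is an assumption about how a helper could look, not code from this PR. The function name and the task ID are hypothetical, and the request is only constructed, not sent.

```python
from urllib.parse import quote, urlencode


def build_rethrottle_request(task_id, requests_per_second):
    """Return (method, path) for an Elasticsearch reindex rethrottle call.

    Hypothetical helper. A requests_per_second of -1 disables throttling.
    """
    params = urlencode({"requests_per_second": requests_per_second})
    # Task IDs have the form "<node_id>:<task_number>", so keep the colon.
    return "POST", f"/_reindex/{quote(task_id, safe=':')}/_rethrottle?{params}"


# Hypothetical task ID, as returned by a reindex started with
# wait_for_completion=false.
method, path = build_rethrottle_request("r1A2WoRbTwKZ516z6NEs5A:36619", 500)
print(method, path)
```

The value passed for requests_per_second would come from whatever the monitoring stats suggest the cluster can absorb; 500 here is only a placeholder.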