The current experimental index is a one-shot, based on a 2020-08 export of fatcat release entities. Of course we want updates to flow from fatcat to the scholar index in the same way that entity updates currently flow to the fatcat metadata search index.
The rough plan for this feature is:
[x] two new Kafka topics: one for work identifiers needing re-indexing, and one for "heavy intermediate" JSON objects (ready for transform and indexing)
[x] changes to fatcat entity updater to find all work identifiers that need to be updated as a result of an editgroup, and publish these to the needs-updating queue. It is important that these are de-duplicated within the editgroup. The updater already fetches every release touched by an editgroup (eg, even for new file creation), so this should be easy
[x] new worker (consume+publish) to generate heavy intermediate objects per work identifier
[x] new worker (consume) to transform heavy intermediate objects to the elastic schema and send them to the index. should probably work in batches of 50 or more documents
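
As a rough illustration of the de-duplication and batching steps in the plan above, here is a minimal Python sketch. The function names (`unique_work_idents`, `batched`) and the assumption that each release is a dict carrying a `work_id` key are hypothetical and chosen for the example; this is not the actual fatcat worker code, and the Kafka and Elasticsearch client calls are deliberately left out.

```python
from typing import Dict, Iterable, Iterator, List


def unique_work_idents(releases: Iterable[Dict[str, str]]) -> List[str]:
    """Return the work identifiers touched by an editgroup, de-duplicated.

    `releases` is assumed (for this sketch) to be an iterable of dicts
    with a "work_id" key. First-seen order is preserved.
    """
    seen = set()
    ordered: List[str] = []
    for release in releases:
        work_id = release.get("work_id")
        if work_id and work_id not in seen:
            seen.add(work_id)
            ordered.append(work_id)
    return ordered


def batched(docs: Iterable[dict], size: int = 50) -> Iterator[List[dict]]:
    """Group transformed documents into lists of at most `size` items."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly short, batch
        yield batch
```

In this sketch the entity updater would publish each identifier returned by `unique_work_idents` to the needs-updating topic once per editgroup, and the indexing worker would send each list yielded by `batched` to the index as one bulk request.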
Because scholar/fulltext index updates are relatively expensive compared to regular fatcat entity index updates, we might want to consider some optimizations:
match kafka partitioning/sharding to elasticsearch partitioning/sharding (eg, based on work identifier) to minimize cross-index updates in elasticsearch
a "linger" delay between a work update and the index update, in case the work is rapidly updated again. considering all edits within an editgroup together will catch some of this, but, for example, the daily arxiv imports create new release entities and then, just an hour or two later, new file entities, and both result in work-level update requests
checking the difference between the old and new document at index time, and not re-indexing if nothing has changed. not sure if we can pull the "body" field from the index (this should not be publicly possible, at least), but we can infer that content from fulltext metadata. this would save re-indexing churn for some edits
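
The linger delay and the skip-if-unchanged check could be sketched roughly as follows. This is only an illustration under stated assumptions: `LingerBuffer`, `doc_fingerprint`, and `needs_reindex` are hypothetical names, the buffer is in-memory only (a real worker would also need to handle restarts and Kafka offsets), and the idea of keeping a fingerprint of the previously indexed document is an assumption, since the notes above only suggest inferring body content from fulltext metadata.

```python
import hashlib
import json
from typing import Dict, List, Optional


class LingerBuffer:
    """Hold work identifiers until they have been quiet for `linger_sec`."""

    def __init__(self, linger_sec: float) -> None:
        self.linger_sec = linger_sec
        self._last_seen: Dict[str, float] = {}

    def touch(self, work_id: str, now: float) -> None:
        # A repeated update restarts the delay for that work
        self._last_seen[work_id] = now

    def pop_ready(self, now: float) -> List[str]:
        # Return (and forget) works with no update in the last linger_sec
        ready = [
            work_id
            for work_id, ts in self._last_seen.items()
            if now - ts >= self.linger_sec
        ]
        for work_id in ready:
            del self._last_seen[work_id]
        return ready


def doc_fingerprint(doc: dict) -> str:
    """Hash a canonical JSON form of the document (key order ignored)."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def needs_reindex(new_doc: dict, old_fingerprint: Optional[str]) -> bool:
    """Re-index when there is no prior fingerprint or the content changed."""
    return old_fingerprint is None or doc_fingerprint(new_doc) != old_fingerprint
```

With a linger of a few hours, the arxiv case described above (a release update followed by a file update an hour or two later) would collapse into a single index update in this sketch, at the cost of index freshness for every other work.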