Schema Updates and Changes to Existing Stage Indexing Processes:
Rename update_stage_index to index_collection and add alias as a parameter to index_collection to generalize this function.
Rename add_page to index_page, change the bulk opensearch action from create to index, and pass refresh: true as a parameter to the bulk opensearch request. The index action will create opensearch documents if they don't already exist, and will overwrite opensearch documents if they do exist. The refresh: true parameter will make re-indexed documents available immediately on all shards (at an index-time cost to the cluster), so that our subsequent delete by query request runs against the most up-to-date version of all records.
Add page, version_path, and indexed_at to records just prior to indexing. indexed_at is defined as the datetime at the start of indexing.
Move the invocation of delete_collection_records_from_index to happen just after the bulk indexing, and update delete_collection_records_from_index to first query for "outdated records" (records with the given collection ID that don't match the given data version) before deleting all of these outdated records. The query lets us report the version of each outdated record.
Add the version and index to the SNS event sent to the registry at the end of the Airflow update_stage_index_for_collection_task.
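The staging changes above can be sketched roughly as follows. This is a minimal illustration and not the actual Rikolti code: the helper names (enrich_records, build_bulk_body, outdated_records_query), the "id" and "collection_id" field names, and the record shape are assumptions made for the example. Only the index bulk action and the page, version_path, and indexed_at fields come from the description.

```python
import json


def enrich_records(records, page, version_path, indexed_at):
    """Add page, version_path, and indexed_at to each record just before indexing."""
    return [
        {**record, "page": page, "version_path": version_path, "indexed_at": indexed_at}
        for record in records
    ]


def build_bulk_body(records, alias):
    """Build an NDJSON bulk body using the `index` action.

    `index` creates a document if it is missing and overwrites it if it
    already exists; `create` would fail on an existing document ID.
    """
    lines = []
    for record in records:
        # Assumption: each record carries its document ID under "id".
        lines.append(json.dumps({"index": {"_index": alias, "_id": record["id"]}}))
        lines.append(json.dumps(record))
    # The bulk API expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"


def outdated_records_query(collection_id, version_path):
    """Match records in the collection whose version_path is not the current one."""
    return {
        "query": {
            "bool": {
                # Assumption: "collection_id" is the field holding the collection ID.
                "filter": [{"term": {"collection_id": collection_id}}],
                "must_not": [{"term": {"version_path": version_path}}],
            }
        }
    }
```

The body from build_bulk_body would be sent to the bulk endpoint with refresh=true, and the query from outdated_records_query would be run once as a search (to report outdated versions) and then reused for the delete by query request.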
Migration script:
Adds the values version_path: initial, indexed_at: <time migration script started>, and page: unknown to records already in the index via a re-index.
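One way such a migration could be expressed is as a Reindex API request body with a Painless script that sets the three fields. This is only a sketch: the description says the values are added "via a re-index" but does not say how, so the use of a script here, the function name, and the index names in the usage note are all assumptions.

```python
def build_migration_reindex_body(source_index, dest_index, migration_started_at):
    """Build a reindex request body that backfills the new schema fields."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {
            "lang": "painless",
            # Set the backfill values on every document copied to dest_index.
            "source": (
                "ctx._source.version_path = params.version_path; "
                "ctx._source.indexed_at = params.indexed_at; "
                "ctx._source.page = params.page;"
            ),
            "params": {
                "version_path": "initial",
                "indexed_at": migration_started_at,
                "page": "unknown",
            },
        },
    }
```

For example, build_migration_reindex_body("old-index", "new-index", "2024-01-01T00:00:00+00:00") yields a body that could be posted to the _reindex endpoint; both index names are placeholders.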
Publish Processes:
Renamed update_stage_index_for_collection_task to index_collection_task to generalize between -stg and -prd index aliases.
Added stage_collection_task and publish_collection_task, which both call index_collection_task with a different alias.
Created a publish_collection DAG.
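The relationship between the three tasks can be sketched as thin wrappers around one shared function. Plain functions stand in for the Airflow tasks here, and the signatures and return value are assumptions for illustration; only the task names and the rikolti-stg and rikolti-prd aliases come from this description.

```python
STAGE_ALIAS = "rikolti-stg"
PUBLISH_ALIAS = "rikolti-prd"


def index_collection_task(alias, collection_id, version_pages):
    """Stand-in for the shared task: index a collection's pages into `alias`."""
    # Hypothetical return value; the real task would call index_collection.
    return {
        "alias": alias,
        "collection_id": collection_id,
        "pages_indexed": len(version_pages),
    }


def stage_collection_task(collection_id, version_pages):
    """Index the collection into the stage alias."""
    return index_collection_task(STAGE_ALIAS, collection_id, version_pages)


def publish_collection_task(collection_id, version_pages):
    """Index the collection into the production alias."""
    return index_collection_task(PUBLISH_ALIAS, collection_id, version_pages)
```

Keeping the alias as the only difference between the two wrappers means that stage and publish cannot drift apart in how they index.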
Pooling:
Specifies that all actions hitting OpenSearch should run in the rikolti_opensearch_pool. Since we have just one cluster across all stage and prod indices, any and all Airflow tasks hitting OpenSearch should be added to this pool. We can configure the pool using the Airflow UI, and should monitor the OpenSearch cluster's performance using the CloudWatch Dashboard.
Developer Candy:
Adds a dashboard to the docker-compose file for the record_indexer, for developer ease. To run the record_indexer locally, you must set OPENSEARCH_IGNORE_TLS=True in the environment.
Adds an initialization script to add the rikolti-stg and rikolti-prd aliases to a new opensearch cluster (as one would get when running a new docker compose).
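The alias setup performed by such an initialization script could look roughly like the request body built below, which follows the shape of the aliases API ("actions" containing "add" entries). The function name and the concrete index names in the usage note are hypothetical; the description names only the two aliases.

```python
def build_alias_actions(index_by_alias):
    """Build an aliases request body that points each alias at its index."""
    return {
        "actions": [
            {"add": {"index": index, "alias": alias}}
            for alias, index in index_by_alias.items()
        ]
    }
```

For example, build_alias_actions({"rikolti-stg": "rikolti-stg-1", "rikolti-prd": "rikolti-prd-1"}) produces a body that could be posted to the _aliases endpoint of a fresh local cluster; the "-1" index names are placeholders.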