Background context here
Unlike in the current setup, the hot-warm architecture would use the hot nodes for ingestion (we would need at least 2 nodes configured as hot nodes for high availability).
Search is parallelized across both the hot and warm nodes, so at least 2 warm nodes would also be needed for high availability.
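For concreteness, a minimal sketch of what that role split could look like in each node's `elasticsearch.yml` (assuming ES 7.10+ data tier roles; which physical machines get which roles is still an open question):

```yaml
# On each of the (at least) two hot nodes: take writes and hold the newest indices
node.roles: [ master, data_hot, data_content, ingest ]

# On each of the (at least) two warm nodes: hold older, rarely-written indices
node.roles: [ master, data_warm ]
```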
Based on Mediacloud’s use case, do we foresee a scenario where we roll over some indices to the cold/frozen tiers, bearing in mind that all of the data will always be searched? Do you have an opinion on whether we should implement a cold and/or frozen node now?
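For what it's worth, data on a cold tier stays fully searchable; demoting an index is just an allocation setting. A sketch, with a hypothetical index name:

```
PUT mc_search-000042/_settings
{
  "index.routing.allocation.include._tier_preference": "data_cold,data_warm,data_hot"
}
```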
Our initial ingestion will have mixed data from both historical years (pre-2023) and the current year (2024); as we progress, the indices should hold roughly linear data by publication year. This will affect any search/aggregation based on publication dates/years, since it implies searching across all indices rather than a specific index_year as exists currently.
Context: assuming our rollover is based on a max shard size of 50 GB, the first 5-10 indices we create when we start re-indexing will contain mixed data (based on publication dates). Would a query such as `qs=(NBA)%2520-(Basketball)%2C*&start=01-02-2023%2C12-09-2023&end=01-11-2023` mean searching across all 5-10 of those indices?
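To make that concrete, a sketch of roughly what such a request translates to against rolled-over indices (the `mc_search-*` pattern and `publication_date` field name are assumptions; dates illustrative). Because publication dates are mixed across the first indices, the date filter can't be mapped to a single index, so the search fans out to every index matching the pattern. Elasticsearch does pre-filter shards whose date ranges can't match (the `can_match` phase), which softens the cost.

```
GET mc_search-*/_search
{
  "query": {
    "bool": {
      "must": { "query_string": { "query": "(NBA) -(Basketball)" } },
      "filter": { "range": { "publication_date": { "gte": "2023-02-01", "lte": "2023-11-01" } } }
    }
  }
}
```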
Based on Q1 & Q2 above, and given the current hardware that we have available (Ramos, Bradley & Woodward), do we have extra hardware to support a scenario where we set up a minimum of two (2) hot/content nodes, two (2) warm nodes, and, if needed, cold & frozen nodes?
Weighing in on 5: tarbell (MC web server) is the only server that we have with large attached storage (187 GB RAM, 18 TB RAID5 attached). That would give us 4 dedicated DB servers if we migrate web/rss-fetcher to something else. Options for that include steinam (Evan's staging server?), lowery (unused?), or posey (CFA dev?). Those each have 3.5 TB of internal storage, which might be fine for web/rss-fetcher needs.
Can we just run three hot servers?
I don't think migration is meaningful for us.
We could run three hot nodes, as mentioned above. But looking at other implementations and suggestions, that points to poor search performance. Example here
My #1 concern is that the current hardware needs to be treated as suspect, and having only two nodes on any tier seems like an invitation to an emergency situation (one node left on a tier).
Questions that come to mind (I realize there may not be ready answers):
Is there any indication that the current (three nodes, all writers) configuration is a performance disaster waiting to happen?
Without new hardware, how far into the future can we go (after reloading data back to 2008)? In terms of storage (on a single three-node tier), I seem to recall the answer was a few years? Can we make any prediction of what search performance might be like? Can any predictions be made for one three-node tier vs. two two-node tiers?
How quickly are we likely to be making new indices, and archiving the old ones? If we're creating new indices (and archiving old ones) quickly, that might alter our perspective on risk.
What prospects (if any) exist for repair of existing arrays, should they fail?
What would ideal hardware (space for growth, and/or a plan on how to expand for growth) look like?
What is the minimum time range for data that researchers consider useful?
And finally, if "cold" storage is in the mix: what data amount/age (if any) would it be acceptable to have "less easily/quickly searchable"?
To answer some of the questions raised above, here are figures based on the data size that we've ingested for 2023.
Storage questions
The sizes of all the shards for the 2023 index (primary & replica) total ~366 GB. That covers 3 months of data, so an educated guess for a full year would be ~1.46 TB (366 GB × 4). Total provisioned storage across Ramos, Bradley & Woodward is ~162 TB (70 + 70 + 22).
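Back-of-envelope, assuming 2023's volume is roughly representative of other years: reloading back to 2008 would be about 16 years × 1.46 TB ≈ 23 TB (primary + replica), comfortably within the ~162 TB provisioned, so raw storage is unlikely to be the first constraint.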
Is there any indication that the current (three nodes, all writers) configuration is a performance disaster waiting to happen?
Not really. The current configuration, in which all nodes serve as both master-eligible and data nodes, provides availability by automatically distributing the shards and replicas across the nodes, as in the screenshot above. Two copies of the same shard (primary & replica) are never placed on the same node. Worst case, at least 2 nodes must be up at a time to allow for master election; in such a case Elasticsearch would redistribute our shards across the available nodes.
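This is easy to verify from the cat API, e.g. (index pattern hypothetical):

```
GET _cat/shards/mc_search-2023*?v&h=index,shard,prirep,state,node
```

Any pair of rows with the same index and shard number should show `p` and `r` on different nodes.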
For discussion (in connection to #198 and #168): Via @phil on slack, who asks good questions about the decision to architect indexes by publication year (#67). An alternative would be to freeze indexes after some time, or at some certain size (story count or KB), so we're constantly appending to the latest index, and all prior ones are frozen and backed up as stacked snapshots.
A quick glance at ES index lifecycle management (ILM) suggests there might be built-in support for managing this automatically.
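For reference, a minimal ILM sketch of that idea, using the 50 GB shard size discussed above (policy name and phase timing are placeholders):

```
PUT _ilm/policy/mc_search_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "readonly": {}
        }
      }
    }
  }
}
```

The policy would be attached via an index template (`index.lifecycle.name` plus `index.lifecycle.rollover_alias`), after which new indices are created and aged out without manual intervention.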
For discussion.