mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Ft/elasticsearch ilm #211

Closed thepsalmist closed 10 months ago

thepsalmist commented 10 months ago

Implementation of managing Elasticsearch Indexes through an ILM policy

  1. We introduce a conf/elasticsearch directory to hold our configuration files
  2. Indexing at the importer is now done against an alias that maps to all indices created by the ILM policy & index template (year-based index creation & routing has been discarded)
  3. The Elasticsearch setup is done via an elasticsearch-conf service (built from the common worker Dockerfile)
philbudne commented 10 months ago
comments/questions on PR #211

Does the recent discovery that text content is currently limited to
under 32K need to be addressed?

== docker/docker-compose.yml.j2 line 64

The only place I see use of ELASTICSEARCH_INDEX_NAME_ALIAS is in importer.py, so
why pass it to all containers, and not just importer-worker?

But the value "mc_search" is wired into
conf/elasticsearch/create_initial_index.json and
conf/elasticsearch/create_index_template.json, so having it be a
configuration item will give someone the mistaken idea that it can be
easily changed!  Maybe make INDEX_NAME_ALIAS a global constant
variable in importer and note that the name is wired into the above
conf files?
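
Something like this sketch (the constant name and comment are just a
suggestion, not existing code):

    # NOTE! "mc_search" is also wired into
    # conf/elasticsearch/create_initial_index.json and
    # conf/elasticsearch/create_index_template.json;
    # change all three together.
    INDEX_NAME_ALIAS = "mc_search"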

If ELASTICSEARCH_SHARDS and ELASTICSEARCH_REPLICAS are no longer used
by importer.py then can all of the following go away?

docker/deploy.sh:267:    ELASTICSEARCH_IMPORTER_REPLICAS=1
docker/deploy.sh:268:    ELASTICSEARCH_IMPORTER_SHARDS=30
docker/deploy.sh:287:    ELASTICSEARCH_IMPORTER_REPLICAS=1
docker/deploy.sh:288:    ELASTICSEARCH_IMPORTER_SHARDS=5
docker/deploy.sh:313:    ELASTICSEARCH_IMPORTER_REPLICAS=0
docker/deploy.sh:314:    ELASTICSEARCH_IMPORTER_SHARDS=2
docker/deploy.sh:546:add ELASTICSEARCH_IMPORTER_REPLICAS int
docker/deploy.sh:547:add ELASTICSEARCH_IMPORTER_SHARDS int

docker/docker-compose.yml.j2:343:      ELASTICSEARCH_REPLICAS: {{elasticsearch_importer_replicas}}
docker/docker-compose.yml.j2:344:      ELASTICSEARCH_SHARDS: {{elasticsearch_importer_shards}}

Does this mean all deployments (ie; development and staging) now get
30 shards and 2 replicas?

Looking at conf/elasticsearch/create_index_template.json I see 30
shards and 30 replicas????

conf/elasticsearch/create_index_template.json:8:      "number_of_shards": 30,
conf/elasticsearch/create_index_template.json:9:      "number_of_replicas": 30

== indexer/scripts/elastic-conf.py

Can you add an explanation in the initial docstring why this is now a
separate program?  What happens to any importer processes if
elastic-conf hasn't run first?  It would be preferable to check and
sleep for a minute rather than die with an exception that sentry.io
picks up and broadcasts to the slack alerts channel.
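
A rough sketch of the kind of wait loop I mean (the function name,
alias check and timing are assumptions, not the actual code):

    import time
    from logging import getLogger

    logger = getLogger("importer")

    def wait_for_es_config(client, alias="mc_search", sleep_sec=60):
        # hypothetical helper: block until elastic-conf has created the
        # alias, instead of raising an exception that sentry.io turns
        # into a slack alert
        while not client.indices.exists_alias(name=alias):
            logger.info("alias %s not found; waiting for elastic-conf", alias)
            time.sleep(sleep_sec)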

Multiple importers may run in some configurations (especially archive
and historic ingest), so having only one program doing configuration
might make for less confusion.

> logger = getLogger("elastic-stats")
Needs to be updated?

> index_template_path = f"{ELASTICSEARCH_CONF_DIR}/create_index_template.json"
> ilm_policy_path = f"{ELASTICSEARCH_CONF_DIR}/create_ilm_policy.json"
> initial_index_template = f"{ELASTICSEARCH_CONF_DIR}/create_initial_index.json"

I think it's a better habit to use os.path.join
ie; os.path.join(ELASTICSEARCH_CONF_DIR, "file_name.json")
than string formatting to create paths.
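
For the three paths quoted above, that would be:

    import os

    index_template_path = os.path.join(ELASTICSEARCH_CONF_DIR, "create_index_template.json")
    ilm_policy_path = os.path.join(ELASTICSEARCH_CONF_DIR, "create_ilm_policy.json")
    initial_index_template = os.path.join(ELASTICSEARCH_CONF_DIR, "create_initial_index.json")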

> def read_file(self, file_path: str) -> Union[dict, Any]:
>    with open(file_path, "r") as file:
>        data = file.read()
>    return json.loads(data)

could be:
> def read_file(self, file_path: str) -> Union[dict, Any]:
>    with open(file_path, "r") as file:
>        return json.load(file)

At the end:
>    app = ....
>    ...
>    app.main()

Can now use:
from indexer.app import run
....
    run(AppClass, "program", "description")

== importer.py

I think all instances of "Mapping[" can be replaced with "Dict[" and
it would make things cleaner and clearer (no need to make copies to
change an item).
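
To illustrate (toy example, not the actual importer code): mypy treats
Mapping as read-only, so any change forces a copy, whereas Dict can be
updated in place.

    from typing import Any, Dict, Mapping

    def with_mapping(data: Mapping[str, Any]) -> Dict[str, Any]:
        copy = dict(data)               # must copy before changing anything
        copy["indexed_date"] = "2024-01-01"
        return copy

    def with_dict(data: Dict[str, Any]) -> None:
        data["indexed_date"] = "2024-01-01"   # in-place update type-checks fine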

Is there any reason to keep ElasticsearchConnector as a class?  It has
one method, which simply calls client.create without making any
changes to the values.

line 78: logger.warning(f"Value for key '{key}' is not provided.")

It's unlikely that warnings will be disabled, but it's preferable to
always use logger.<level>("format with %s and %d", var1, var2) because
the formatting can be skipped entirely if the logger has that level
disabled.
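
So the line 78 warning above would become (same message, just lazy
formatting):

    logger.warning("Value for key '%s' is not provided.", key)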

line 81:
keys_to_skip could be a global set (set at top of file)
KEYS_TO_SKIP = {"is_homepage", "is_shortened"}

line 106:
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
Please use the new function from mcmetadata.

lines 110-116

Thanks for commenting out these lines!!!  It made it clear the the
change that added them was ONLY effecting the string being passed to
the "index_routing" method, and not the metadata being put into ES!!!

Because the importer can be passed archived data that has different
"shapes" (thanks to Rahul for the term), including Story objects being
replayed from WARC files (as we're going to do to restore the data),
we need to handle any and all configurations of values that might lurk
in the archive files.

Here is my suggestion for replacing lines 110-116:

    old_pub_date = data.get("publication_date")
    # must handle "None" from archives!
    if old_pub_date is None or old_pub_date == "None":
        # here without publication_date from parser
        # use data from RSS entry, available for about 89% of stories
        rss_pub_date = story.rss_entry().pub_date
        if rss_pub_date:
            # generated by rss-fetcher, should always be RFC-(2)822 format, UTC
            # let exception cause retry and quarantine
            pub_date = datetime.strptime(rss_pub_date, "%a, %d %b %Y %H:%M:%S %z").strftime("%Y-%m-%d")
        else:
            pub_date = None
        if old_pub_date != pub_date:
            data["publication_date"] = pub_date
            with content_metadata:
                content_metadata.publication_date = pub_date

    # force UTC, ONLY the date!
    data["indexed_date"] = datetime.utcnow().strftime("%Y-%m-%d")

I'm hoping that last line will change to something like:

    indexer_metadata = story.indexer_metadata()
    if indexer_metadata.indexed_date:
        indexed_date = indexer_metadata.indexed_date  # from archive file
    else:
        indexed_date = datetime.utcnow().strftime("%Y-%m-%d")
    data["indexed_date"] = indexed_date
    with indexer_metadata:
        indexer_metadata.indexed_date = indexed_date
        indexer_metadata.url_hash = url_hash

That is, use "indexed_date" from archive/historic data files and pass
importer generated metadata to archiver.
thepsalmist commented 10 months ago

If ELASTICSEARCH_SHARDS and ELASTICSEARCH_REPLICAS are no longer used by importer.py then can all of the following go away?

Yes, since these settings are baked into the index creation template, these config variables are going away.

Looking at conf/elasticsearch/create_index_template.json I see 30 shards and 30 replicas????

We intend to have 30 shards each with 1 replica, so 30,1 is correct

philbudne commented 10 months ago

I appreciate the update to allow different configurations depending on deployment type, but I think having three sets of largely identical files is at best a headache (having to perform edits three times, having the files diverge in small ways so that diff'ing them isn't easy) or a nightmare (people forgetting to edit one set of files, bitrot leading to breakage), and that in the end (but NOT necessarily at this moment, since time is of the essence) having a single set of "template" files used to generate three sets of configs is a better approach. BUT I can easily imagine that Jinja2 and JSON don't mix well (two different uses of curly braces!), and there is the question of WHEN to expand the templates: in the Dockerfile? When/before elastic-conf runs??
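
For what it's worth, a toy sketch of how the expansion could work (the
file contents and variable names are invented here); plain JSON braces
pass through untouched, only the {{ ... }} spans get substituted:

    from jinja2 import Template

    template_text = """
    {
      "settings": {
        "number_of_shards": {{ number_of_shards }},
        "number_of_replicas": {{ number_of_replicas }}
      }
    }
    """

    print(Template(template_text).render(number_of_shards=30, number_of_replicas=1))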

kilemensi commented 10 months ago

I appreciate the update to allow ...

Yeah, it will be nice if we can keep it DRY.

A wild thought: Can't we have just 3 python dicts that we update/change via args/env var similar to what we're doing already? The only variables that differ between environments are:

Policy

  1. max_primary_shard_size
  2. max_age

Template

  1. number_of_shards
  2. number_of_replicas

Index

Nothing?


I'm assuming the deploy script can provide the above values to elastic-conf per environment.
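
A very rough sketch of what I mean (the env var names and the ILM
nesting here are made up for illustration, not what's in
conf/elasticsearch):

    import os

    ILM_POLICY = {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_primary_shard_size": os.environ.get("ES_MAX_PRIMARY_SHARD_SIZE", "50gb"),
                            "max_age": os.environ.get("ES_MAX_AGE", "30d"),
                        }
                    }
                }
            }
        }
    }

    INDEX_TEMPLATE_SETTINGS = {
        "number_of_shards": int(os.environ.get("ES_NUMBER_OF_SHARDS", "30")),
        "number_of_replicas": int(os.environ.get("ES_NUMBER_OF_REPLICAS", "1")),
    }

    # Index: nothing environment-specific, so no overrides needed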

kilemensi commented 10 months ago

Looking at the importer @thepsalmist, do we even need ElasticsearchConnector anymore? Using the client directly may be all we need:

                response = self.connector.index(url_hash, target_index, data)

vs

                response = self.client.create(target_index, url_hash, document=data)
thepsalmist commented 10 months ago

Looking at the importer @thepsalmist, do we even need ElasticsearchConnector anymore? Using the client directly may be all we need:

                response = self.connector.index(url_hash, target_index, data)

vs

                response = self.client.create(target_index, url_hash, document=data)

Resolved