A more precise architecture diagram

Here is a diagram I have arrived at after multiple discussions with Ronald to clarify the exact use cases and the kinds of datasets we deal with (pardon the hand-drawn look):

[Diagram image: 74BED21D-FA66-4D1E-AD5A-AEF1298278B1]

Adding it here for reference, to ensure we are on the same page.

This implies we will have two Django project codebases:

One can run, if needed (and as per the original diagram), as multiple instances behind a CDN/LB. It passes requests on to ES or PGSQL and returns the results, without itself making any changes to the data. This is the bibxml-service.
The other must run as a single instance only. It will host the async task runners that handle indexing. Hosting and supervising multiple Celery workers within the bounds of a single server makes things a bit less complex and more reliable, and the indexing workloads we’re dealing with should be fine with this approach. (Scaling across multiple instances, if it proves to be needed, can be done later at the Celery level.) This will reside in the new bibxml-indexer repository.

Currently, this notably does not cover the processes that fetch external datasets from their respective third-party locations into easier-to-access forms, such as GitHub repositories.

After that, converting those datasets from heterogeneous formats into consistent Relaton structures (and storing them in ES for search and in PGSQL for querying by reference) is taken care of by the indexer, which will include pluggable adapter modules to fetch and parse each dataset.
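To make the "pluggable adapter" idea a bit more concrete, here is a minimal sketch of how adapters could be registered per dataset. Everything in it is an assumption for illustration only: the names `RelatonItem`, `register_adapter`, `index_dataset` and the `example-rfcs` dataset ID are hypothetical, the record has just two fields, and fetching and the actual ES/PGSQL writes are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class RelatonItem:
    """Hypothetical, heavily simplified stand-in for a Relaton record."""
    ref: str    # reference ID, the key we would query by in PGSQL
    title: str  # free text, the kind of field we would index in ES


# An adapter turns raw entries of one dataset into normalized items.
Adapter = Callable[[Iterable[dict]], Iterator[RelatonItem]]

# Registry of adapters, keyed by a dataset ID (hypothetical naming).
_ADAPTERS: Dict[str, Adapter] = {}


def register_adapter(dataset_id: str) -> Callable[[Adapter], Adapter]:
    """Decorator that plugs an adapter into the registry."""
    def decorator(func: Adapter) -> Adapter:
        if dataset_id in _ADAPTERS:
            raise ValueError(f"Adapter for {dataset_id!r} already registered")
        _ADAPTERS[dataset_id] = func
        return func
    return decorator


@register_adapter("example-rfcs")
def parse_example_rfcs(raw: Iterable[dict]) -> Iterator[RelatonItem]:
    # Made-up source format: {"number": int, "name": str}
    for entry in raw:
        yield RelatonItem(ref=f"RFC{entry['number']}",
                          title=entry["name"].strip())


def index_dataset(dataset_id: str, raw: Iterable[dict]) -> List[RelatonItem]:
    """Normalize one dataset; a real indexer would then store the items."""
    try:
        adapter = _ADAPTERS[dataset_id]
    except KeyError:
        raise LookupError(f"No adapter registered for {dataset_id!r}") from None
    return list(adapter(raw))
```

With something like this, supporting a new dataset would mean adding one decorated function, and the task runner would only need to know the dataset ID.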

An open issue is that requests to describe a DOI standard will require an extra network trip to the DOI endpoint. This means we can time out through no fault of our own if that trip takes too long. Furthermore, we should implement throttling on our side and proactively time out in some cases to avoid unintentionally DoSing the DOI endpoint.
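As a rough illustration of the throttling and proactive timeout, here is a sketch using only the Python standard library. It is not a proposal for the final implementation: the class and exception names are hypothetical, the actual HTTP call is passed in as a `fetch` callable (so no particular DOI API is assumed), and the interval and timeout values are arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Optional


class UpstreamBusy(Exception):
    """We declined to call upstream in order to respect our own rate limit."""


class UpstreamTimeout(Exception):
    """Upstream did not answer within the time we are willing to wait."""


class ThrottledResolver:
    """Wraps a fetch callable with a minimum call interval and a timeout."""

    def __init__(self, fetch: Callable[[str], Any],
                 min_interval: float = 1.0, timeout: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._timeout = timeout
        self._clock = clock
        self._lock = threading.Lock()
        self._last_call: Optional[float] = None
        self._pool = ThreadPoolExecutor(max_workers=2)

    def resolve(self, doi: str) -> Any:
        # Throttle: fail fast instead of queueing up calls to upstream.
        with self._lock:
            now = self._clock()
            if (self._last_call is not None
                    and now - self._last_call < self._min_interval):
                raise UpstreamBusy(f"Too soon to resolve {doi!r}")
            self._last_call = now

        # Proactive timeout: stop waiting after self._timeout seconds.
        # The abandoned call keeps running in its worker thread, so the
        # fetch callable should also set its own network timeout.
        future = self._pool.submit(self._fetch, doi)
        try:
            return future.result(timeout=self._timeout)
        except FutureTimeout:
            future.cancel()
            raise UpstreamTimeout(f"Gave up resolving {doi!r}") from None
```

A view in the service could catch `UpstreamBusy` and `UpstreamTimeout` and turn them into appropriate error responses. Since the service may run as multiple instances, a per-process limit like this one would only be a first approximation; a shared limit would need shared state.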
