ubsicap / dbl-archive-data-storage

Portable data storage layer for DBL and related tools
MIT License

Update data layout in S3 #1

Closed smorrison closed 5 years ago

smorrison commented 5 years ago

In anticipation of large entry types, we need to revisit how we deal with S3 (in terms of how we manage the data for entries/revisions).

All entries in S3 are data items with keys that match the template <entry type>/<uid>/rev<n>/.... Each new revision contains a complete copy of the entry's data. This was fine when we had a few hundred text entries (roughly 50MB per revision), but as we move to ingesting more data we need to consider the amount of waste we're generating.
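For illustration, here is a minimal Python sketch of the current key template; the entry type, uid, and file paths are hypothetical, and the point is simply that every revision carries a full copy of every file, changed or not.

```python
# Illustrative only: the uid and paths below are hypothetical, not real DBL data.
def revision_key(entry_type, uid, rev, relative_path):
    """Build a key of the form <entry type>/<uid>/rev<n>/<path>."""
    return "{0}/{1}/rev{2}/{3}".format(entry_type, uid, rev, relative_path)

# Under the current layout, rev2 stores a full copy of this file even if it is
# byte-for-byte identical to the copy already in rev1.
print(revision_key("text", "3a5c8e7d9f0b1c2d", 1, "release/USX_1/GEN.usx"))
print(revision_key("text", "3a5c8e7d9f0b1c2d", 2, "release/USX_1/GEN.usx"))
```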

It seems intuitively obvious (which is to say, not necessarily true) that we duplicate a lot of data with typical uploads. Past the initial upload of an entry, most revisions look very much like the previous revision.

I would like a system where any given revision only contains the files that have changed in that revision (relative to the previous revision). I imagine the key structure of the entries would look very much like it does today, except that entry revisions would be "sparse", with much of their content actually held in a prior revision.
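To make the idea concrete, here is a rough sketch (my reading of the idea, not a spec) of how a reader could resolve a file under such a sparse layout: look in the requested revision first, then fall back through earlier revisions until the file is found. The list_keys helper is hypothetical.

```python
# Sketch only; list_keys is a hypothetical callable returning the S3 keys
# under a given prefix (e.g. wrapping a paged boto listing).
def resolve(entry_type, uid, rev, relative_path, list_keys):
    """Return the key that actually holds relative_path as of revision rev,
    falling back to earlier revisions when the file was not re-uploaded."""
    for r in range(rev, 0, -1):
        prefix = "{0}/{1}/rev{2}/".format(entry_type, uid, r)
        key = prefix + relative_path
        if key in set(list_keys(prefix)):
            return key
    raise KeyError("%s not present in revisions 1..%d" % (relative_path, rev))
```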

The S3 data is accessed independently by 3 different systems, so any change to the layout will require coordinated releases of those 3 systems. All are written in Python (a mix of Python 2 and 3).

I believe that we need to do the following:

mvahowe commented 5 years ago

Some initial thoughts:

klassenjm commented 5 years ago

This is not in any way to dismiss or sideline Mark's concern about partners accessing S3 directly. I just want to mention that the only party I am aware of doing this (or having been given keys to do this) is API.Bible. It was done for the reasons Bryce pointed out -- that in order to present API.Bible as a resource which DBL license holders could work through reliably, it needed a very efficient manner of remaining up to date with all DBL data. DBL itself and its API could never handle that load in its current architecture.

We need to maintain API.Bible as a dependable end-user application API for DBL data, and we need a means of maintaining it that is consistent with how we want to manage DBL data.

mvahowe commented 5 years ago

Briefly, my issue with the justification for Bryce's current workflow is where he says it needs to work "as fast as possible". Well, yes, but we need to define what "possible" means. For example, it would be faster for us to push content directly into his data structures, but I'm pretty sure he wouldn't agree to that, even though it's definitely "possible". The reality is that, however we update api.bible, it's not going to happen faster than the user can hit "refresh" after uploading a new revision. So, at that point, we need to tell the user that "this might take a while", and I think we can come up with a definition of "a while" that makes sense architecturally.

Back to the story... the Great Demon of the Generic whispered to me in my sleep, and I'm now convinced we can solve this problem exactly the same way both for the DBL server and for DBL.Local. This would be nice for several reasons, not least that it means we can test the system logic locally and on copies of the data before entrusting the definitive S3 production records to it. I'm going to write a report with lots of pictures to explain the cunning plan.

smorrison commented 5 years ago

Initial design discussion document from @mvahowe: entry_storage_model.pdf

smorrison commented 5 years ago

@mvahowe (Mark) gives 4 options (including the current implementation). FYI, "Copy-Forward" is the strategy used by the current uploader implementation.

The "sparse-storage" and "per-entry resource sharing" models are the only two that consider resource duplication.

Some general comments:

IMO, one scary thing about the pooled-resources model is that if we lose or corrupt the metadata manifests for a revision, we have no hope of recovery.
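To spell out the concern, a pooled model (as I understand it; the format below is purely illustrative) would reduce each revision to little more than a manifest that points at content-addressed keys in a shared pool:

```python
# Illustrative shape only, not an agreed manifest format.
# The revision itself stores only this mapping; the bytes live under
# content-addressed keys shared by every revision that references them.
manifest_for_rev2 = {
    "release/USX_1/GEN.usx": "pool/sha256/6b86b273ff34fce19d6b804e...",
    "release/USX_1/EXO.usx": "pool/sha256/d4735e3a265e16eee03f5971...",
}
# Losing or corrupting this mapping leaves only anonymous pool objects, with
# nothing to say which bytes belonged to which revision under which name.
```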

In some respects, it doesn't matter how we arrange the data on "disk" as long as we have a stable interface via a maintained Python module. That module will be the official documentation. I suspect that api.bible will want/need a Node.js interface, but the Python module should stand as the reference implementation.
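As a strawman, the module's surface might look something like this; the class and method names are placeholders, not a proposal for the actual API:

```python
# Strawman interface sketch; names and signatures are placeholders.
class EntryStore(object):
    """Stable access layer over the S3 layout, whatever that layout ends up being."""

    def list_revisions(self, entry_type, uid):
        """Return the revision numbers that exist for an entry."""
        raise NotImplementedError

    def list_resources(self, entry_type, uid, rev):
        """Return the logical resource paths visible in a given revision."""
        raise NotImplementedError

    def open_resource(self, entry_type, uid, rev, relative_path):
        """Return a file-like object for one resource in one revision."""
        raise NotImplementedError
```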

The important design goals are:

My bias is toward the sparse-storage model, but I'm not dogmatic about it. Further refinement of implementation details could help (e.g. how do we identify groups of resources for pre-2.0 revisions, and how do we deal with identically named but different resources?).

smorrison commented 5 years ago

This was pre-meeting material. I've closed this issue and created 3 epics (so far), one each for Creating, Packaging, and Migrating.