ubsicap / dbl-archive-data-storage

Portable data storage layer for DBL and related tools
MIT License

Update data layout in S3 #1

Closed smorrison closed 5 years ago

smorrison commented 5 years ago

In anticipation of large entry types, we need to revisit how we deal with S3 (in terms of how we manage the data for entries/revisions).

All entries in S3 are data items with keys that match the template <entry type>/<uid>/rev<n>/.... Each new revision contains a complete copy of the entry's data. This was fine when we had a few hundred text entries (roughly 50MB per revision), but as we move to ingesting more data we need to consider the amount of waste we're generating.
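For illustration, here is a minimal Python sketch of the current key template; the entry type, uid, and file paths are hypothetical, and the point is simply that every revision carries a full copy of every file, changed or not.

```python
# Illustrative only: the uid and paths below are hypothetical, not real DBL data.
def revision_key(entry_type, uid, rev, relative_path):
    """Build a key of the form <entry type>/<uid>/rev<n>/<path>."""
    return "{0}/{1}/rev{2}/{3}".format(entry_type, uid, rev, relative_path)

# Under the current layout, rev2 stores a full copy of this file even if it is
# byte-for-byte identical to the copy already in rev1.
print(revision_key("text", "3a5c8e7d9f0b1c2d", 1, "release/USX_1/GEN.usx"))
print(revision_key("text", "3a5c8e7d9f0b1c2d", 2, "release/USX_1/GEN.usx"))
```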

It seems intuitively obvious (which is to say, not necessarily true) that we duplicate a lot of data with typical uploads. Past the initial upload of an entry, most revisions look very much like the previous revision.

I would like a system where any given revision only contains the files that have changed in that revision (relative to the previous revision). I imagine the key structure of the entries would look very much like it does today, except that entry revisions would be "sparse", with much of their content actually held in a prior revision.
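To make the idea concrete, here is a rough sketch (my reading of the idea, not a spec) of how a reader could resolve a file under such a sparse layout: look in the requested revision first, then fall back through earlier revisions until the file is found. The list_keys helper is hypothetical.

```python
# Sketch only; list_keys is a hypothetical callable returning the S3 keys
# under a given prefix (e.g. wrapping a paged boto listing).
def resolve(entry_type, uid, rev, relative_path, list_keys):
    """Return the key that actually holds relative_path as of revision rev,
    falling back to earlier revisions when the file was not re-uploaded."""
    for r in range(rev, 0, -1):
        prefix = "{0}/{1}/rev{2}/".format(entry_type, uid, r)
        key = prefix + relative_path
        if key in set(list_keys(prefix)):
            return key
    raise KeyError("%s not present in revisions 1..%d" % (relative_path, rev))
```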

The S3 data is accessed independently by 3 different systems, so any change to the layout will require coordinated releases of those 3 systems. All are written in Python (a mix of Python 2 and 3).

I believe that we need to do the following:

mvahowe commented 5 years ago

Some initial thoughts:

klassenjm commented 5 years ago

This is not in any way to dismiss or sideline Mark's concern about partners accessing S3 directly. I just want to mention that the only party I am aware of doing this (or having been given keys to do this) is API.Bible. It was done for the reasons Bryce pointed out -- that in order to present API.Bible as a resource which DBL license holders could work through reliably, it needed a very efficient manner of remaining up to date with all DBL data. DBL itself and its API could never handle that load in its current architecture.

We need to maintain API.Bible as a dependable end-user application API for DBL data, and we need a means of maintaining it that is consistent with how we want to manage DBL data.

mvahowe commented 5 years ago

Briefly, my issue with the justification for Bryce's current workflow is where he says it needs to work "as fast as possible". Well, yes, but we need to define what "possible" means. For example, it would be faster for us to push content directly into his data structures, but I'm pretty sure he wouldn't agree to that, even though it's definitely "possible". The reality is that, however we update api.bible, it's not going to happen faster than the user can hit "refresh" after uploading a new revision. So, at that point, we need to tell the user that "this might take a while", and I think we can come up with a definition of "a while" that makes sense architecturally.

Back to the story... the Great Demon of the Generic whispered to me in my sleep, and I'm now convinced we can solve this problem exactly the same way both for the DBL server and for DBL.Local. This would be nice for several reasons, not least that it means we can test the system logic locally and on copies of the data before entrusting the definitive S3 production records to it. I'm going to write a report with lots of pictures to explain the cunning plan.

smorrison commented 5 years ago

Initial design discussion document from @mvahowe: entry_storage_model.pdf

smorrison commented 5 years ago

@mvahowe (Mark) gives 4 options (including the current implementation). FYI, "Copy-Forward" is the strategy used by the current uploader implementation.

The "sparse-storage" and "per-entry resource sharing" models are the only two that consider resource duplication.

Some general comments:

IMO, one scary thing about the pooled-resources model is that if we lose or corrupt the metadata manifests for a revision, we have no hope of recovery.
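To spell out the concern, a pooled model (as I understand it; the format below is purely illustrative) would reduce each revision to little more than a manifest that points at content-addressed keys in a shared pool:

```python
# Illustrative shape only, not an agreed manifest format.
# The revision itself stores only this mapping; the bytes live under
# content-addressed keys shared by every revision that references them.
manifest_for_rev2 = {
    "release/USX_1/GEN.usx": "pool/sha256/6b86b273ff34fce19d6b804e...",
    "release/USX_1/EXO.usx": "pool/sha256/d4735e3a265e16eee03f5971...",
}
# Losing or corrupting this mapping leaves only anonymous pool objects, with
# nothing to say which bytes belonged to which revision under which name.
```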

In some respects, it doesn't matter how we arrange the data on "disk" as long as we have a stable interface via a maintained Python module. That module will be the official documentation. I suspect that api.bible will want/need a Node.js interface, but the Python module should stand as the reference implementation.
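As a strawman, the module's surface might look something like this; the class and method names are placeholders, not a proposal for the actual API:

```python
# Strawman interface sketch; names and signatures are placeholders.
class EntryStore(object):
    """Stable access layer over the S3 layout, whatever that layout ends up being."""

    def list_revisions(self, entry_type, uid):
        """Return the revision numbers that exist for an entry."""
        raise NotImplementedError

    def list_resources(self, entry_type, uid, rev):
        """Return the logical resource paths visible in a given revision."""
        raise NotImplementedError

    def open_resource(self, entry_type, uid, rev, relative_path):
        """Return a file-like object for one resource in one revision."""
        raise NotImplementedError
```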

The important design goals are:

My bias is toward the sparse-storage model, but I'm not dogmatic about it. Further refinement of implementation details could help (e.g. how do we identify groups of resources for pre-2.0 revisions, and how do we deal with identically named but different resources?).

smorrison commented 5 years ago

This was pre-meeting material. I've closed this issue and created 3 epics (so far), one each for Creating, Packaging, and Migrating.