miller-center / cpc-issues

Connecting Presidential Collections

Develop a workflow for data #3

Open waldoj opened 10 years ago

waldoj commented 10 years ago

We scan things. We want them on a website as bulk data. How do we connect those two points? Keep in mind that:

waldoj commented 10 years ago

Let us assume that:

That produces 27 GB of images each day.

I think the most straightforward way to upload these images will be to push them from the machine on which they're stored, e.g., via scp. That can be scripted to run nightly, or run manually by the operator at the end of the day.
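A minimal sketch of that nightly push, assuming key-based SSH authentication so scp can run unattended; the host name, remote path, and local scan directory are placeholders:

```python
#!/usr/bin/env python
"""Nightly push of the day's scans to the web server via scp.

Host, paths, and the date-based directory layout are assumptions for
illustration; adjust to match the actual scanning workstation setup.
"""
import datetime
import subprocess

LOCAL_SCAN_ROOT = "/data/scans"          # where the scanner writes images (assumed)
REMOTE = "deploy@images.example.org"     # placeholder host
REMOTE_ROOT = "/srv/cpc/incoming"        # placeholder destination

def push_todays_scans():
    today = datetime.date.today().isoformat()
    local_dir = f"{LOCAL_SCAN_ROOT}/{today}"
    # -r copies the whole day's directory; key-based auth assumed, so no prompt
    subprocess.run(
        ["scp", "-r", local_dir, f"{REMOTE}:{REMOTE_ROOT}/"],
        check=True,
    )

if __name__ == "__main__":
    push_todays_scans()
```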

Then, on the server, run a nightly program to process those images into thumbnails and generate HTML pages to let people browse them.
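A hedged sketch of that nightly processing step, using Pillow for thumbnailing; the directory names and page layout are placeholders, not a settled design:

```python
"""Nightly job: thumbnail new scans and write a simple browse page.

Uses Pillow; directory names and the HTML layout are assumptions for
illustration only.
"""
import os
from PIL import Image

INCOMING = "/srv/cpc/incoming"   # where scp drops the scans (assumed)
THUMBS = "/srv/cpc/thumbs"       # generated thumbnails
SITE = "/srv/cpc/site"           # generated browse pages

def make_thumbnails():
    os.makedirs(THUMBS, exist_ok=True)
    entries = []
    for root, _dirs, files in os.walk(INCOMING):
        for name in sorted(files):
            if not name.lower().endswith((".tif", ".tiff", ".jpg", ".png")):
                continue
            src = os.path.join(root, name)
            thumb = os.path.join(THUMBS, os.path.splitext(name)[0] + ".jpg")
            if not os.path.exists(thumb):
                with Image.open(src) as img:
                    img.thumbnail((300, 300))            # bounding box, preserves aspect ratio
                    img.convert("RGB").save(thumb, "JPEG")
            entries.append((name, thumb))
    return entries

def write_index(entries):
    os.makedirs(SITE, exist_ok=True)
    # link targets assume the full-size images are published next to the page (illustrative)
    rows = "\n".join(
        f'<li><a href="{name}"><img src="{thumb}" alt="{name}"></a></li>'
        for name, thumb in entries
    )
    with open(os.path.join(SITE, "index.html"), "w") as fh:
        fh.write(f"<html><body><ul>\n{rows}\n</ul></body></html>\n")

if __name__ == "__main__":
    write_index(make_thumbnails())
```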

Of course, there is the question of where these data are to be stored. Although it merits getting pricing from ITS, S3 seems like the obvious choice. As a frame of reference, storing one month's output (540 GB) for one month, with 500,000 download requests from the public (at 2 MB/image), comes to $73.38/month.

With each passing month, this price increases accordingly. By the tenth month, assuming a steady output of files and a corresponding increase in viewership (though this latter assumption is admittedly unlikely), the cost becomes $655.56. So long as these images are hosted by the Miller Center, this monthly cost continues. The overwhelming majority of the monthly cost comes from the amount of data being hosted (5.4 TB). This cost could be lowered by storing the master copies of these files locally and treating the S3 copies as duplicates, which would let us use Amazon's reduced-redundancy storage and bring the price down to $562.73. (That's assuming 10 TB of downloads each month.)
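For transparency, here is a rough sketch of the cost model behind those figures; the per-GB rates are parameters rather than quoted AWS prices, since pricing changes over time, so substitute the current price sheet to reproduce the numbers above:

```python
def monthly_s3_cost(stored_gb, transferred_gb,
                    storage_rate_per_gb=0.03, transfer_rate_per_gb=0.12):
    """Rough monthly S3 cost: storage plus data transfer out.

    The default per-GB rates are placeholders, not quoted AWS prices."""
    return stored_gb * storage_rate_per_gb + transferred_gb * transfer_rate_per_gb

# Month 1: 540 GB stored, ~977 GB transferred (500k downloads at 2 MB each)
print(monthly_s3_cost(540, 500_000 * 2 / 1024))
# Month 10: 5.4 TB stored, ~10 TB transferred
print(monthly_s3_cost(5_400, 10_240))
```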

waldoj commented 10 years ago

Essentially, we're going to need to loop back internally after outputting these files. Ideally, our scanning application will permit us to store metadata within our filenames (e.g., [president]-[reel]-[image], as in hayes-012-0000192), but that metadata is going to need to go into some kind of asset management system. Fedora, I assume. I certainly hope that Fedora can handle storing images on S3, or on some sort of remote server, rather than actually storing all of these images on the same filesystem as Fedora.
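If that [president]-[reel]-[image] convention holds, pulling the metadata back out of a filename before ingest is straightforward; a sketch, with the extension list as an assumption:

```python
import re

# Matches the proposed [president]-[reel]-[image] convention,
# e.g. "hayes-012-0000192.tif"; the extension list is an assumption.
FILENAME_PATTERN = re.compile(
    r"^(?P<president>[a-z]+)-(?P<reel>\d+)-(?P<image>\d+)\.(tif|tiff|jpg|png)$",
    re.IGNORECASE,
)

def parse_scan_filename(filename):
    """Return {'president', 'reel', 'image'} parsed from a scan filename,
    or None if the name doesn't follow the convention."""
    match = FILENAME_PATTERN.match(filename)
    return match.groupdict() if match else None

print(parse_scan_filename("hayes-012-0000192.tif"))
# {'president': 'hayes', 'reel': '012', 'image': '0000192'}
```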

waldoj commented 10 years ago

Looks like Fedora is going to work OK:

Each datastream can be either managed directly by the repository or left in an external, web-accessible location to be delivered through the repository as needed.
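For reference, a hedged sketch of registering such an externally hosted datastream through Fedora 3's REST API (addDatastream with controlGroup=E); the base URL, credentials, PID, datastream ID, and image URL are all placeholders:

```python
import requests

FEDORA = "http://localhost:8080/fedora"   # placeholder Fedora 3.x base URL
AUTH = ("fedoraAdmin", "fedoraAdmin")     # placeholder credentials

def add_external_datastream(pid, dsid, image_url):
    """Add a datastream whose content stays at an external, web-accessible
    location (controlGroup 'E'), so Fedora stores only the reference."""
    response = requests.post(
        f"{FEDORA}/objects/{pid}/datastreams/{dsid}",
        params={
            "controlGroup": "E",
            "dsLocation": image_url,
            "mimeType": "image/tiff",
            "dsLabel": "Master scan image",
        },
        auth=AUTH,
    )
    response.raise_for_status()

add_external_datastream(
    "cpc:hayes-012-0000192", "MASTER",
    "https://example-bucket.s3.amazonaws.com/hayes-012-0000192.tif",
)
```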

I'd hoped that it might be possible to store everything in S3, but that's not possible in Fedora at present:

This plugin stores all object XML and datastream content on Amazon's Simple Storage System (S3) and serves as a good example of how a LowlevelStorage implementation can be built. However, we do not recommend the use of this plugin in production environments.... We plan to replace it with an Akubra S3 implementation in the near future.

On the other hand, we could host the site on EC2 and use S3 for storage by mounting the bucket on the EC2 instance and treating it as local storage.
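A minimal sketch of that arrangement, assuming s3fs-fuse (one common way to mount a bucket) is installed on the EC2 instance; the bucket name, mount point, and credentials file location are placeholders:

```python
import subprocess

BUCKET = "cpc-scans"              # placeholder bucket name
MOUNT_POINT = "/mnt/cpc-scans"    # placeholder mount point

def mount_bucket():
    """Mount the S3 bucket as a local filesystem via s3fs-fuse, so Fedora
    and the nightly processing jobs can treat it as ordinary disk."""
    subprocess.run(
        ["s3fs", BUCKET, MOUNT_POINT,
         "-o", "passwd_file=/etc/passwd-s3fs",   # credentials file (assumed location)
         "-o", "allow_other"],
        check=True,
    )

if __name__ == "__main__":
    mount_bucket()
```

A FUSE-mounted bucket is generally slower than local disk, which may matter for the nightly image processing, but it would keep Fedora's storage configuration simple.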

waldoj commented 10 years ago

There is another storage solution: the Internet Archive. I strongly suspect that they would provide the storage for free. They provide an AWS-like architecture that is in many ways compatible with software designed to interact with AWS, albeit at a different endpoint.
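As a sketch of what pushing scans to that endpoint could look like, using the internetarchive Python client; the item identifier scheme, metadata fields, and collection name are placeholders:

```python
# pip install internetarchive, then configure keys with `ia configure`
from internetarchive import upload

def push_reel_to_archive(president, reel, files):
    """Upload one reel's scans as an Internet Archive item.

    The identifier scheme and metadata fields are assumptions for
    illustration; IA's S3-like API sits behind this client."""
    identifier = f"cpc-{president}-reel-{reel}"   # placeholder naming scheme
    upload(
        identifier,
        files=files,
        metadata={
            "title": f"{president.title()} papers, reel {reel}",
            "mediatype": "image",
            "collection": "test_collection",      # placeholder collection
        },
    )

push_reel_to_archive("hayes", "012", ["hayes-012-0000192.tif"])
```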