learningtapestry / metadataregistry

DEPRECATED - THIS CODE BASE IS NO LONGER MAINTAINED. Metadata Registry
Apache License 2.0

Backup documents to archive.org #6

Closed aspino closed 8 years ago

aspino commented 8 years ago

I've been playing with the archive.org API, and this is what I've found so far:

aspino commented 8 years ago

According to the first spec we created, the idea was to create dump (or backup) files that group a number of envelope transactions together, ordered by date, so they can be played back later and allow a fresh node to get in sync. I'm reproducing that part of the spec document here, so that everyone is aware of the current proposal.

Dump file

Envelope transaction format

| Field | Description |
| --- | --- |
| doc_id | Original identifier of the document |
| status | One of created, updated or deleted |
| transaction_date | Date of the last transaction affecting the document |
| document | The original document (unless deleted) |
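
For illustration, the decoded JSON for a single envelope transaction might look like this (the values are hypothetical; the fields are the ones from the table above):

```json
{
  "doc_id": "urn:example:doc:123",
  "status": "created",
  "transaction_date": "2016-06-01T00:00:00Z",
  "document": { "name": "Some learning resource" }
}
```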

Another way of doing this would be to create daily, or perhaps weekly/monthly (this could be configured at the node level), dump files with the transaction logs for that interval. So instead of fixing the number of records in every dump file, I suggest we fix the time interval. I think this makes the backup/restore process a little more predictable and probably easier to implement as well. What do you think?

science commented 8 years ago

Per our conversation today, I'm proposing we use many more envelopes per file and store each envelope on a single line of a text file. To accomplish that, I propose we use Base64 encoding to convert each JSON envelope to a string that can be stored on a single line.

We had talked by voice about using JWT to do the Base64 encoding, but after talking with @jimklo I think we should just use plain old Base64 without all the JWT cruft. It's simpler in every way. Seem good?

science commented 8 years ago

I guess for purposes of standards, we're encoding each row using RFC 2045.
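
A minimal sketch of the one-envelope-per-line encoding in Ruby (the envelope contents and file name are made up). One nuance worth noting: Ruby's Base64.strict_encode64 follows RFC 4648 and emits no line breaks, which is what keeps each record on a single line, whereas RFC 2045 (MIME) encoding wraps the output every 76 characters.

```ruby
require 'base64'
require 'json'

# Hypothetical envelope transaction using the fields from the dump spec above.
envelope = {
  doc_id: 'urn:example:doc:123',
  status: 'created',
  transaction_date: '2016-06-01T00:00:00Z',
  document: { name: 'Some learning resource' }
}

# One JSON structure per line, encoded without line wrapping.
line = Base64.strict_encode64(JSON.generate(envelope))

# Append the encoded envelope to a (made-up) daily dump file.
File.open('dump-2016-06-01.txt', 'a') { |f| f.puts(line) }
```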

aspino commented 8 years ago

@science I'm OK with using just plain Base64 instead of JWT, but does this mean we're definitely discarding the JSON format for dump files, even if we find a suitable streaming library like https://github.com/brianmario/yajl-ruby?

science commented 8 years ago

Yes - we're saying that we're storing JSON data structures, one JSON structure per line, encoded in Base64. When I was arguing against streaming libraries, I wasn't trying to say that no quality streaming libraries exist. I was saying that there seem to be a fair number of terrible libraries out there in multiple languages. If that's the case, we could see some real problems when others try to access the archived files.

Keep in mind there are multiple use-cases for these files: one case is that we re-load the resources into LR 2.0. Another case is that an unknown user, using an unknown language, wants to download these files and load them into an unknown data system. I feel like the Base64 method is going to make this easier for them in general?

science commented 8 years ago

The bummer of Base64 is that it's a bit harder to do grep-y kinds of CLI stuff. That said, I think you can: `cat [archive.file] | base64 --decode | grep [pattern]`

aspino commented 8 years ago

Yes, I agree that using Base64 is the more flexible solution, because all you need to do is stream a text file line by line, and that's probably a solved problem in any current technology/environment.

But you're right that it's harder to peek inside the contents with tools like grep or ack and, overall, we lose readability. However, we could provide some guidance in the README for performing these low-level tasks, or even add some small scripts or utilities to mitigate the inconveniences introduced by the encoding.

By the way, here's a possible solution for grep-ing using bash: http://stackoverflow.com/questions/17214038/pipe-each-line-of-a-file-to-a-command. It seems you'd need to iterate over every line, but that's no big deal anyway.
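
A tiny Ruby sketch of that kind of utility (the script name and arguments are hypothetical): it decodes the dump one line at a time and prints the decoded records that match a pattern, which is roughly what the bash loop in the linked answer does.

```ruby
require 'base64'

# Usage (hypothetical): ruby grep_dump.rb dump-2016-06-01.txt created
dump_file, pattern = ARGV[0], Regexp.new(ARGV[1] || '.')

File.foreach(dump_file) do |line|
  # decode64 ignores the trailing newline, so each line decodes to one JSON record.
  decoded = Base64.decode64(line)
  puts decoded if decoded =~ pattern
end
```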

jimklo commented 8 years ago

I'd agree about the whole readability issue when using Base64 (or BaseN - 128 and 256 are becoming more popular too nowadays). Really I'd push for using a higher base only because you'll get better compression, just be sure to stay within the ANSI range.

Presumably, in an archive, things may not be ordered logically other than maybe by creation date; so while grep, sed, and awk are great tools, I don't think that's the way we want to be searching these archives in general. If you feel there is a real need to regularly search these encoded files, maybe better archive organization is needed, for example via some tags-as-file-or-folder-names mechanism.

If the point is archival, I wouldn't worry too much about readability, but more about the ability to decode and use the files. Losing that would be like encrypting the files with GPG and then not keeping the key around to decrypt them later.

I know folks aren't too worried about Base64 disappearing; however, to be safe, one should at least keep the source for a simple stream decoder, written entirely in one portable language, that can be stored with the archive.

science commented 8 years ago

Base64 is such a standard that I'd argue for using it, even if Base128 and Base256 compress better. I'd argue that Base64 + gzip will give very close to the same compression as Base256 + gzip (I haven't tested it!).
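
If anyone wants to check that claim, here's a rough sketch that compares the two pipelines on synthetic data. The record contents are made up, zlib's deflate stands in for gzip, and the raw JSON bytes stand in for a Base256-style encoding (which is essentially one byte per symbol).

```ruby
require 'base64'
require 'json'
require 'zlib'

# Synthetic sample: 1,000 small JSON records, one per line.
raw = Array.new(1_000) { |i|
  JSON.generate(doc_id: "doc-#{i}", status: 'created', document: { value: i })
}.join("\n")

# The same records, Base64-encoded one per line (the proposed dump format).
encoded = raw.each_line.map { |l| Base64.strict_encode64(l.chomp) }.join("\n")

puts "raw (Base256-ish) + deflate: #{Zlib::Deflate.deflate(raw).bytesize} bytes"
puts "Base64 + deflate:            #{Zlib::Deflate.deflate(encoded).bytesize} bytes"
```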

We can discuss with Archive how to provide decoding tools. Probably a gzip and Base64 decoder implemented in COBOL or BASIC would be the best!! :)

It doesn't seem like this approach will create any technical issues for the foreseeable future, and it seems quite simple to implement on our side, so :+1: from me.

aspino commented 8 years ago

A basic implementation of backup/restore to an external provider has landed in master (only archive.org is available at the moment). You can take a look at the documentation describing the file format and the associated rake tasks here: https://github.com/learningtapestry/learningregistry#backup-and-restore-using-an-external-provider

Caveats/Missing features:

science commented 8 years ago

Thanks. Regarding the backups for each community: I think it's vital that each metadata community is backed up to a distinct archive.org item/bucket. If all communities are going to the same item right now, I'd say it's a good proof of concept, but this ticket isn't finished. The need to deal with the ~1,000 files per item/bucket limit can be dealt with in the next ~1,000 days (after go-live), assuming we are backing up one day per file.

aspino commented 8 years ago

@science Splitting the backup files by community is relatively easy to implement. I can have it done within the week.

I think the assumption that only a single node will be backing up might not be accurate, because many different nodes could be uploading dump files to the same item and, in that case, the 1,000 days would be divided by the number of active nodes. So having 10 active nodes means the associated archive.org item will be over its recommended capacity in 100 days.

This "problem" seems to imply the need of some sort of higher level management (above nodes) in order to ensure items are not over its capacity, create new ones when they do, notify the nodes that the item has changed, etc. This would require having some kind of global application that takes care of all these tasks, which seems rather inconvenient to me. Instead I'd propose using some simple conventions that can simplify item management:

So, as you can see, my idea is to use naming conventions based on time intervals. If, for some reason, backing up daily becomes a requirement, or the number of nodes increases, we can create items spanning a single month (e.g. learning-registry-06-2016)... the time-based convention is the same and, since it's pretty simple to follow, item creation could be done manually in advance.
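
For illustration, a hypothetical helper that derives the item name from a date following this convention. The monthly format matches the learning-registry-06-2016 example above; the yearly variant is only an assumption about how a longer interval could be named.

```ruby
require 'date'

# Hypothetical helper: derive the archive.org item name for a given dump date.
def item_name_for(date, interval: :monthly)
  case interval
  when :monthly then format('learning-registry-%02d-%d', date.month, date.year)
  when :yearly  then "learning-registry-#{date.year}"
  else raise ArgumentError, "unknown interval: #{interval}"
  end
end

item_name_for(Date.new(2016, 6, 15))                     # => "learning-registry-06-2016"
item_name_for(Date.new(2016, 6, 15), interval: :yearly)  # => "learning-registry-2016"
```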

Let me know what you think.

aspino commented 8 years ago

After discussing the current options, we've decided that, for the time being, only a single-node scenario will be considered. My proposal from the last message might be useful if, in the future, many nodes are active, but otherwise you can safely ignore it.