learningtapestry / metadataregistry

DEPRECATED - THIS CODE BASE IS NO LONGER MAINTAINED. Metadata Registry
Apache License 2.0

Backup documents to archive.org #6

Closed aspino closed 8 years ago

aspino commented 8 years ago

I've been playing with the archive.org API, and this is what I've found so far:

aspino commented 8 years ago

According to the first spec we created, the idea was to create dump (or backup) files that group a number of envelope transactions together, ordered by date, so they can be played back later and allow a fresh node to get in sync. I'm reproducing that part of the spec document here, so that everyone is aware of the current proposal.

Dump file

Envelope transaction format

| Field | Description |
| --- | --- |
| doc_id | Original identifier of the document |
| status | One of created, updated or deleted |
| transaction_date | Date of the last transaction affecting the document |
| document | The original document (unless deleted) |
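
For illustration, the decoded JSON for a single envelope transaction might look like this (the values are hypothetical; the fields are the ones from the table above):

```json
{
  "doc_id": "urn:example:doc:123",
  "status": "created",
  "transaction_date": "2016-06-01T00:00:00Z",
  "document": { "name": "Some learning resource" }
}
```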

Another way of doing this would be to create daily, or perhaps weekly/monthly (this could be configured at the node level), dump files with the transaction logs for that interval. So instead of fixing the number of records in every dump file, I suggest we fix the time interval. I think this makes the backup/restore process a little more predictable and probably easier to implement as well. What do you think?

science commented 8 years ago

Per our conversation today, I'm proposing we use many more envelopes per file and store each envelope on a single line of a text file. To accomplish that, I propose we use Base64 encoding to convert each JSON envelope to a string that can be stored on a single line.

We had talked by voice about using JWT to do the Base64 encoding, but after talking with @jimklo I think we should just use plain old Base64 without all the JWT cruft. It's simpler in every way. Seem good?

science commented 8 years ago

I guess for purposes of standards, we're encoding each row using RFC 2045.
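
A minimal sketch of the one-envelope-per-line encoding in Ruby (the envelope contents and file name are made up). One nuance worth noting: Ruby's Base64.strict_encode64 follows RFC 4648 and emits no line breaks, which is what keeps each record on a single line, whereas RFC 2045 (MIME) encoding wraps the output every 76 characters.

```ruby
require 'base64'
require 'json'

# Hypothetical envelope transaction using the fields from the dump spec above.
envelope = {
  doc_id: 'urn:example:doc:123',
  status: 'created',
  transaction_date: '2016-06-01T00:00:00Z',
  document: { name: 'Some learning resource' }
}

# One JSON structure per line, encoded without line wrapping.
line = Base64.strict_encode64(JSON.generate(envelope))

# Append the encoded envelope to a (made-up) daily dump file.
File.open('dump-2016-06-01.txt', 'a') { |f| f.puts(line) }
```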

aspino commented 8 years ago

@science I'm OK with using just plain Base64 instead of JWT, but does this mean we're definitely discarding the JSON format for dump files, even if we find a suitable streaming library like https://github.com/brianmario/yajl-ruby?

science commented 8 years ago

Yes - we're saying that we're storing JSON data structures, one JSON structure per line, encoded in Base64. When I was arguing against streaming libraries, I wasn't trying to say that no quality streaming libraries exist. I was saying that there seem to be a fair number of terrible libraries out there in multiple languages. If that's the case, we could see some real problems when others try to access the archived files.

Keep in mind there are multiple use-cases for these files: one case is that we re-load the resources into LR 2.0. Another case is that an unknown user, using an unknown language, wants to download these files and load them into an unknown data system. I feel like the Base64 method is going to make this easier for them in general?

science commented 8 years ago

The bummer of Base64 is that it's a bit harder to do grep-y kinds of CLI stuff. That said, I think you can: `cat [archive.file] | base64 --decode | grep [pattern]`

aspino commented 8 years ago

Yes, I agree that using Base64 is the more flexible solution, because all you need to do is stream a text file line by line, and that's probably a solved problem in any current technology/environment.

But you're right that it's harder to peek inside the contents with tools like grep or ack and, overall, we lose readability. However, we could provide some guidance in the README for performing these low-level tasks, or even add some small scripts or utilities to mitigate the inconveniences introduced by the encoding.

By the way, here's a possible solution for grep-ing using bash: http://stackoverflow.com/questions/17214038/pipe-each-line-of-a-file-to-a-command. It seems you'd need to iterate over every line, but that's no big deal anyway.
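
A tiny Ruby sketch of that kind of utility (the script name and arguments are hypothetical): it decodes the dump one line at a time and prints the decoded records that match a pattern, which is roughly what the bash loop in the linked answer does.

```ruby
require 'base64'

# Usage (hypothetical): ruby grep_dump.rb dump-2016-06-01.txt created
dump_file, pattern = ARGV[0], Regexp.new(ARGV[1] || '.')

File.foreach(dump_file) do |line|
  # decode64 ignores the trailing newline, so each line decodes to one JSON record.
  decoded = Base64.decode64(line)
  puts decoded if decoded =~ pattern
end
```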

jimklo commented 8 years ago

I'd agree about the whole readability issue when using Base64 (or BaseN - 128 and 256 are becoming more popular too nowadays). Really I'd push for using a higher base only because you'll get better compression, just be sure to stay within the ANSI range.

Presumably, in an archive, things may not be ordered logically other than maybe by creation date; so while grep, sed, and awk are great tools, I don't think that's the way we want to be searching these archives in general. If you feel there is a real need to regularly search these encoded files, maybe better archive organization is needed, for example via some tags-as-file-or-folder-names mechanism.

If the point is archival, I wouldn't worry too much about readability, but more about the ability to decode and use the files. Losing that would be like encrypting the files with GPG and then not keeping the key around to decrypt them later.

I know folks aren't too worried about Base64 disappearing; however, to be safe, one should at least keep the source for a simple stream decoder, written entirely in one portable language, that can be stored with the archive.

science commented 8 years ago

Base64 is such a standard that I'd argue for using it, even if Base128 and Base256 compress better. I'd argue that Base64 + gzip will give very close to the same compression as Base256 + gzip (I haven't tested it!).
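
If anyone wants to check that claim, here's a rough sketch that compares the two pipelines on synthetic data. The record contents are made up, zlib's deflate stands in for gzip, and the raw JSON bytes stand in for a Base256-style encoding (which is essentially one byte per symbol).

```ruby
require 'base64'
require 'json'
require 'zlib'

# Synthetic sample: 1,000 small JSON records, one per line.
raw = Array.new(1_000) { |i|
  JSON.generate(doc_id: "doc-#{i}", status: 'created', document: { value: i })
}.join("\n")

# The same records, Base64-encoded one per line (the proposed dump format).
encoded = raw.each_line.map { |l| Base64.strict_encode64(l.chomp) }.join("\n")

puts "raw (Base256-ish) + deflate: #{Zlib::Deflate.deflate(raw).bytesize} bytes"
puts "Base64 + deflate:            #{Zlib::Deflate.deflate(encoded).bytesize} bytes"
```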

We can discuss with Archive how to provide decoding tools. Probably a gzip and Base64 decoder implemented in COBOL or BASIC would be the best!! :)

It doesn't seem like this approach will create any technical issues for the foreseeable future, and it seems quite simple to implement on our side, so :+1: from me.

aspino commented 8 years ago

A basic implementation of backup/restore to an external provider has landed in master (only archive.org is available at the moment). You can take a look at the documentation describing the file format and the associated rake tasks here: https://github.com/learningtapestry/learningregistry#backup-and-restore-using-an-external-provider

Caveats/Missing features:

science commented 8 years ago

Thanks. Regarding the backups for each community: I think it's vital that each metadata community is backed up to a distinct archive.org item/bucket. If all communities are going to the same item right now, I'd say it's a good proof of concept, but this ticket isn't finished. The need to deal with the ~1,000 files per item/bucket limit can be dealt with in the next ~1,000 days (after go-live), assuming we are backing up one day per file.

aspino commented 8 years ago

@science Splitting the backup files by community is relatively easy to implement. I can have it done within the week.

I think the assumption that only a single node will be backing up might not be accurate, because many different nodes could be uploading dump files to the same item and, in that case, the 1,000 days would be divided by the number of active nodes. So having 10 active nodes means the associated archive.org item will be over its recommended capacity in 100 days.

This "problem" seems to imply the need of some sort of higher level management (above nodes) in order to ensure items are not over its capacity, create new ones when they do, notify the nodes that the item has changed, etc. This would require having some kind of global application that takes care of all these tasks, which seems rather inconvenient to me. Instead I'd propose using some simple conventions that can simplify item management:

So, as you can see, my idea is to use naming conventions based on time intervals. If, for some reason, backing up daily becomes a requirement, or the number of nodes increases, we can create items spanning a single month (e.g. learning-registry-06-2016)... the time-based convention is the same and, since it's pretty simple to follow, item creation could be done manually in advance.
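
For illustration, a hypothetical helper that derives the item name from a date following this convention. The monthly format matches the learning-registry-06-2016 example above; the yearly variant is only an assumption about how a longer interval could be named.

```ruby
require 'date'

# Hypothetical helper: derive the archive.org item name for a given dump date.
def item_name_for(date, interval: :monthly)
  case interval
  when :monthly then format('learning-registry-%02d-%d', date.month, date.year)
  when :yearly  then "learning-registry-#{date.year}"
  else raise ArgumentError, "unknown interval: #{interval}"
  end
end

item_name_for(Date.new(2016, 6, 15))                     # => "learning-registry-06-2016"
item_name_for(Date.new(2016, 6, 15), interval: :yearly)  # => "learning-registry-2016"
```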

Let me know what you think.

aspino commented 8 years ago

After discussing the current options, we've decided that, for the time being, only a single-node scenario will be considered. My proposal from the last message might be useful if, in the future, many nodes are active, but otherwise you can safely ignore it.