openstate / open-cultuur-data

The back- and front-end code that powers the Open Cultuur Data API
http://opencultuurdata.nl/
28 stars 18 forks source link

Backup and restore #67

Closed bartdegoede closed 10 years ago

bartdegoede commented 10 years ago

Backup and restore

This pull request contains changes to the API and ETL pipeline for supporting creating and restoring backups (also referred to as "dumps") of data collections, regardless of source and target. We can make dumps available as download, which can then easily be ingested into local API instances. For instance, it does not matter if the dump originates from another OCD API instance than your own, which makes it quicker for developers to get started.

CLI

A command line interface is included for creating dumps and loading dumps into the ETL-pipeline. There is support for downloading (for other API instances), creating (dumping your own collections), listing available collections, and loading data. For a more detailed usage explanation, please refer to the included documentation.

ETL-pipeline

The ETL-pipeline now provides an extractor for dumps. It uses the default transformer, as most of the functionality of marshalling items is contained within Item classes anyway. Keeping with this pattern, there is a LocalDumpItem that takes care of extracting the relevant fields from dumps. At this point, the Item is the same as all other items, so the same ElasticsearchLoader is used.

N.B.: In order to support the creation of the combined_index from collection-specific dumps, I had to include an extra field in each document containing the data from the combined index as string (which is not indexed). Unfortunately, this means that before we can run this "in production", we have to reindex all data once this pull request is merged.

Documentation

As briefly mentioned above, this pull request also contains usage documentation, as well as API documentation for the CLI.

Any comments, questions and improvements are appreciated :-)

Tests, tests and some more tests

@breyten, @justinvw: I took the liberty to provide a test-suite layout within the context of this pull request. It can be activated using the script @breyten initially used (thus not breaking Travis). The general idea is that there is a directory for the main packages (ocd_backend and ocd_frontend), which contains directories which contain test cases per module. In general, each module's __init__.py contains a base testcase, which can be extended for each submodule that is written. For example, in tests.ocd_backend.items.__init__ a ItemTestCase is defined, which is subsequently extended in tests.ocd_backend.items.localdump, providing item-specific tests.

I slightly modified the run_tests.sh script, in order to include all Python files, not only the ones that are touched during testing.

Testcases should be created for every Item that is already available in the repository, and I would even go as far as proposing to only accept pull-requests which provide testcases. Using this layout, it should be relatively easy to spot missing tests, and maintain existing tests, as the layout of the tests mirrors the project layout.

justinvw commented 10 years ago

I went over the code, and overall I think it looks really nice. Some minor points that we should fix before this can merged:

I like the organization of the test-suite, in particular the ability to target specific parts of the codebase (backend, frontend and modules within them). Eventually we may have to add another distinction (between unit and integration tests), but if @breyten doesn't have any objections against the current structure, I think the current layout looks fine for now.

@breyten, can you create the DNS record for hosting the dumps (dumps.opencultuurdata.nl) and point it to the Open State server that currently hosts the API?

breyten commented 10 years ago

This is quite a big one, so let me concentrate on the test thing for now ;)

Kind of a brain dump. Any thoughts? :)

breyten commented 10 years ago

As for reinexing all content, I have no problem with this. (I think it's better than writing a one-time migration script)

breyten commented 10 years ago

Oh, by the way, the item tests for local dump and restore is essintial, by the way (Since we're in control of both angles this makes sens) :P

bartdegoede commented 10 years ago

I think I do follow your reasoning, and I think we should distinguish between unit tests and integration tests. We do need integration tests (which are kind of beyond the scope of this particular pull request), and we should think about how to structure those properly.

I agree that running tests for most items (although not the new local items ;-)) from a unit test perspective (i.e. without accessing external services) seems a bit pointless. However, I do think we should create them anyway, and use them as a part of our integration tests, since there is a lot of custom logic in the items, and I think it would make sense to require them for pull requests for new collections, as the contributor would know best what expected behaviour of items is, especially when it comes to things as rights and semantics of certain fields (date being my favourite example; is it a creation_date, publication_date, or birth_date?).

TLDR; the tests should be created in such a way it is easy to set raw_item and item attributes with actual values in integration tests, preferably by the original authors, as they probably know best why they put certain values in certain fields.

With regard to the nose test runner; I prefer more formal objects to "just a bunch of methods", as it forces a more structured approach. Also, setUp and tearDown are managed on the file level you mention anyway, as there is generally one TestCase per Extractor, Transformer, Loader and Item anyway (with the exception of the LocalDumpItem, as that is a special case).

bartdegoede commented 10 years ago

Fixed comments from @justinvw, merging into dev

breyten commented 10 years ago

:+1: on the merge

breyten commented 10 years ago

Hmm I get what you mean. I'll write one tomorrow with an item unit test :P