bartdegoede commented 10 years ago

Backup and restore

This pull request contains changes to the API and ETL pipeline for supporting creating and restoring backups (also referred to as "dumps") of data collections, regardless of source and target. We can make dumps available as download, which can then easily be ingested into local API instances. For instance, it does not matter if the dump originates from another OCD API instance than your own, which makes it quicker for developers to get started.

CLI

A command line interface is included for creating dumps and loading dumps into the ETL-pipeline. There is support for downloading (for other API instances), creating (dumping your own collections), listing available collections, and loading data. For a more detailed usage explanation, please refer to the included documentation.

ETL-pipeline

The ETL-pipeline now provides an extractor for dumps. It uses the default transformer, as most of the functionality of marshalling items is contained within Item classes anyway. Keeping with this pattern, there is a LocalDumpItem that takes care of extracting the relevant fields from dumps. At this point, the Item is the same as all other items, so the same ElasticsearchLoader is used.

N.B.: In order to support the creation of the combined_index from collection-specific dumps, I had to include an extra field in each document containing the data from the combined index as string (which is not indexed). Unfortunately, this means that before we can run this "in production", we have to reindex all data once this pull request is merged.

Documentation

As briefly mentioned above, this pull request also contains usage documentation, as well as API documentation for the CLI.

Any comments, questions and improvements are appreciated :-)

Tests, tests and some more tests

@breyten, @justinvw: I took the liberty to provide a test-suite layout within the context of this pull request. It can be activated using the script @breyten initially used (thus not breaking Travis). The general idea is that there is a directory for the main packages (ocd_backend and ocd_frontend), which contains directories which contain test cases per module. In general, each module's __init__.py contains a base testcase, which can be extended for each submodule that is written. For example, in tests.ocd_backend.items.__init__ a ItemTestCase is defined, which is subsequently extended in tests.ocd_backend.items.localdump, providing item-specific tests.

I slightly modified the run_tests.sh script, in order to include all Python files, not only the ones that are touched during testing.

Testcases should be created for every Item that is already available in the repository, and I would even go as far as proposing to only accept pull-requests which provide testcases. Using this layout, it should be relatively easy to spot missing tests, and maintain existing tests, as the layout of the tests mirrors the project layout.

justinvw commented 10 years ago

I went over the code, and overall I think it looks really nice. Some minor points that we should fix before this can merged:

Right now the /dumps endpoint only returns the filenames of the available dumps. Changing it to return the absolute URLs of the files would increase flexibility (a server instance could dynamically change to location of files, for instance when moving to a storage provider like S3) and makes retrieving the dumps easier for end-users (they would only have to specify one URL).
Implementing the previous point would remove DUMP_URL and fix the function calls in manage.py where dump_url isn't passed (see my line annotations).
The extension of the file that contains the hash should be .gz.sha1, since the content of this file says something about the gzip.
I found some small typos in the documentation (see my line comments in docs/backup_and_restore.rst).
Maybe you could also briefly document the path and URL settings and mention the ability to override these via the CLI?

I like the organization of the test-suite, in particular the ability to target specific parts of the codebase (backend, frontend and modules within them). Eventually we may have to add another distinction (between unit and integration tests), but if @breyten doesn't have any objections against the current structure, I think the current layout looks fine for now.

@breyten, can you create the DNS record for hosting the dumps (dumps.opencultuurdata.nl) and point it to the Open State server that currently hosts the API?

breyten commented 10 years ago

This is quite a big one, so let me concentrate on the test thing for now ;)

I like the organization to seperate the tests between the packages (frontend and backend), which totally makes sense :-)
besides the nose tests runner, nose actually makes it easier and quicker to create test cases by not requiring that each test case be an object. It can also deal with setup/teardown across test cases by splitting them into files and having one setup/teardown for each file
I am deeply conflicted about creating test cases for each item (itemextractor, really), for a couple of reasons:
- Huge maintenance costs -- creating test cases costs time, maintaining them costs even more
- I wonder about the effectiveness of all of this, since we're running it against an external service. Or otherwise said, if we're storing json data in our repo, it is still very possible that the tests will run ok, but that the extraction process will go wrong because the remote api changed. (I'm not even sure if having item tests will help us determine if our source if faulty or if the remote api changed, because the tests data is local, and therefore may not be the same as the data of the object on the api). ( I hope you can follow the reasoning :P )

Kind of a brain dump. Any thoughts? :)

breyten commented 10 years ago

As for reinexing all content, I have no problem with this. (I think it's better than writing a one-time migration script)

breyten commented 10 years ago

Oh, by the way, the item tests for local dump and restore is essintial, by the way (Since we're in control of both angles this makes sens) :P

bartdegoede commented 10 years ago

I think I do follow your reasoning, and I think we should distinguish between unit tests and integration tests. We do need integration tests (which are kind of beyond the scope of this particular pull request), and we should think about how to structure those properly.

I agree that running tests for most items (although not the new local items ;-)) from a unit test perspective (i.e. without accessing external services) seems a bit pointless. However, I do think we should create them anyway, and use them as a part of our integration tests, since there is a lot of custom logic in the items, and I think it would make sense to require them for pull requests for new collections, as the contributor would know best what expected behaviour of items is, especially when it comes to things as rights and semantics of certain fields (date being my favourite example; is it a creation_date, publication_date, or birth_date?).

TLDR; the tests should be created in such a way it is easy to set raw_item and item attributes with actual values in integration tests, preferably by the original authors, as they probably know best why they put certain values in certain fields.

With regard to the nose test runner; I prefer more formal objects to "just a bunch of methods", as it forces a more structured approach. Also, setUp and tearDown are managed on the file level you mention anyway, as there is generally one TestCase per Extractor, Transformer, Loader and Item anyway (with the exception of the LocalDumpItem, as that is a special case).

bartdegoede commented 10 years ago

Fixed comments from @justinvw, merging into dev

breyten commented 10 years ago

:+1: on the merge

breyten commented 10 years ago

Hmm I get what you mean. I'll write one tomorrow with an item unit test :P

openstate / open-cultuur-data