Closed bartdegoede closed 10 years ago
I went over the code, and overall I think it looks really nice. Some minor points that we should fix before this can be merged:
- The `/dumps` endpoint only returns the filenames of the available dumps. Changing it to return the absolute URLs of the files would increase flexibility (a server instance could dynamically change the location of the files, for instance when moving to a storage provider like S3) and would make retrieving the dumps easier for end-users (they would only have to specify one URL).
- Use a `DUMP_URL` setting for this, and fix the function calls in `manage.py` where `dump_url` isn't passed (see my line annotations).
- Name the checksum file `.gz.sha1`, since the content of this file says something about the gzip.
- Mention this in the documentation (`docs/backup_and_restore.rst`).

I like the organization of the test-suite, in particular the ability to target specific parts of the codebase (backend, frontend, and the modules within them). Eventually we may have to add another distinction (between unit and integration tests), but if @breyten doesn't have any objections against the current structure, I think the current layout looks fine for now.
@breyten, can you create the DNS record for hosting the dumps (`dumps.opencultuurdata.nl`) and point it to the Open State server that currently hosts the API?
This is quite a big one, so let me concentrate on the test thing for now ;)
Kind of a brain dump. Any thoughts? :)
As for reindexing all content, I have no problem with this. (I think it's better than writing a one-time migration script.)
Oh, by the way, the item tests for local dump and restore are essential (since we're in control of both ends, this makes sense) :P
I think I do follow your reasoning, and I think we should distinguish between unit tests and integration tests. We do need integration tests (which are kind of beyond the scope of this particular pull request), and we should think about how to structure those properly.
I agree that running tests for most items (although not the new local items ;-)) from a unit test perspective (i.e. without accessing external services) seems a bit pointless. However, I do think we should create them anyway, and use them as part of our integration tests, since there is a lot of custom logic in the items. It would also make sense to require them for pull requests for new collections, as the contributor knows best what the expected behaviour of an item is, especially when it comes to things such as rights and the semantics of certain fields (`date` being my favourite example; is it a `creation_date`, a `publication_date`, or a `birth_date`?).
TL;DR: the tests should be created in such a way that it is easy to set the `raw_item` and `item` attributes with actual values in integration tests, preferably by the original authors, as they probably know best why they put certain values in certain fields.
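A minimal sketch of how such an item test could look. The `DummyItem` class, its field names, and the date value below are assumptions for illustration only, not code from this pull request:

```python
import unittest


class DummyItem(object):
    """Stand-in for an OCD item class (hypothetical); real items map
    raw source data onto normalized fields."""

    def __init__(self, raw_item):
        self.raw_item = raw_item

    def get_date(self):
        # A real item decides whether the source date is a
        # creation_date, a publication_date, or a birth_date.
        return self.raw_item["publication_date"]


class DummyItemTestCase(unittest.TestCase):
    def setUp(self):
        # The original author supplies an actual raw_item value here,
        # documenting the expected semantics of each field.
        self.raw_item = {"publication_date": "2014-05-01"}
        self.item = DummyItem(self.raw_item)

    def test_date_semantics(self):
        self.assertEqual(self.item.get_date(), "2014-05-01")
```

Because the fixture values are supplied by the collection's author, the test doubles as documentation of what each field is expected to contain.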
With regard to the nose test runner: I prefer more formal objects to "just a bunch of methods", as it forces a more structured approach. Also, `setUp` and `tearDown` are managed at the file level you mention anyway, as there is generally one `TestCase` per `Extractor`, `Transformer`, `Loader` and `Item` (with the exception of the `LocalDumpItem`, as that is a special case).
Fixed comments from @justinvw, merging into `dev`
:+1: on the merge
Hmm I get what you mean. I'll write one tomorrow with an item unit test :P
Backup and restore
This pull request contains changes to the API and ETL pipeline to support creating and restoring backups (also referred to as "dumps") of data collections, regardless of source and target. We can make dumps available as downloads, which can then easily be ingested into local API instances. It does not matter whether a dump originates from your own OCD API instance or from another one, which makes it quicker for developers to get started.
CLI
A command line interface is included for creating dumps and loading dumps into the ETL-pipeline. There is support for downloading (for other API instances), creating (dumping your own collections), listing available collections, and loading data. For a more detailed usage explanation, please refer to the included documentation.
ETL-pipeline
The ETL-pipeline now provides an extractor for dumps. It uses the default transformer, as most of the functionality for marshalling items is contained within the Item classes anyway. Keeping with this pattern, there is a `LocalDumpItem` that takes care of extracting the relevant fields from dumps. At this point, the Item is the same as all other items, so the same `ElasticsearchLoader` is used.

N.B.: In order to support the creation of the `combined_index` from collection-specific dumps, I had to include an extra field in each document containing the data from the combined index as a string (which is not indexed). Unfortunately, this means that before we can run this "in production", we have to reindex all data once this pull request is merged.

Documentation
As briefly mentioned above, this pull request also contains usage documentation, as well as API documentation for the CLI.
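The extra non-indexed field mentioned in the ETL-pipeline section above can be sketched as follows. The field name `combined_index_data` and the Elasticsearch 1.x-style `"index": "no"` mapping setting are assumptions for illustration; the actual names used in the pull request may differ:

```python
import json

# Hypothetical mapping sketch: the combined-index document is stored
# verbatim as a string, excluded from the inverted index, so a
# collection-specific dump carries everything needed to rebuild the
# combined index.
mapping = {
    "properties": {
        # Regular, searchable fields of the item...
        "title": {"type": "string"},
        # ...plus the serialized combined-index document, not indexed
        # ("index": "no" in ES 1.x-era mappings).
        "combined_index_data": {"type": "string", "index": "no"},
    }
}

# When restoring a dump, the stored string can be parsed back into a
# document for the combined index:
doc = {"combined_index_data": json.dumps({"title": "Example item"})}
combined_doc = json.loads(doc["combined_index_data"])
```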
Any comments, questions and improvements are appreciated :-)
Tests, tests and some more tests
@breyten, @justinvw: I took the liberty of providing a test-suite layout within the context of this pull request. It can be activated using the script @breyten initially used (thus not breaking Travis). The general idea is that there is a directory for each of the main packages (`ocd_backend` and `ocd_frontend`), which contains directories with test cases per module. In general, each module's `__init__.py` contains a base testcase, which can be extended for each submodule that is written. For example, in `tests.ocd_backend.items.__init__` an `ItemTestCase` is defined, which is subsequently extended in `tests.ocd_backend.items.localdump`, providing item-specific tests.

I slightly modified the `run_tests.sh` script to include all Python files, not only the ones that are touched during testing.

Testcases should be created for every Item that is already available in the repository, and I would even go as far as proposing to only accept pull requests which provide testcases. Using this layout, it should be relatively easy to spot missing tests and maintain existing tests, as the layout of the tests mirrors the project layout.
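The base-testcase idea described above can be sketched roughly as follows. The class bodies and the fixture value are illustrative assumptions; the real `ItemTestCase` in the pull request will carry project-specific fixtures:

```python
import unittest


# tests/ocd_backend/items/__init__.py (sketch)
class ItemTestCase(unittest.TestCase):
    """Base testcase with fixtures shared by all item tests."""

    def setUp(self):
        # Subclasses provide an actual raw_item value.
        self.raw_item = None


# tests/ocd_backend/items/localdump.py (sketch)
class LocalDumpItemTestCase(ItemTestCase):
    def setUp(self):
        super(LocalDumpItemTestCase, self).setUp()
        self.raw_item = '{"title": "Example"}'

    def test_raw_item_is_set(self):
        self.assertIsNotNone(self.raw_item)
```

Because each test module mirrors a source module, a missing test file stands out immediately when browsing the `tests` tree.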