openworm / owmeta

Unified, simple data access python library for data & facts about C. elegans anatomy
MIT License

alpha0.5: re-integrate the data repo #54

Closed slarson closed 9 years ago

slarson commented 10 years ago

Versioning the data outside of the library is the opposite of what we want to achieve here. We want the data version tightly coupled to the library as a feature, because then users know what data they have in a given version of the library. Decoupling it will create confusion and mismatches between the data version and the library version, and is also likely to make queries fail. This is also the reason not to use an externally hosted database that may change over time. We are following a "data as code" idea here.

Please bring back any data you've put outside this repo or let me know where those calls are and I'll do it in my branch.

mwatts15 commented 10 years ago

The data repo was not separated without good reason. The data is, to a reasonable extent, schema-less. As long as the EAV (entity-attribute-value) structure embodied in dataObject.py does not change, the data will remain accessible across revisions.

Moreover, git submodules (one is used to relate PyOpenWorm to the data repo) pin to a specific commit in the submodule. In the current common case, the data revision will be exactly matched to the code. In the (hopefully near-) future case of multiple shared data repositories, it will be impossible to make guarantees about every repository beyond those made by the repository owner.
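To illustrate the pinning described above, here is a minimal Python sketch that reads the output of `git submodule status` and reports whether each submodule checkout matches the commit recorded in the parent repository. The helper names (`SubmoduleStatus`, `parse_submodule_status`, `read_submodule_status`) and the `data` path in the usage note are hypothetical, chosen for illustration; they are not part of PyOpenWorm. The sketch assumes submodule paths contain no whitespace.

```python
import subprocess
from typing import List, NamedTuple


class SubmoduleStatus(NamedTuple):
    state: str   # human-readable meaning of the status prefix
    commit: str  # SHA-1 of the commit currently checked out in the submodule
    path: str    # submodule path relative to the parent repository


# Prefix characters documented for `git submodule status`.
_STATES = {
    " ": "in sync",
    "+": "differs from pinned commit",
    "-": "not initialized",
    "U": "merge conflict",
}


def parse_submodule_status(line: str) -> SubmoduleStatus:
    """Parse one line of `git submodule status` output.

    The first character is the status prefix; the commit and the path
    follow. Assumes the path contains no whitespace.
    """
    flag, rest = line[0], line[1:]
    fields = rest.split()
    return SubmoduleStatus(_STATES[flag], fields[0], fields[1])


def read_submodule_status(repo_dir: str = ".") -> List[SubmoduleStatus]:
    """Run `git submodule status` in repo_dir and parse every line."""
    out = subprocess.run(
        ["git", "submodule", "status"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    # Do not strip the lines: a leading space is itself a status prefix.
    return [parse_submodule_status(l) for l in out.splitlines() if l]
```

For example, calling `parse_submodule_status` on a line that starts with a space, followed by a 40-character SHA-1 and the path `data`, yields the state `"in sync"`, meaning the checked-out data revision is the one the parent commit pins.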

slarson commented 10 years ago

> The data repo was not separated without good reason.

@mwatts15 Could you provide more of a justification for why it needs to be a separate repo? I haven't seen this documented anywhere and we never had a conversation about it.

mwatts15 commented 10 years ago

Generally, it's about separation of concerns. More specifically, changes to the data would create unnecessary noise in commit logs and patches for PyOpenWorm. By putting the repo in a git submodule, it's easy to avoid this problem and nothing is lost in doing it.

slarson commented 9 years ago

Doing the re-integration in the alpha0.5-slarson branch for now. Was disposed to try out the submodule approach and was making modifications to the data in the submodule. Then tried to re-point the submodule repo towards another one under the openworm org and all hell broke loose with syncing, causing a lot of wasted time trying to figure out how to reattach the changes I had made in the repo. Plus there are plenty of potential issues with regard to maintaining this down the road.

The statement "nothing is lost in doing it" is false once you consider the additional complexity it adds to the maintenance of this repository. It became immediately obvious that tests on the data itself needed to be created, and then a whole new infrastructure became necessary: hooking the repo up to Travis, implementing tests, implementing setup.py. None of that is needed when the data is integrated with the main repo. The view of this repo is that patching and upgrading the data is not unnecessary noise; in fact, it is exactly what we should be doing here.