psychoinformatics-de / datalad-hirni

DataLad extension for (semi-)automated, reproducible processing of (medical/neuro)imaging data
http://datalad.org

why does the disk-usage double for an installed sourcedata? #182

Open pvavra opened 3 years ago

pvavra commented 3 years ago

I am really puzzled by the internal workings of datalad/hirni for imported tarballs. Specifically, I do not understand where the used disk-space comes from.

Initial setup:

datalad create bids
datalad run-procedure -d bids cfg_bids
datalad create sourcedata
datalad run-procedure -d sourcedata cfg_hirni

Now I have a tarball of 2.4GB which I import:

datalad hirni-import-dcm -d sourcedata some_dicoms.tar acq1
du -sh sourcedata # gives 2.4G

but if I install that and get the data, it doubles:

datalad install -d bids -r -s sourcedata
datalad get -d bids/sourcedata -r
du -sh bids/sourcedata # gives 5.7GB

Shouldn't the installed dataset simply be a copy of the original sourcedata at this point (except for some metadata, like git remotes being defined, etc.)?

What is going on in the background here?

bpoldrack commented 3 years ago

At first glance, I'd say your datalad get -d bids/sourcedata -r doesn't only get the data of your hirni dataset but also the containers of the toolbox (sourcedata/code/hirni-toolbox), since it's a recursive operation.

pvavra commented 3 years ago

No, that accounts for less than 1GB.

The "offending" path is acq1/dicoms with 4.8GB.

bpoldrack commented 3 years ago

Seems to suggest that you get the archives, too. But get shouldn't do that ...

I'll look into this (and #183) tomorrow.

Edit: Note to myself: Yes, it does! That's a feature, but with a non-obvious side effect for anyone not regularly using the "3-branch" approach (not even myself, obviously). We need a convenient and obvious way to drop! Plus: probably an import option that doesn't do this in the first place, since it's only relevant for archive updates.

bpoldrack commented 3 years ago

Okay, to elaborate on my hasty remarks above, here comes the explanation, @pvavra. I'll follow up with a post about what you can do, what we might do in hirni to improve the user experience, and why it's built like that in the first place.

When you import an archive, it gets annexed in a dedicated branch incoming; its extracted content goes into incoming-processed and is ultimately merged into your user branch (master). Furthermore, the extracted content is annexed in a particular way: the datalad-archives special remote registers the (already annexed) archive as a source for the extracted files. Any of the DICOMs can therefore be (re-)retrieved either by extracting them from that archive via this special remote OR by getting them from any other remote. This means that for a fresh clone with no remotes other than origin, there are two registered sources that can be used to get the DICOM files: the remote you cloned from (origin) and the datalad-archives special remote pointing into the archive.
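As an illustration, a quick way to make that branch layout visible; this is a sketch assuming the import created the three branches inside the DICOM subdataset (e.g. sourcedata/acq1/dicoms):

cd sourcedata/acq1/dicoms
git branch                       # should list incoming, incoming-processed and the user branch (e.g. master)
git log --oneline --graph --all  # shows how the import and extraction commits were merged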

Now, in your case you have sourcedata/acq1/dicoms as the remote origin for your bids/sourcedata/acq1/dicoms. The former has the DICOM files dropped and only kept the archive. Therefore the first option is not available, and the only known source in your clone (bids/sourcedata/acq1/dicoms) is the datalad-archives special remote, which knows it can extract the files from the archive. So this special remote tells annex: "Hey, I need that archive." Annex then gets the archive from its only known source (origin), and the datalad-archives special remote extracts the DICOMs from it. As a result, you now have both the archive and the DICOM files in your clone bids/sourcedata/acq1/dicoms.
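To see both sources in a concrete clone and to free the space again afterwards, a sketch (the file name is made up, and the exact key and remote labels are whatever whereis reports for your dataset):

cd bids/sourcedata/acq1/dicoms
git annex whereis some_dicom_image.dcm   # should list origin plus the datalad-archives special remote,
                                         # the latter typically with a dl+archive URL containing the archive's key
# after such a get has pulled the tarball in as a side effect, it can be dropped again by its key:
git annex drop --key <ARCHIVE-KEY>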

bpoldrack commented 3 years ago

The reason for that approach is that this way it is possible to connect to pretty much any kind of "authoritative"/backup system for the imported archives. By importing from there via a URL, one can keep that connection and drop everything in the datasets, while maintaining complete provenance and reproducibility from this precious raw-data backup system (or whatever an institution may have). Although not really conveniently supported by the import command ATM, it also allows for archive updates, if there was something wrong with the original one, while your master branch would only show the updates with respect to the DICOM files. Finally, if one needs to get the data (which will likely be all of the DICOMs, say for conversion), it may be faster to get the (compressed) archive and extract it rather than transferring all the extracted files. Whether or not that's true depends, of course, on a bunch of constraints one may have.

Now, what can you do ATM:

What hirni can/should do:

bpoldrack commented 3 years ago

If you have additional thoughts on what would be nice to have in hirni in that regard, I'm happy to hear that, @pvavra!

pvavra commented 3 years ago

I think I understand the design choices behind the incoming and incoming-processed branches, and they make sense to me. That is, no issues jump out at me right away.

For my use case, the ideal scenario would be the following, I think: during hirni-import-dcm I specify the URL as ssh://some_server:/path/on/server/some_dicoms.tar and handle the SSH credentials via .ssh/config (so they definitely do not get committed into the history, even by accident). (Note: this does not work yet.)
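In other words, the wished-for call would look something like this (host and path made up; as said, this is not supported yet):

datalad hirni-import-dcm -d sourcedata ssh://some_server:/path/on/server/some_dicoms.tar acq1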

Ideally, I think, the retrieved tarball should not be kept in sourcedata (or at least give me a flag to disable keeping a copy). That is, it should call the suggested drop of the tarball by default.

Rationale: Since datalad is handling sourcedata, I will probably never consider that dataset the true "raw data" which should be archived. Whatever is imported into it needs to be archived "before" importing. But then we do not need to keep the data in sourcedata (annexing it for the purpose of checksum calculation is good; that way, I can know whether the file changed at some later stage in the archive).
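For what it's worth, a sketch of such a later consistency check; it assumes the MD5E key backend that datalad configures by default, that the imported tarball is tracked on the incoming branch, and a made-up path on the archive system:

git checkout incoming                            # the branch where the imported tarball is tracked
git annex lookupkey some_dicoms.tar              # the key embeds the tarball's MD5, e.g. MD5E-s<size>--<md5>.tar
md5sum /path/on/archive-system/some_dicoms.tar   # compare against the <md5> part of the key
git checkout -                                   # back to the previous branch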

The unpacked files, however, should stay in the working tree, for further processing - as is done atm.

Installing that sourcedata dataset into bids (or anywhere else) should never get the tarball, unless the extracted files are missing and cannot be retrieved from anywhere else (at least anywhere locally). And when the tarball is indeed retrieved in order to extract the DICOMs, it should get dropped again automatically, since the unpacked DICOMs are then available in bids.

I guess the (remote) tarball should work as a special git-annex remote. Then we would see git annex whereis [...]/some_dicom_image.dcm point (among other sources) to something like [tarball on server xxx]. On the incoming branch, only the info necessary for accessing the remote should be kept, I think. But I'm not sure whether a branch is needed for this at all.
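With plain git-annex, a rough approximation of that idea could be to register the server-side copy as an additional URL for the tarball's key (sketch only, with made-up names; whether get can actually use an ssh:// URL depends on a remote claiming that URL scheme, which is part of what is missing today):

git annex registerurl <TARBALL-KEY> ssh://some_server:/path/on/server/some_dicoms.tar
git annex whereis --key <TARBALL-KEY>    # the server-side copy would then show up as a registered URL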

Now, as mentioned above, importing from an ssh:// URL isn't working yet. At the moment, I use the workaround mentioned in that issue, i.e. rsync the data onto a scratch partition and import the file from there (see the sketch below). But even in this scenario, I don't consider sourcedata my raw data and would not rely on it for keeping the only copy of the unprocessed data. So keeping a copy of the tarball is not necessary, I think (or alternatively, keeping only the tarball, as is currently the case, from what I understood). In either case, any subsequent datalad install of the sourcedata should "hide" away any complexity related to remotes and have only the unpacked DICOMs visible to the user (as is the case, I think). And a datalad get should in no case copy over both the tarball and the unpacked files. There is no scenario I can think of where both would be needed.
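A sketch of that workaround, with made-up host and paths:

rsync some_server:/path/on/server/some_dicoms.tar /scratch/
datalad hirni-import-dcm -d sourcedata /scratch/some_dicoms.tar acq1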

bpoldrack commented 3 years ago

Note: autodrop of archive after get likely to be solved in datalad: https://github.com/datalad/datalad/issues/5519

pvavra commented 3 years ago

Thanks for opening the issue for the defaults over at datalad. However, I think I have another issue with how the datalad-archives special remote works: repeated datalad get acq1 and "full drop" cycles (cf. https://github.com/psychoinformatics-de/datalad-hirni/issues/183#issuecomment-804720940) increase the size of the repo.

This stems from the .git/objects folder growing. Inspecting it suggests that this is because some MD5....log.web files are being added.
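For reference, a sketch of how that inspection can be done; the *.log.web records live in the git-annex branch, which is stored in .git/objects like any other branch:

du -sh .git/objects
git count-objects -vH                                          # overall size of the object store
git ls-tree -r --name-only git-annex | grep -c '\.log\.web$'   # per-key URL log files currently in the git-annex branch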

To me, this looks like getting the data via the datalad-archives special remote commits a log entry, probably one per file. In my particular (but probably relatively typical) case, this means that the git repo grows by approx. 100MB for each repeated get/drop loop.

  1. I would not expect that (-> something for the docs ;)).
  2. Is there a way of preventing that?
  3. I don't think this should be happening for hirni datasets in the first place.

Assuming I identified the reason correctly, my third point has the following rationale: many scanners record single-volume DICOMs (or even slice-wise ones), and hence a normal acquisition can easily have many thousands of files. Multiply that by 40 participants and maybe 2-3 sessions, and we are about two orders of magnitude larger than the 100MB in this test case. Extrapolating this to my project, I expect a single "get/drop" loop on the full dataset would temporarily add 8GB of git history. Note that this is logged under bids/sourcedata/../dicoms, which is not the original sourcedata DICOMs repo. On the final step of datalad uninstall -d bids sourcedata, everything there would get lost anyway.

In the meantime, I assume that any subsequent datalad status calls (and similar) would take substantially longer, especially if I have to use the -r flag for something other than status.

Crucially, this history is completely irrelevant to provenance tracking, I think. I do not care how the data gets into bids at that level; I only care about tracking the checksums of the actual files, i.e. knowing which version of the files was used for conversion. The underlying "annex plumbing" (the datalad-archives special remote here, I think) is like an OS operation for me: I assume it works, but I do not need to log it (unless I'm debugging, maybe).