psychoinformatics-de / inm-icf-utilities

MIT License
0 stars 5 forks source link

Tar file generation includes too many path components #53

Closed mslw closed 2 months ago

mslw commented 3 months ago

The docstring for make_studyvisit_archive says:

positional arguments:
  <input-dir>           Directory with the files to place into the visit
                        archive. The input base directory itself is not put
                        into the archive (i.e., its own name is irrelevant).
                        Instead, a top-level directory with the name '<study-
                        id>_<visit_id>' is used to place all archive content
                        in.

Yet, the path within tarfile is generated as follows:

https://github.com/psychoinformatics-de/inm-icf-utilities/blob/38b5783cdb7b0e1c50077eaa469819a61108090e/bin/make_studyvisit_archive#L97-L99

i.e. only the top level directory is stripped from tinfo.name. This is fine if the DICOMs are in /tmp or any other top level directory, but if the input dir is elsewhere (e.g. /home/me/test-data/something) then all but the first component will make it into the path within tarfile. What we need is to take a path relative to the input dir and prepend one level.

mslw commented 3 months ago

Looking at it again, the positional argument is called input-dir and the docstring says "The input base directory itself is not put into the archive (i.e., its own name is irrelevant)." It is a little ambiguous (or is it?) what the base directory is (e.g., for foo/bar/baz, is it foo or baz / the entire thing)?

But I still think the current behavior is at least suboptimal. This is how I first noticed:

❱ tree ~/Documents/dicom-samples/dicomdir
/home/mszczepanik/Documents/dicom-samples/dicomdir
└── 01
    └── MR000000.dcm

❱ python ~/Documents/inm-icf-utilities/bin/make_studyvisit_archive ~/Documents/dicom-samples/dicomdir -o /tmp/exports --id foo baz

❱ tar -tvf /tmp/exports/foo/baz_dicom.tar
-rw-r--r-- root/root  10823982 2024-04-26 12:40 foo_baz/mszczepanik/Documents/dicom-samples/dicomdir/01/MR000000.dcm

Initially I wrote:

This is fine if the DICOMs are in /tmp or any other top level directory

Not even that; I think it is only "fine" when relative paths are used. Notice that dicomdir is included when using an absolute path /tmp/dicomdir/:

❱ tree /tmp/dicomdir
/tmp/dicomdir
└── 01
    └── MR000000.dcm

❱ python ~/Documents/inm-icf-utilities/bin/make_studyvisit_archive /tmp/dicomdir -o /tmp/exports --id foo bar

❱ tar -tvf /tmp/exports/foo/bar_dicom.tar
-rw-r--r-- root/root  10823982 2024-04-26 12:40 foo_bar/dicomdir/01/MR000000.dcm

compare to a relative path:

❱ cd /tmp

❱ python ~/Documents/inm-icf-utilities/bin/make_studyvisit_archive dicomdir -o /tmp/exports --id foo bar2

❱ tar -tvf exports/foo/bar2_dicom.tar
-rw-r--r-- root/root  10823982 2024-04-26 12:40 foo_bar2/01/MR000000.dcm

I think the top-level directory in the tarball can be set as it is now, but all paths inside should be relative to the <input-dir> (a bit like tar -C, just with one extra directory). This will, however, be a behavior change for all but one-level paths provided as <input-dir>, as shown above.

One potential downside I can think of is removing some flexibility (e.g. what if I have /tmp/datastore/scans/dicomdir[1-10]/... and want only dicomdir3 but I also want paths to include datastore/scans/) - but I think the increased consistency is more important.

christian-monch commented 3 months ago

I @mslw, thanks for catching this. I worked on a PR, because I did not realize you have already opened one. If it is OK with you, I would add the regression test of my PR to your PR.