hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License

Feature request: dtar writes multiple files like tar | split pipeline #504

Open cessenat opened 2 years ago

cessenat commented 2 years ago

Hello mpifileutils developers,

When tar'ing very big directories, the tar file may exceed the maximum file size the system accepts (typically 500 GB). This is why one can pipe tar into split, such as `tar --verbose --create dir | split -d -b 512000m - fictar`.

I tried dtar, which is already remarkable on my PC, but I failed to do the same thing. I could only write a single file: `mpirun -np 8 dtar --create --file fictar.tar dir`. By the way, the --verbose option had to be removed, which contradicts the man page.

Do you think my request is reasonable? Does it make sense?

Thanks a lot.

Olivier Cessenat

adammoody commented 2 years ago

Thanks for your suggestion, @cessenat. I agree that would be a nice addition.

I think we'd need to code this support directly into dtar. We could add something like a -b 512G option to dtar. Computing the output file(s) and corresponding file offsets where each process needs to write its content should be straightforward. The more complicated bit would be to modify the code that actually writes to the output file. We'd need to break the write operations at part-file boundaries appropriately.
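
Roughly, the per-write arithmetic would look something like this (just a sketch; `part_size` and `write_at_global_offset` are made-up names, not existing dtar code):

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch only: map a global archive offset to a part file and split
 * a write wherever it crosses a part boundary. Not dtar code. */

/* size of each output part, e.g., 512GB */
static const uint64_t part_size = 512ULL * 1024 * 1024 * 1024;

/* "write" buf[0..count) at the given offset in the logical archive,
 * breaking the operation at every part-file boundary */
static void write_at_global_offset(uint64_t global_offset,
                                   const char* buf, uint64_t count)
{
    while (count > 0) {
        uint64_t part_index  = global_offset / part_size;
        uint64_t part_offset = global_offset % part_size;

        /* bytes left before the end of this part file */
        uint64_t remaining = part_size - part_offset;
        uint64_t nbytes    = (count < remaining) ? count : remaining;

        /* a real implementation would pwrite() nbytes from buf to the
         * descriptor open on part file part_index here */
        printf("write %llu bytes to part %llu at offset %llu\n",
               (unsigned long long)nbytes,
               (unsigned long long)part_index,
               (unsigned long long)part_offset);

        global_offset += nbytes;
        buf           += nbytes;
        count         -= nbytes;
    }
}

int main(void)
{
    /* a 20-byte write that starts 10 bytes before a part boundary
     * gets split into two writes, one per part */
    char buf[20] = {0};
    write_at_global_offset(part_size - 10, buf, sizeof(buf));
    return 0;
}
```

Each write either lands entirely within one part or gets broken at the boundary, which is the "more complicated bit" above.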

While at it, it would also be useful to create separate dsplit/dcat tools to efficiently split large files into parts and to concatenate a list of files back into one. Those wouldn't help with the case where the output archive is too large for the file system, but they could be useful to people who already have a large file.

adilger commented 2 years ago

@adammoody, IMHO, it is much more complex to add support for splitting a single file across archives, since, as you wrote, it would need to handle the case of a single input file being written into two different tarballs. It would be simpler to have the tree traversal appropriately group files into tarballs that are <= the "split" size (I believe it already knows each file's size in advance, so working this out isn't very complex), and separate input files can be processed in parallel to different output files (probably improving performance as well). A sketch of that grouping pass follows.
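
Something like this greedy pass over the walk results is what I have in mind (illustrative only; `file_entry` and `group_into_archives` are made-up names, and tar header/padding overhead is ignored):

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative only: greedily assign files, in walk order, to
 * archives that stay at or below max_archive_size. Ignores tar
 * header and padding overhead for simplicity. */

struct file_entry {
    const char* name;
    uint64_t    size;
};

static void group_into_archives(const struct file_entry* files,
                                size_t nfiles, uint64_t max_archive_size)
{
    uint64_t archive_bytes = 0;
    int archive_index = 0;

    for (size_t i = 0; i < nfiles; i++) {
        /* start a new archive if this file would overflow the current
         * one; a file larger than the limit ends up alone in its own
         * archive, since entries are never split in this scheme */
        if (archive_bytes > 0 &&
            archive_bytes + files[i].size > max_archive_size) {
            archive_index++;
            archive_bytes = 0;
        }
        archive_bytes += files[i].size;
        printf("%s -> archive %d\n", files[i].name, archive_index);
    }
}

int main(void)
{
    struct file_entry files[] = {
        { "a.dat", 300 }, { "b.dat", 300 }, { "c.dat", 500 },
    };
    /* with a limit of 512 bytes, each file lands in its own archive */
    group_into_archives(files, 3, 512);
    return 0;
}
```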

This would also simplify the untar operation, since each tarball would be a proper standalone file that could be moved/extracted (serially or multiple files in parallel), without first having to assemble it into a single piece, or depending on all the other parts to be available.

adammoody commented 2 years ago

Thanks, @adilger. Yes, my first thought jumped to creating multiple, smaller output archives by grouping files appropriately. That's a bit more open-ended, since there could be multiple ways to group files into the archive files that people might want, e.g., pack items by subdirectory vs. pack items by best fit.

The specific request of extending dtar to provide something equivalent to a tar | split command is not as flexible, but it is better defined. As you note, the resulting part files would not be valid archives individually. They would need to be concatenated to be extracted (or at least logically concatenated). On the flip side, this would be compatible with what people currently do with split and cat. It would also support the case where the desired part size happens to be less than the size of one of the constituent files.
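
For the extraction side, "logically concatenated" could mean a reader that treats the part files as one stream, something like this fragment (hypothetical names; it assumes every part except the last has the same fixed size):

```c
#include <stdint.h>
#include <unistd.h>

/* Fragment, hypothetical names: read across part files as if they
 * were one concatenated archive, without running cat first. Assumes
 * every part except the last has the same fixed size. */

static const uint64_t part_size = 512ULL * 1024 * 1024 * 1024;

/* read up to count bytes starting at a global offset in the logical
 * archive; fds[i] is an open descriptor for part file i */
ssize_t read_at_global_offset(const int* fds, uint64_t global_offset,
                              char* buf, size_t count)
{
    size_t total = 0;
    while (count > 0) {
        uint64_t part_index  = global_offset / part_size;
        uint64_t part_offset = global_offset % part_size;
        uint64_t remaining   = part_size - part_offset;
        size_t   nbytes = (count < remaining) ? count : (size_t)remaining;

        ssize_t got = pread(fds[part_index], buf, nbytes,
                            (off_t)part_offset);
        if (got <= 0)
            break;              /* error or end of data */

        total         += (size_t)got;
        global_offset += (uint64_t)got;
        buf           += got;
        count         -= (size_t)got;
    }
    return (ssize_t)total;
}
```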

I can imagine that both of these features would be useful: a direct replacement for tar | split, and an option to produce multiple output archive files all below some max size.