Open · jermp opened 11 months ago
Hi Giulio,
The current pipeline is unfortunately not well optimized for intermediate disk usage, but there are some easy fixes. High disk usage is expected, since we dump the intermediate uncompressed color matrix to disk (which is not even gzipped). The other issue is that the current version on GitHub only supports up to 128 colors (I realize this constraint is not documented anywhere).
I am currently working on fixing both issues. I have an experimental implementation that supports a larger number of colors; I will test whether it works on this dataset and then update the repo with the fixes.
Thanks, Amatur
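(As an aside, both issues above can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes, as a guess not confirmed by the repo, that each k-mer's color set is stored as one fixed-width 128-bit word, which would directly explain the 128-color cap, and that the uncompressed matrix costs one bit per (k-mer, color) pair; the k-mer count is a hypothetical figure for illustration only.)

```python
def color_matrix_bytes(num_kmers: int, num_colors: int) -> int:
    """Disk size of an uncompressed bit matrix: one row per k-mer,
    one bit per color, each row padded to whole bytes."""
    bytes_per_row = (num_colors + 7) // 8
    return num_kmers * bytes_per_row

# With a fixed 16-byte (128-bit) row, a dataset with more than 128 colors
# simply cannot be represented -- consistent with the undocumented limit.
# Hypothetical scale, for illustration only:
print(color_matrix_bytes(100_000_000, 128))  # 1.6 GB even at the 128-color cap
```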
Hi @amatur, thank you for the answer and confirmation about the space usage.
we dump the intermediate uncompressed color matrix to disk (which is not even gzipped)
Yes, I think this is a severe limitation, because it would prevent use even on small inputs.
The other issue is that the current version in github only supports up to 128 colors.
Oh, that's why! Let me know.
Best, -Giulio
Hi @amatur and @yoann-dufresne, any update on this matter?
Dear all,
I'm trying to build your compressed representation (for k = 31) on a rather small pangenome of 4,546 Salmonella genomes, which can be downloaded here: https://zenodo.org/records/1323684. Can you please try to build your archive on the same data?
Specifically, the pipeline ran for ~5h before aborting with "no space left on device", which is very strange because I have over 1.5 TB available. I've also noticed that the pipeline writes some very large intermediate files, e.g. 186 GB. Can you confirm? Are there any parameters I need to set (I've used -k 31 and -j 8)?
Thanks! Best, -Giulio
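(For scale, a 186 GB intermediate file is roughly what an uncompressed one-bit-per-(k-mer, color) matrix would occupy at this color count. This is a back-of-the-envelope sketch only; the 3.5e8 distinct 31-mers figure is a made-up placeholder, not a measurement of the Zenodo dataset.)

```python
def color_matrix_gib(num_kmers: int, num_colors: int) -> float:
    """GiB occupied by an uncompressed bit matrix, rows padded to whole bytes."""
    bytes_per_row = (num_colors + 7) // 8  # 4,546 colors -> 569 bytes per row
    return num_kmers * bytes_per_row / 2**30

# Placeholder k-mer count, for illustration only (not measured):
print(round(color_matrix_gib(350_000_000, 4_546)))  # ~185 GiB
```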