medvedevgroup / ESSColor

Test 4,546 Salmonella genomes? #6

Open jermp opened 11 months ago

jermp commented 11 months ago

Dear all,

I'm trying to build your compressed representation (for k=31) on a rather small pangenome of 4,546 Salmonella genomes, which can be downloaded from https://zenodo.org/records/1323684. Could you please try to build your archive on the same data?

Specifically, the pipeline ran for ~5h before aborting with "no space left on device", which is very strange because I have over 1.5T available. I've also noticed that the pipeline writes some very large intermediate files, e.g. 186 GB. Can you confirm this is expected? Are there any parameters I need to set (I've set -k 31 and -j 8)?

Thanks! Best, -Giulio
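A "no space left on device" error despite ample free space can occur when intermediate files are written to a different filesystem than the one that was checked. Whether that is what happened here is an assumption; the thread does not say where the pipeline writes its temporary files. A minimal Python sketch for comparing free space across candidate locations (the helper names `free_gib` and `report_free_space` are made up for illustration):

```python
import shutil
import tempfile


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def report_free_space(paths):
    """Map each path to the free space (GiB) of its filesystem."""
    return {p: free_gib(p) for p in paths}


if __name__ == "__main__":
    # The working directory and the system temp directory may live on
    # different filesystems, so check both (assumed candidate locations).
    candidates = [".", tempfile.gettempdir()]
    for path, gib in report_free_space(candidates).items():
        print(f"{path}: {gib:.1f} GiB free")
```

If the two locations report very different numbers, the smaller one is worth ruling out first.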

amatur commented 11 months ago

Hi Giulio,

Unfortunately, the current pipeline is not well optimized for intermediate disk usage, but there are some easy fixes. High disk usage is expected, since we dump the intermediate uncompressed color matrix to disk (it is not even gzipped). The other issue is that the current version on GitHub only supports up to 128 colors (I realize this constraint is not documented anywhere).

I am currently working on fixing these two issues. I have an experimental implementation that supports a larger number of colors. I will test whether it works on this dataset and then update the repo with the fixes.

Thanks, Amatur
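To see why an uncompressed color matrix gets large, a back-of-the-envelope estimate helps: the matrix has one cell per (k-mer, color) pair. The sketch below is only illustrative. The on-disk layout ESSColor actually uses is not stated in this thread, so both layouts here (one byte per cell, or one bit per cell) are assumptions, and the k-mer count is a hypothetical placeholder, not a measurement of the Salmonella dataset.

```python
def dense_text_bytes(n_kmers, n_colors):
    """Size if every cell is stored as one byte (e.g. an ASCII '0' or '1')."""
    return n_kmers * n_colors


def packed_bytes(n_kmers, n_colors):
    """Size if every row is bit-packed, rounded up to whole bytes per row."""
    bytes_per_row = (n_colors + 7) // 8
    return n_kmers * bytes_per_row


if __name__ == "__main__":
    n_colors = 4546          # number of genomes in the dataset from the issue
    n_kmers = 100_000_000    # hypothetical k-mer count, for illustration only
    for name, size in [
        ("one byte per cell", dense_text_bytes(n_kmers, n_colors)),
        ("one bit per cell", packed_bytes(n_kmers, n_colors)),
    ]:
        print(f"{name}: {size / 1e9:.1f} GB")
```

Either way, the size grows with the product of k-mers and colors, which is why compressing or streaming the intermediate matrix matters for collections with thousands of genomes.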

jermp commented 11 months ago

Hi @amatur, thank you for the answer and confirmation about the space usage.

> we dump the intermediate uncompressed color matrix to disk (which is not even gzipped)

Yes, I think this is a severe limitation because it would prevent using the tool even on small collections.

> The other issue is that the current version in github only supports up to 128 colors.

Oh, that's why! Let me know.

Best, -Giulio

jermp commented 10 months ago

Hi @amatur and @yoann-dufresne, any update on this matter?