medvedevgroup / TwoPaCo

A fast constructor of the compressed de Bruijn graph from many genomes

Creation of many temporary files (of considerable size) when there are many references. #23

Closed by rob-p 4 years ago

rob-p commented 4 years ago

Hi,

I have noticed some behavior I was not expecting when using TwoPaCo to generate compacted dBGs for input fasta files with many distinct references. Specifically, we are making use of TwoPaCo internally in pufferfish indexing, and one of the common use cases now is to index a transcriptome for subsequent salmon quantification. Here, the total size of the sequence is small (~300M for the human transcriptome), but the number of individual fasta entries is very large (~200,000).

The behavior I noticed is that TwoPaCo creates, during processing, a temporary file in its temp directory for every input sequence in the fasta file. So, we get a temp folder with ~200,000 distinct files! This seems to be a particular problem for some users who are doing indexing on cluster machines (with NFS-mounted drives).

In addition to the large number of distinct files being created, the total size of the temporary directory grows quite large. For example, for the human transcriptome (again, ~300M of input sequence), the TwoPaCo temp directory grows to ~14G before files start being deleted.

I have two main questions. First, is this large intermediate disk-space usage expected, and if so, is there some way that it can be controlled? Second, is there some way to avoid or alter the behavior of creating one temp file per input sequence? This still works (as long as we're not on an NFS) for transcriptomic sequences, but some large metagenomic sequences have literally created more files in a directory than the file system is willing to handle. Ideally, there may be some way to "block together" temporary files for distinct references so that, rather than one temp file per reference, there would be one temp file per bucket of references, or some such.
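
To make the "block together" idea concrete, here is a minimal Python sketch of bucketing per-reference data into a fixed number of temp files. It is not TwoPaCo code and makes no claim about how TwoPaCo stores or fixes anything; the names `write_bucketed`, `read_bucket`, and `num_buckets`, the modulo bucketing rule, and the record layout (reference id and payload length, followed by the payload) are all assumptions chosen for illustration.

```python
import os
import struct
import tempfile

# Hypothetical record header: reference id and payload length, both uint64.
HEADER = struct.Struct("<QQ")


def write_bucketed(records, tmp_dir, num_buckets=16):
    """Write (ref_id, payload) records into at most num_buckets temp files.

    Each reference is assigned to a bucket by ref_id % num_buckets, so the
    number of files no longer depends on the number of references.
    """
    handles = {}
    try:
        for ref_id, payload in records:
            bucket = ref_id % num_buckets
            if bucket not in handles:
                path = os.path.join(tmp_dir, f"bucket_{bucket}.bin")
                handles[bucket] = open(path, "wb")
            out = handles[bucket]
            out.write(HEADER.pack(ref_id, len(payload)))
            out.write(payload)
    finally:
        for out in handles.values():
            out.close()
    return [handles[b].name for b in sorted(handles)]


def read_bucket(path):
    """Yield (ref_id, payload) records back from one bucket file."""
    with open(path, "rb") as src:
        while True:
            header = src.read(HEADER.size)
            if not header:
                break
            ref_id, size = HEADER.unpack(header)
            yield ref_id, src.read(size)
```

With something like `write_bucketed(records, tempfile.mkdtemp(), num_buckets=16)`, an input with ~200,000 references would produce at most 16 files in the temp directory. The trade-off in this sketch is that reading one reference back means scanning its bucket, or keeping a separate offset index.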

Thanks again for the great tool, and for any insight or suggestions you have on the above!

--Rob

iminkin commented 4 years ago

Hi Rob,

I am really glad that you still find TwoPaCo useful. Yes, creating temporary files is expected behavior. I will think about a possible solution and get back to you shortly.

Thank you,

Ilia Minkin

iminkin commented 4 years ago

Hi @rob-p,

Please check out the version in the branch https://github.com/medvedevgroup/TwoPaCo/tree/0.9.4. It should have the issues you mentioned solved: instead of a pack of temporary files, there should be a few, with size proportional to the input. It is not a release yet but will be in the near future. Please let me know if it works for your inputs.

Thank you,

Ilia Minkin

rob-p commented 4 years ago

Hi @iminkin,

Thank you very much; this is fantastic! We are merging the changes from the 0.9.4 branch into the develop branch of pufferfish and are doing regression testing. We will let you know when that is complete. However, the initial results are ... quite amazing! For the problem case I mentioned above, the total intermediate disk usage drops from ~14G to ~185M. Also the reduction in I/O brings down the total build time by almost 50%. Thank you so much for your quick response and brilliant optimization here! I'll check back and close this issue once our regression tests are completed.

Best, Rob

rob-p commented 4 years ago

Hi @iminkin,

We've tested the new construction code out on a few of our common datasets, and can confirm that, on these, the modified program gives the same outputs and often performs (considerably) better than the previous code. Not only is the intermediate disk usage considerably smaller, but the reduced I/O often leads to markedly faster construction. Thanks again for your quick resolution of this issue that was affecting a core component of our indexing pipeline!

Best, Rob