tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0

compressing the eventalign tsv files #128

Closed mmiladi closed 4 years ago

mmiladi commented 4 years ago

Hi,

Requiring plain-text .tsv files as input for sampcomp takes a lot of storage, which becomes a limiting factor for analysis capacity. It would be nice if you could support .tsv.gz as input.

Best,

a-slide commented 4 years ago

Hi @mmiladi, just to confirm: you are talking about the NanopolishComp Eventalign_Collapse output, aren't you? Not the raw nanopolish eventalign output (which, by the way, is much bigger).

mmiladi commented 4 years ago

Yes. For the nanopolish eventalign output, it's not a problem because one can pipe it to gzip and then zcat it to NanopolishComp, or directly pipe the two tools.

a-slide commented 4 years ago

Yes, that was my thinking when I developed the tool. The reason we use the NanopolishComp Eventalign_Collapse output uncompressed is to have random access to the raw data when running Nanocompore, thanks to the index file. Random access is also possible with gzip, but it is extremely inefficient in terms of I/O.

I guess the bzip2 format might be a good compromise, but I am sorry to say that this is not on our immediate priority list at the moment.

DictZip might be an option as well.
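
To illustrate the random-access point, here is a minimal sketch, not the actual NanopolishComp or Nanocompore code. It assumes a hypothetical index that maps each read to a `(byte_offset, byte_len)` pair in the uncompressed file; the helper names and the toy data are made up for the example. With the plain file, `seek` jumps straight to the block. With gzip, Python's `gzip` module accepts the same `seek` call but has to decompress and discard everything before the offset.

```python
import gzip
import os
import tempfile


def write_blocks(path, blocks):
    """Write text blocks back to back; return {name: (byte_offset, byte_len)}.

    The index layout is a hypothetical stand-in for a real index file.
    """
    index = {}
    with open(path, "wb") as fh:
        for name, text in blocks.items():
            data = text.encode("utf-8")
            index[name] = (fh.tell(), len(data))
            fh.write(data)
    return index


def read_block_plain(path, index, name):
    """Jump directly to the block in the uncompressed file."""
    offset, length = index[name]
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length).decode("utf-8")


def read_block_gzip(path, index, name):
    """Same offsets, but on a gzip file.

    seek() works here, but it is emulated: the stream is decompressed
    from the start and the bytes before the offset are thrown away.
    """
    offset, length = index[name]
    with gzip.open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length).decode("utf-8")


# Toy data, invented for the example.
blocks = {
    "read_1": "#read_1\tref_A\n0\t1.0\n1\t1.5\n",
    "read_2": "#read_2\tref_A\n0\t0.9\n1\t1.4\n",
    "read_3": "#read_3\tref_B\n0\t2.1\n1\t2.2\n",
}

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "demo.tsv")
    gz_path = plain_path + ".gz"

    index = write_blocks(plain_path, blocks)
    with open(plain_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        dst.write(src.read())

    plain_result = read_block_plain(plain_path, index, "read_3")
    gzip_result = read_block_gzip(gz_path, index, "read_3")
```

Both functions return the same text, so the difference is only in the work done per lookup: the cost of the gzip version grows with the offset, and it is paid again for every read that is fetched.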

a-slide commented 4 years ago

We would gladly accept a PR to both NanopolishComp and Nanocompore if you want to have a go at it :D

mmiladi commented 4 years ago

Thanks for the info, and thanks for the nice and well-documented work! Unfortunately, I don't know much about the underlying algorithm, so I am trying to contribute in other ways :) : https://github.com/bioconda/bioconda-recipes/pull/21747

a-slide commented 4 years ago

Thanks