vib-singlecell-nf / vsn-pipelines

A repository of pipelines for single-cell data in Nextflow DSL2
GNU General Public License v3.0

[SCENIC] Compress and reduce file size of outputs #270

Open dweemx opened 4 years ago

dweemx commented 4 years ago

This is critical, especially for the SCENIC multi-runs pipelines.

Currently, the GRN step saves the network as a .tsv file. This should be compressed as a .tsv.gz file instead.

The cisTarget step saves the motif enrichment table as a .tsv file. This should be compressed as a .tsv.gz file.

The AUCell step saves the regulon AUC matrix in a loom file along with the expression matrix. This should be saved either as a compressed .tsv.gz file or as a loom file that does not contain the expression matrix.

cflerin commented 4 years ago

Hey @dweemx ,

I think these steps will all need to be implemented within the pySCENIC code (CLI) itself. Otherwise the pipeline has the overhead of zipping and unzipping these files. It shouldn't be too much work, I could probably get to it next week.

The final point, about AUCell outputting a full loom with expression data is probably the most important. We can change the AUCell output to a tsv matrix, which would avoid writing the expression data again. But this might require significant changes to the multi-runs section of the pipeline.
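
To make the tsv idea concrete, here is a minimal sketch using pandas only. It assumes the AUC matrix is held as a cells x regulons DataFrame; the regulon names, values, and the file name auc_mtx.tsv.gz are invented for illustration, and this is not what pySCENIC or the pipeline currently does.

```python
import os
import tempfile

import pandas as pd

# Toy AUC matrix: rows are cells, columns are regulons (all values invented).
auc = pd.DataFrame(
    {"SOX2(+)": [0.25, 0.5, 0.0], "PAX6(+)": [0.125, 0.0, 0.75]},
    index=pd.Index(["cell_1", "cell_2", "cell_3"], name="Cell"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "auc_mtx.tsv.gz")  # hypothetical file name
    # pandas infers gzip compression from the ".gz" suffix when writing.
    auc.to_csv(path, sep="\t")
    # Read the matrix back, restoring the cell IDs as the index.
    restored = pd.read_csv(path, sep="\t", index_col=0)

# The round trip preserves the matrix; no expression data is written.
pd.testing.assert_frame_equal(auc, restored)
```

Only the AUC values end up on disk this way, so the expression matrix would still have to come from the original input when the multi-runs steps need it.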

dweemx commented 4 years ago

Hi @cflerin, I was not going to implement the compression here, since it's already available in pySCENIC. It's just a matter of calling pySCENIC properly. Regarding AUCell, yes, I had in mind exactly what you suggested.

cflerin commented 4 years ago

No, it won't work, I think, unless I'm totally misunderstanding what you're planning. The pySCENIC CLI won't recognize files with a .gz ending, even though pandas is capable of reading them:

 pyscenic.cli.pyscenic - ERROR - Unknown file format for "adj.tsv.gz".

Likewise, if you want AUCell to have a tsv(.gz) output with loom input, it will require (minor) changes to the code.
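
On the pandas side, a small sketch of what "capable of reading them" means: the file is written with Python's gzip module and read back with a plain read_csv call. The TF / target / importance columns and their values are made up for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

# Toy adjacency table (column names and values are illustrative only).
tsv_text = "TF\ttarget\timportance\nSOX2\tNES\t12.5\nPAX6\tSOX2\t3.0\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "adj.tsv.gz")
    # Write the gzip file without pandas, as any external tool might.
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(tsv_text)
    # No compression argument is passed: the default compression="infer"
    # picks gzip based on the ".gz" suffix.
    adjacencies = pd.read_csv(path, sep="\t")

print(adjacencies)
```

So the limitation is in how the CLI decides on the file format from the name, not in the reader underneath.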

dweemx commented 4 years ago

You're right, now I see it in the code. I guess this piece of code should be changed: https://github.com/aertslab/pySCENIC/blob/d3120aff123c87c1dbeb5b5e1030cb9a4210836d/src/pyscenic/cli/utils.py#L196-L198 because, like you said, pandas can load .tsv.gz files.
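
One way such a check could be made compression-aware is sketched below. This is not the pySCENIC code: split_format and separator_for are hypothetical helpers, and the idea is simply to drop a trailing .gz before looking at the format suffix.

```python
from pathlib import Path

# Hypothetical helpers for illustration; not part of pySCENIC.
COMPRESSION_SUFFIXES = {".gz"}


def split_format(fname):
    """Return (format_suffix, is_compressed) for a file name."""
    suffixes = [s.lower() for s in Path(fname).suffixes]
    compressed = bool(suffixes) and suffixes[-1] in COMPRESSION_SUFFIXES
    if compressed:
        # Ignore the compression suffix when deciding on the format.
        suffixes = suffixes[:-1]
    fmt = suffixes[-1] if suffixes else ""
    return fmt, compressed


def separator_for(fname):
    """Pick the column separator from the format suffix."""
    fmt, _ = split_format(fname)
    if fmt == ".tsv":
        return "\t"
    if fmt == ".csv":
        return ","
    raise ValueError(f'Unknown file format for "{fname}".')


print(split_format("adj.tsv.gz"))
print(repr(separator_for("reg.csv.gz")))
```

With the separator chosen this way, the file name can be passed to pandas unchanged, and pandas handles the (de)compression based on the suffix.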

cflerin commented 4 years ago

Hi @dweemx ,

I've patched pySCENIC to enable (optional) file compression for all intermediate files (included in new release 0.10.1, and a new Docker image is available). Now, if you give a .gz ending in the filename argument when calling pySCENIC (adj.tsv.gz, reg.csv.gz), the file will be read/written as a compressed file.

I've started a branch feature/28-add_intermediate_file_compression with some basic steps toward this.

cflerin commented 3 years ago

Merged this partially completed feature in 63eb95286b28729983d7a3b9234887381dd7dec9.