single-cell-genetics / cellSNP

Pileup biallelic SNPs from single-cell and bulk RNA-seq data
Apache License 2.0

Default nan-handling policy is a memory hog #5

Open ivanov-v-v opened 5 years ago

ivanov-v-v commented 5 years ago

Because single-cell datasets are very sparse, it's important to handle missing values in a way that doesn't consume too much memory. Currently, CellSNP labels missing entries with ".:.:.:.:.:." (11 bytes at best). I would strongly suggest using an empty string instead of that stub. I have been processing the output of CellSNP, and when I manually replaced all occurrences of ".:.:.:.:.:." with an empty string, the file size dropped from 25.6 GB to 2.5 GB. This is dramatic. Not only does this choice of NaN-filling value waste memory, it also makes the file harder to process with some convenient tools in Python/R.
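For reference, the manual replacement described above can be sketched in Python roughly as follows. This is only an illustration, not part of CellSNP: the function names `blank_missing` and `blank_missing_file` are hypothetical, and the sketch assumes a standard tab-separated VCF layout in which the first nine columns are fixed and every later column is one cell. Only fields that consist entirely of the placeholder are touched.

```python
import gzip

MISSING = ".:.:.:.:.:."  # placeholder string quoted in this issue


def blank_missing(line, placeholder=MISSING, replacement="", n_fixed=9):
    """Replace placeholder-only sample fields in a single VCF line."""
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    # Assumption: columns 0..8 are the fixed VCF columns (CHROM..FORMAT);
    # everything after that is a per-cell sample field.
    for i in range(n_fixed, len(fields)):
        if fields[i] == placeholder:
            fields[i] = replacement
    return "\t".join(fields) + "\n"


def blank_missing_file(src, dst, replacement=""):
    """Stream a gzipped VCF from src to dst, rewriting one line at a time."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in fin:
            fout.write(blank_missing(line, replacement=replacement))
```

Passing `replacement="."` instead of the empty string gives a single-character missing marker, which may be a safer choice if the file is later read by a VCF parser.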

huangyh09 commented 5 years ago

Very good point. The reason we used ".:.:.:.:.:." is to keep the same format (i.e., the same number of tags) even when the entry is missing. I will check whether common R/Python packages for processing VCF files are compatible with "." for missing values. If so, this will indeed save a lot of space.

Alternatively, since v0.1.6, cellSNP supports saving the AD, DP, and OTH tags as sparse matrices. Please use -O OUT_DIR instead of -o OUT_FILE.vcf.gz. You can also use sparseVCF.py to convert an existing VCF.gz into sparse matrices.
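To illustrate why a sparse representation helps here, the following is a minimal sketch of the general idea. It is not the actual implementation of sparseVCF.py or of the -O option. The function name `vcf_to_triplets` is hypothetical, and the sketch assumes that the requested tag (for example AD or DP) holds a single integer per cell and that its position is given by the FORMAT column of each line. Only non-missing, non-zero values are stored, as (variant index, cell index, value) triplets.

```python
def vcf_to_triplets(lines, tag):
    """Collect (variant_idx, cell_idx, value) for non-missing, non-zero values of one tag."""
    triplets = []
    var_idx = 0
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        fields = line.rstrip("\n").split("\t")
        fmt = fields[8].split(":")  # FORMAT column lists the tag order
        if tag in fmt:
            k = fmt.index(tag)
            for cell_idx, sample in enumerate(fields[9:]):
                parts = sample.split(":")
                # Treat "." and empty strings as missing and store nothing.
                if k < len(parts) and parts[k] not in ("", "."):
                    value = int(parts[k])  # assumption: one integer per tag
                    if value != 0:
                        triplets.append((var_idx, cell_idx, value))
        var_idx += 1
    return triplets
```

The resulting triplets map directly onto a coordinate-format sparse matrix, for example via `scipy.sparse.coo_matrix((values, (rows, cols)), shape=(n_variants, n_cells))`, so missing entries cost no storage at all.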