Open ivanov-v-v opened 5 years ago
Very good point. We used ".:.:.:.:.:." to keep the same format (i.e., the same number of tags) even when a value is missing. I will check whether common R/Python packages for processing VCF files are compatible with "." for missing values. If so, this will indeed save a lot of space.
Alternatively, from v0.1.6, CellSNP supports saving the AD, DP, and OTH tags to sparse matrices. For this, please use -O OUT_DIR instead of -o OUT_FILE.vcf.gz.
Also, you can use sparseVCF.py to convert an existing VCF.gz file into sparse matrices.
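To illustrate the idea of a sparse representation, here is a rough sketch that collects only the non-missing entries of one tag as (variant, cell, value) triplets. This is not how sparseVCF.py is implemented; the function name `sparse_triplets` is hypothetical, and it assumes the rows are VCF data lines (no header) whose chosen tag holds a single integer per cell.

```python
def sparse_triplets(rows, tag):
    """Collect (variant_index, cell_index, value) for non-missing entries of `tag`.

    Hypothetical helper for illustration; assumes `tag` holds one integer per cell.
    """
    triplets = []
    for i, row in enumerate(rows):
        fields = row.rstrip("\n").split("\t")
        # Column 8 is FORMAT, which lists the order of tags in each sample field.
        keys = fields[8].split(":")
        k = keys.index(tag)
        # Sample (cell) columns start at index 9.
        for j, sample in enumerate(fields[9:]):
            values = sample.split(":")
            # Skip cells where the tag is absent, empty, or marked missing with ".".
            if k < len(values) and values[k] not in (".", ""):
                triplets.append((i, j, int(values[k])))
    return triplets
```

Triplets like these can then be handed to a sparse matrix constructor, for example `scipy.sparse.coo_matrix((data, (row, col)), shape=...)`, so that missing entries take no space at all.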
As single-cell datasets are really sparse, it's important to handle missing values in a way that doesn't consume too much memory. Currently, CellSNP labels missing entries with ".:.:.:.:.:." (11 bytes at best). I would strongly suggest using an empty string instead of that stub. I have been processing the output of CellSNP, and when I manually replaced all occurrences of ".:.:.:.:.:." with an empty string, I reduced the file size from 25.6 GB to 2.5 GB. This is dramatic. Not only does this choice of missing-value filler waste memory, it also makes the file harder to process with some convenient tools in Python/R.
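For anyone who wants to try the same replacement, here is a minimal sketch of one way to do it in Python. The function names `compact_missing` and `compact_vcf` are made up for this example, and it assumes a tab-separated VCF.gz with the usual nine fixed columns before the cell columns. It only replaces a cell field that is exactly the placeholder, so "." in ID or QUAL is left alone.

```python
import gzip

PLACEHOLDER = ".:.:.:.:.:."  # missing-entry stub quoted in this issue


def compact_missing(line, replacement=""):
    """Replace whole-field placeholders in the cell columns of one VCF line."""
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    fields = line.rstrip("\n").split("\t")
    # Columns 0-8 are the fixed VCF columns (CHROM ... FORMAT); cells start at 9.
    fields[9:] = [replacement if f == PLACEHOLDER else f for f in fields[9:]]
    return "\t".join(fields) + "\n"


def compact_vcf(src_path, dst_path, replacement=""):
    """Stream a gzipped VCF line by line, so the whole file never sits in memory."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            dst.write(compact_missing(line, replacement))
```

Passing `replacement="."` instead keeps every column non-empty, which is the option the reply mentions checking against common R/Python VCF packages.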