Closed bio-la closed 5 years ago
AD is the count for the ALT allele; DP is the count for the ALT + REF alleles; and OTH is the count for all other alleles (neither ALT nor REF). They are annotated in the VCF file.
Could you please elaborate on the AD, DP and OTH sparse matrix format? I'm having trouble finding documentation on what the values in each column indicate. Please correct me if I'm wrong, but based on my reading of coordinate-format Matrix Market sparse matrices, the first column is a cell ID number (integers repeating 1-n, total number of cells), the second column is a specific genotype location/ID and the third column is the allele status (integers either 1 or 2)? Any clarification will be greatly appreciated, thank you!
Thanks for the question. In the AD/DP/OTH matrices, the first column is the SNP ID (integers repeating 1-m, total number of detected/output SNPs); the second column is the cell ID (integers repeating 1-n, total number of cells); and the third column is the UMI/read count (integers that could be other than 1 or 2, e.g., 10).
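To make the column layout concrete, here is a minimal sketch that parses Matrix Market coordinate text into (SNP ID, cell ID, count) triplets. The function name read_mtx_triplets and the toy matrix are made up for illustration; real files such as cellSNP.tag.AD.mtx would be read from disk, and a library reader (for example scipy.io.mmread) is usually the more convenient choice.

```python
def read_mtx_triplets(lines):
    """Parse Matrix Market coordinate text into (n_snps, n_cells, entries).

    Each entry is a 1-based (snp_id, cell_id, count) triplet, matching the
    three columns described above.
    """
    body = []
    for ln in lines:
        stripped = ln.strip()
        # Skip blank lines and the '%' header/comment lines.
        if not stripped or stripped.startswith("%"):
            continue
        body.append(stripped)

    # First non-comment line: number of rows (SNPs), columns (cells), entries.
    n_snps, n_cells, n_entries = (int(x) for x in body[0].split())

    entries = []
    for ln in body[1:]:
        snp_id, cell_id, count = ln.split()
        entries.append((int(snp_id), int(cell_id), int(count)))

    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_snps, n_cells, entries


# Toy data (hypothetical, not real cellSNP output): 3 SNPs x 2 cells.
example = """%%MatrixMarket matrix coordinate integer general
%
3 2 4
1 1 10
1 2 3
2 2 1
3 1 7
""".splitlines()

n_snps, n_cells, entries = read_mtx_triplets(example)
for snp_id, cell_id, count in entries:
    print(f"SNP {snp_id}, cell {cell_id}: count {count}")
```

Only non-zero counts are stored, so a (SNP, cell) pair that does not appear in the file has a count of 0.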
Okay, that makes sense, thank you for the clarification! In light of that format, I was wondering if there is a way to merge these matrices? Due to constraints on my HPC cluster, I was only able to prepare cellSNP files separately by chromosome, and I'm looking for ways to merge these output files in preparation for running vireo. Is this possible?
The SNP IDs in the output matrix are based on the order of the output SNPs in cellSNP.base.vcf.gz, and the cell IDs are based on the order of the output barcodes in cellSNP.samples.tsv. So you may first merge all the cellSNP.base.vcf.gz files and all the cellSNP.samples.tsv files separately, to get new SNP IDs and cell IDs, and then merge the output matrices based on the matched SNPs or barcodes.
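As a rough illustration of the re-indexing described above, the sketch below merges per-chromosome triplets by shifting SNP IDs by the number of SNPs in the preceding chunks and remapping cell IDs through their barcodes. The function merge_chunks, the dictionary layout and the toy barcodes are assumptions made for this example, not part of cellSNP or vireo; it also assumes the base VCF files are concatenated in the same chunk order.

```python
def merge_chunks(chunks):
    """Merge per-chromosome sparse matrices given as 1-based triplets.

    Each chunk is a dict with:
      'n_snps'   : number of SNPs (rows) in that chunk's base VCF
      'barcodes' : cell barcodes in the order of that chunk's samples file
      'entries'  : list of (snp_id, cell_id, count) triplets
    """
    merged_barcodes = []
    barcode_to_id = {}
    merged_entries = []
    snp_offset = 0

    for chunk in chunks:
        # Assign a new 1-based cell ID to each barcode not seen before.
        for bc in chunk["barcodes"]:
            if bc not in barcode_to_id:
                merged_barcodes.append(bc)
                barcode_to_id[bc] = len(merged_barcodes)

        for snp_id, cell_id, count in chunk["entries"]:
            bc = chunk["barcodes"][cell_id - 1]
            merged_entries.append((snp_id + snp_offset, barcode_to_id[bc], count))

        # SNPs of the next chunk are numbered after all SNPs of this one.
        snp_offset += chunk["n_snps"]

    return snp_offset, merged_barcodes, merged_entries


# Toy input (hypothetical barcodes and counts).
chr1 = {"n_snps": 2, "barcodes": ["AAAC-1", "AAAG-1"],
        "entries": [(1, 1, 5), (2, 2, 3)]}
chr2 = {"n_snps": 1, "barcodes": ["AAAG-1", "AAAT-1"],
        "entries": [(1, 1, 4), (1, 2, 2)]}

total_snps, barcodes, entries = merge_chunks([chr1, chr2])
print(total_snps, barcodes)
print(entries)
```

The same offsets and barcode mapping would need to be applied to the AD, DP and OTH matrices so that all three stay aligned.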
There could be a simpler way, as the vireo -c option also accepts a VCF file (containing genotypes) as input. If you have separate cellSNP.cells.vcf.gz files, you could directly merge these VCF files and use the merged VCF, instead of the output matrices (cellSNP folder with sparse matrices), as input for vireo.
Hi, I am wondering whether, in the AD/DP/OTH matrices, the AD count shown for a SNP and a cell ID is the count for that one cell or for all cells merged. Is there a way to find out which cells contribute to the AD matrix without looking at the cells.vcf?
Hi, in the AD matrix file (cellSNP.tag.AD.mtx), each line shows the AD count (the third column) for one specific SNP (the first column is the SNP ID) and one cell (the second column is the cell ID). To find the cells that contribute to the AD counts, you may first get the cell IDs from the second column and then extract the cell barcodes from the file cellSNP.samples.tsv with those cell IDs.
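The lookup described above can be sketched as follows, assuming the AD triplets and the barcode list (one barcode per line of cellSNP.samples.tsv, in file order) have already been loaded. The helper cells_for_snp and the toy values are hypothetical.

```python
def cells_for_snp(entries, barcodes, snp_id):
    """Return {barcode: AD count} for the cells with an entry at snp_id.

    entries  : list of 1-based (snp_id, cell_id, count) triplets from the AD matrix
    barcodes : list of barcodes; cell ID k corresponds to barcodes[k - 1]
    """
    return {
        barcodes[cell_id - 1]: count
        for s, cell_id, count in entries
        if s == snp_id
    }


# Toy data (hypothetical): 3 cells, a few AD entries.
barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]
ad_entries = [(1, 1, 10), (1, 3, 2), (2, 2, 1)]

contributors = cells_for_snp(ad_entries, barcodes, snp_id=1)
print(contributors)
```

Cells that do not appear for a given SNP have an AD count of 0 there, since the sparse format stores only non-zero values.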
The matrix file is in Matrix Market format. Details can be found here.
Hi, I want to confirm: when I use mode 1a or 1b with --genotyping, does the caller call mutations based on pooled cells? That is, in cellSNP.cell.vcf, are the AD and DP in the first INFO section from pooled cells, with cell-specific AD and DP in the following columns?
Yes, for each single sample (cell), the genotype is inferred using its specific AD and DP.
Just to be clear, the .base.vcf file lists the variants called from pooled cells, and the .cell.vcf gives information about individual cells.
And the SNP ID in the first column of the AD/DP/OTH matrices follows the order in the .base.vcf.
Is that correct?
A follow-up question: I have seen that a lot of SNPs in the .base.vcf have AD=0, like this:
1 629906 . C T . PASS AD=56;DP=61;OTH=2
1 632644 . A G . PASS AD=0;DP=42;OTH=0
1 946247 . G A . PASS AD=2;DP=52;OTH=0
1 1255143 . C T . PASS AD=0;DP=27;OTH=0
Is it because I provided the reference, and these should not be called mutations in the sample?
Hi, the first paragraph is correct. As for the question, the SNP being homozygous (with the REF allele being the major allele) is one possible reason for AD=0. Allele imbalance, copy number variations or technical factors (such as allele dropout) could also lead to AD=0. IMO, how to define a mutation depends on your research question; you may adjust the minor allele frequency threshold to filter SNPs (--minMAF option).
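To see how such records could be screened after the fact, here is a small sketch that parses the INFO field of the four records quoted above and computes a pooled minor allele fraction as min(AD, DP - AD) / DP. That formula and the 0.05 cutoff are assumptions for illustration only; they are not necessarily how the --minMAF option is computed internally.

```python
def parse_info(info):
    """Turn 'AD=56;DP=61;OTH=2' into {'AD': 56, 'DP': 61, 'OTH': 2}."""
    return {key: int(value)
            for key, value in (field.split("=") for field in info.split(";"))}


def minor_allele_fraction(ad, dp):
    """Assumed definition: fraction of REF+ALT counts from the rarer allele."""
    if dp == 0:
        return 0.0
    return min(ad, dp - ad) / dp


# (chromosome, position, INFO) taken from the records quoted in the thread.
records = [
    ("1", 629906, "AD=56;DP=61;OTH=2"),
    ("1", 632644, "AD=0;DP=42;OTH=0"),
    ("1", 946247, "AD=2;DP=52;OTH=0"),
    ("1", 1255143, "AD=0;DP=27;OTH=0"),
]

MIN_MAF = 0.05  # hypothetical cutoff chosen for this example

kept = []
for chrom, pos, info in records:
    fields = parse_info(info)
    maf = minor_allele_fraction(fields["AD"], fields["DP"])
    print(f"{chrom}:{pos} AD={fields['AD']} DP={fields['DP']} MAF={maf:.3f}")
    if maf >= MIN_MAF:
        kept.append(pos)

print("kept positions:", kept)
```

With this assumed definition, SNPs with AD=0 always have a minor allele fraction of 0 and are removed by any positive cutoff.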
Hi, what do AD, DP and OTH stand for (variant x cell sparse matrix output, #6)? Thanks!