single-cell-genetics / cellSNP

Pileup biallelic SNPs from single-cell and bulk RNA-seq data
Apache License 2.0

output sparse matrix #7

bio-la closed 4 years ago

bio-la commented 4 years ago

Hi, what do AD, DP and OTH stand for (variant x cell sparse matrix output, #6)? Thanks!

huangyh09 commented 4 years ago

AD is the count for the ALT allele; DP is the count for the ALT + REF alleles; and OTH is the count for alleles other than ALT and REF. They are annotated in the VCF file.

alexisweber commented 2 years ago

Could you please elaborate on the AD, DP and OTH sparse matrix format? I'm having trouble finding documentation on what the values in each column indicate. Please correct me if I'm wrong, but based on my reading of the Matrix Market coordinate format for sparse matrices, the first column is a cell ID number (integers 1 to n, where n is the total number of cells), the second column is a specific genotype location/ID, and the third column is the allele status (integers, either 1 or 2)? Any clarification would be greatly appreciated, thank you!

hxj5 commented 2 years ago

Thanks for the question. In the AD/DP/OTH matrices, the first column is the SNP ID (integers from 1 to m, where m is the total number of detected/output SNPs); the second column is the cell ID (integers from 1 to n, where n is the total number of cells); and the third column is the UMI/read count (an integer that can be other than 1 or 2, e.g., 10).
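To make the column layout concrete, here is a minimal sketch of reading a Matrix Market coordinate file by hand. The function name `parse_mtx` and the example contents are made up for illustration (they are not real cellSNP output), and the banner line is assumed to follow the usual Matrix Market convention.

```python
def parse_mtx(text):
    """Parse Matrix Market coordinate text into (n_snps, n_cells, entries).

    Each entry is (snp_id, cell_id, count); both IDs are 1-based.
    """
    lines = [ln.strip() for ln in text.splitlines()]
    # Drop blank lines plus the banner and comment lines, which start with '%'.
    body = [ln for ln in lines if ln and not ln.startswith("%")]
    # The first remaining line is the size line: rows, columns, stored entries.
    n_snps, n_cells, n_entries = (int(x) for x in body[0].split())
    entries = []
    for ln in body[1:]:
        snp_id, cell_id, count = (int(x) for x in ln.split())
        entries.append((snp_id, cell_id, count))
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_snps, n_cells, entries


# Hypothetical AD matrix: 3 SNPs, 2 cells, 4 non-zero counts.
example = """%%MatrixMarket matrix coordinate integer general
%
3 2 4
1 1 10
1 2 3
3 1 1
3 2 7
"""

n_snps, n_cells, entries = parse_mtx(example)
for snp_id, cell_id, count in entries:
    print(f"SNP {snp_id}, cell {cell_id}: count {count}")
```

Only non-zero counts are stored, so a SNP/cell pair that does not appear (SNP 2 in this example) has a count of 0.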

alexisweber commented 2 years ago

Okay, that makes sense, thank you for the clarification! In light of that format, I was wondering if there is a way to merge these matrices? Due to constraints on my HPC cluster, I was only able to prepare cellSNP files separately by chromosome, and I'm looking for ways to merge these output files in preparation for running vireo. Is this possible?

hxj5 commented 2 years ago

The SNP IDs in the output matrix are based on the order of the output SNPs in cellSNP.base.vcf.gz, and the cell IDs are based on the order of the output barcodes in cellSNP.samples.tsv. So you may first merge all the cellSNP.base.vcf.gz files and all the cellSNP.samples.tsv files separately, to get new SNP IDs and cell IDs, and then merge the output matrices based on the matched SNPs and barcodes.
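The re-indexing described above could be sketched roughly as follows. This is only an illustration under stated assumptions: each per-chromosome run is represented as a hypothetical dict holding its SNP count, its barcode list (in cellSNP.samples.tsv order) and its matrix entries; the merged base VCF is assumed to concatenate the per-chromosome VCFs in the same order as `chunks`; and `merge_chunks` is a made-up helper, not part of cellSNP.

```python
def merge_chunks(chunks):
    """Merge per-chromosome sparse matrices into one set of 1-based IDs.

    chunks: list of dicts with keys 'n_snps', 'barcodes' and 'entries',
    given in the same order in which the base VCFs are concatenated.
    """
    merged_barcodes = []
    barcode_to_id = {}
    merged_entries = []
    snp_offset = 0
    for chunk in chunks:
        # Assign new cell IDs by order of first appearance of each barcode.
        for barcode in chunk["barcodes"]:
            if barcode not in barcode_to_id:
                merged_barcodes.append(barcode)
                barcode_to_id[barcode] = len(merged_barcodes)
        for snp_id, cell_id, count in chunk["entries"]:
            barcode = chunk["barcodes"][cell_id - 1]
            # Shift SNP IDs past the SNPs of all earlier chunks.
            merged_entries.append(
                (snp_id + snp_offset, barcode_to_id[barcode], count)
            )
        snp_offset += chunk["n_snps"]
    return snp_offset, merged_barcodes, merged_entries


# Made-up example: two chromosomes whose barcode lists are in different order.
chr1 = {"n_snps": 2, "barcodes": ["AAAC-1", "AAAG-1"],
        "entries": [(1, 1, 5), (2, 2, 3)]}
chr2 = {"n_snps": 3, "barcodes": ["AAAG-1", "AAAC-1"],
        "entries": [(1, 1, 4), (3, 2, 2)]}

total_snps, merged_barcodes, merged_entries = merge_chunks([chr1, chr2])
print(total_snps, merged_barcodes, merged_entries)
```

The same merge would need to be applied to the AD, DP and OTH matrices with the same chunk order, so that all three keep referring to the same SNP and cell IDs.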

There could be a simpler way, as the vireo -c option also accepts a VCF file (containing genotypes) as input. If you have separate cellSNP.cells.vcf.gz files, you could directly merge these VCF files and use the merged VCF, instead of the output matrices (cellSNP folder with sparse matrices), as input for vireo.

wangsky137 commented 1 year ago

Hi, I am wondering whether, in the AD/DP/OTH matrices, the AD count for a SNP with a given cell ID is the count for that one cell or for all cells merged. Is there a way to find out which cells contribute to the AD matrix without looking at the cells.vcf?

hxj5 commented 1 year ago

Hi, in the AD matrix file (cellSNP.tag.AD.mtx), each line shows the AD count (the third column) for one specific SNP (the first column is the SNP ID) and one cell (the second column is the cell ID). To find the cells that contribute to the AD counts, you may first get the cell IDs from the second column and then use those cell IDs to extract the cell barcodes from the file cellSNP.samples.tsv.
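As a rough sketch of that lookup, assuming the matrix entries have already been read as (SNP ID, cell ID, count) tuples and the barcodes have been read into a list in the order they appear in cellSNP.samples.tsv (one barcode per line is an assumption here). The helper name `cells_per_snp` and the data are made up for illustration.

```python
from collections import defaultdict


def cells_per_snp(entries, barcodes):
    """Group AD entries by SNP ID, translating 1-based cell IDs to barcodes."""
    contributing = defaultdict(list)
    for snp_id, cell_id, count in entries:
        if count > 0:
            # Cell IDs are 1-based positions in the barcode list.
            contributing[snp_id].append((barcodes[cell_id - 1], count))
    return dict(contributing)


# Made-up barcodes and AD entries.
barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]
ad_entries = [(1, 1, 10), (1, 3, 2), (2, 2, 4)]

result = cells_per_snp(ad_entries, barcodes)
for snp_id, cells in sorted(result.items()):
    print(snp_id, cells)
```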

The matrix file is in Matrix Market format. Details can be found here.

wangsky137 commented 1 year ago

Hi, I want to confirm: when I use mode 1a or 1b with --genotyping, does the caller call mutations based on pooled cells? That is, in cellSNP.cell.vcf, are the AD and DP in the first INFO section from pooled cells, with cell-specific AD and DP in the following columns?

hxj5 commented 1 year ago

Yes, for each single sample (cell), the genotype is inferred using its specific AD and DP.

wangsky137 commented 1 year ago

Just to be clear: the .base.vcf file lists variants called from pooled cells, and the .cell.vcf gives information about individual cells. And the SNP ID in the first column of the AD, DP and OTH matrices follows the order in the .base.vcf. Is that correct?

A follow-up question: I have seen that a lot of SNPs in the .base.vcf have AD=0, like this:

##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
1 629906 . C T . PASS AD=56;DP=61;OTH=2
1 632644 . A G . PASS AD=0;DP=42;OTH=0
1 946247 . G A . PASS AD=2;DP=52;OTH=0
1 1255143 . C T . PASS AD=0;DP=27;OTH=0

Is it because I provided the reference, so that these sites should not be called mutations in the sample?

hxj5 commented 1 year ago

Hi, the first paragraph is correct. As for the question, the SNP being homozygous (with the REF allele being the major allele) is one possible reason for AD=0. Allele imbalance, copy number variations or technical factors (such as allele dropout) could also lead to AD=0. IMO, how to define a mutation depends on your research question; you may adjust the minor allele frequency threshold to filter SNPs (--minMAF option).
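For a quick look at the pooled counts, here is a minimal sketch that parses the INFO field of the four records quoted above and computes a minor allele fraction as min(AD, DP - AD) / DP. That formula and the 0.05 threshold are assumptions for illustration only; this is not claimed to be how cellSNP computes --minMAF.

```python
def parse_info(info):
    """Turn 'AD=56;DP=61;OTH=2' into {'AD': 56, 'DP': 61, 'OTH': 2}."""
    return {key: int(value)
            for key, value in (field.split("=") for field in info.split(";"))}


def minor_allele_fraction(ad, dp):
    """Fraction of REF+ALT counts supporting the less frequent allele."""
    if dp == 0:
        return 0.0
    # DP is ALT + REF, so the REF count is DP - AD.
    return min(ad, dp - ad) / dp


# Records copied from the .base.vcf excerpt in this thread.
records = [
    ("1", 629906, "AD=56;DP=61;OTH=2"),
    ("1", 632644, "AD=0;DP=42;OTH=0"),
    ("1", 946247, "AD=2;DP=52;OTH=0"),
    ("1", 1255143, "AD=0;DP=27;OTH=0"),
]

threshold = 0.05  # hypothetical cutoff, chosen only for this example
kept = []
for chrom, pos, info in records:
    counts = parse_info(info)
    maf = minor_allele_fraction(counts["AD"], counts["DP"])
    print(f"{chrom}:{pos} AD={counts['AD']} DP={counts['DP']} MAF={maf:.3f}")
    if maf >= threshold:
        kept.append(pos)
```

With these counts, sites with AD=0 have a minor allele fraction of 0 and would be dropped by any positive threshold.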