single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0
124 stars 11 forks source link

empty output using cellsnp-lite with 10x and 1000 genome vcf file #35

Closed songeric1107 closed 2 years ago

songeric1107 commented 2 years ago

trying to use cellsnp-lit to prepare the input for vireo, here is the example

cellsnp-lite -s sc1_raw/possorted_genome_bam.bam -b sc1_raw/barcodes.tsv.gz -O sc1_out -R ../ref/genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf -p 20 --minMAF 0.1 --minCOUNT 20 --gzip

however, the whole process takes a week to finish, but the result is empty.

Not sure what is the problem?

thank you

Yang

hxj5 commented 2 years ago

Hi, thanks for your feedback. For quicker debugging, could you share the following information?

  1. the version of cellsnp-lite;
  2. the whole log file;
  3. a few (e.g., 2) lines of your sam records;
  4. a few lines of your barcodes;
  5. a few lines of your vcf records;

Thank you! Xianjie

songeric1107 commented 2 years ago

thank you, XianJie.

please see below

1.cellsnp-lite Version: 1.2.2

  1. log file sc1.log

  2. bam sub.bam.txt

  3. barcode AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 AAACCTGAGAAACCTA-1 AAACCTGAGAAACGAG-1 AAACCTGAGAAACGCC-1

  4. vcf

1 10539 rs537182016 C A 100 PASS OLD=10539;AF=0.000599042 1 10642 rs558604819 G A 100 PASS OLD=10642;AF=0.00419329 1 11008 rs575272151 C G 100 PASS OLD=11008;AF=0.0880591 1 11012 rs544419019 C G 100 PASS OLD=11012;AF=0.0880591 1 11063 rs561109771 T G 100 PASS OLD=11063;AF=0.00299521

hxj5 commented 2 years ago

Hi, these files seem all right. For the bam file, could you show some properly mapped records? e.g., print a few records whose MAPQ>20 with samtools view <bam_file> | awk '$5 > 20' | head.

And is the bam file also hg38? as you are using a hg38 SNP list.

BTW, the running time is reasonable as you are using the raw dataset (which contains many unfiltered cells) and a big candidate SNP list. Usually, to speed up, we recommend to use the filtered output of cellranger (in outs/filtered_gene_bc_matrices) and a smaller candidate SNP list (e.g., the genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz in this link, which contains 7.4M hg38 SNPs with MAF>0.05).

hxj5 commented 2 years ago

To check if the issue is due to the filtering parameters being too strict, you may want to run the filtered dataset and the 7.4M snp list with --minMAF 0 --minCOUNT 1.

songeric1107 commented 2 years ago

Thank you for the suggestion, I change the parameter to what you suggested. one more question, for the vcf, the chromosome ID is 1,2,3....X, but the reference for alignment is keyed as chr1, chr2,....., will that be a problem?

songeric1107 commented 2 years ago

I tried again for chr1 using the small vcf, the result is still empty. the log has no errors. cellsnp-lite -s sc1_raw/possorted_genome_bam.bam -b sc1_raw/barcodes.tsv.gz -O sc1_chr1_small -R ../ref/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf -p 20 --minMAF 0 --minCOUNT 1 --chrom chr1 --UMItag Auto --gzip

songeric1107 commented 2 years ago

another question, my sample is 5’ scRNAseq from hashtag samples, not the regular 3' scRNA, will that matter?

songeric1107 commented 2 years ago

Xianjie,

I figure out the problem, the bam file is not in the "right format", I should use another format of bam, cmo/outs/multi/count/unassigned_alignments.bam, I tried with the 7.5M vcf file, it works, thank you