jon-xu / scSplit

Genotype-free demultiplexing of pooled single-cell RNA-Seq, using a hidden state model for identifying genetically distinct samples within a mixed population.
MIT License

Issues with scsplit count #24

Closed yilevine closed 9 months ago

yilevine commented 1 year ago

Hello!

I am trying to use scSplit on snRNA-seq data with 4 mixed samples, but I always get errors with the command below.

scSplit count -c $VCF/common_snvs_hg38_scsplit -v $SCSPLIT_OUTDIR/freebayes_region_var_qual30.recode.vcf -i $SCSPLIT_OUTDIR/filtered_bam_dedup_sorted_named.bam -b $BARCODES/barcodes_merged_IDCM_20pc.tsv -r $SCSPLIT_OUTDIR/ref_filtered.csv -a $SCSPLIT_OUTDIR/alt_filtered.csv -o $SCSPLIT_OUTDIR

[Image: error]

I suspected the problem might be that the VCF file contains a mix of samples rather than a single sample, as mentioned in #6 and #19. So I checked my VCF file, which was generated by freebayes:

freebayes-parallel targets.regions 27 -f $FASTA/genome.fa $SCSPLIT_OUTDIR/filtered_bam_dedup_sorted_named.bam > freebayes_region_var.vcf

[Image: vcf]

I also checked my BAM file.

[Image: bam_file]

Therefore, I believe the problem is the VCF file generated by freebayes.

But I don't understand why I got a mixed VCF file. Do you have any idea how to fix it?

Thanks! Yile

jon-xu commented 1 year ago

Hi Yile,

I guess it's related to the RG tag in your BAM file. Check this out, and you might want to contact the freebayes developers for more details: https://github.com/freebayes/freebayes/blob/master/README.md

Jon
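A minimal sketch for checking the read-group point above, assuming freebayes creates one VCF sample column per distinct SM value in the BAM header's @RG lines. The function names `count_vcf_samples` and `list_rg_samples` are made up for illustration and are not part of scSplit or freebayes.

```shell
# Count the sample columns in a VCF. The #CHROM header line has 9 fixed
# columns (#CHROM through FORMAT); every column after that is one sample.
count_vcf_samples() {
    grep -m1 '^#CHROM' "$1" | awk -F'\t' '{ n = NF - 9; if (n < 0) n = 0; print n }'
}

# List the distinct SM (sample) values found on @RG header lines.
# Reads SAM header text on stdin.
list_rg_samples() {
    grep '^@RG' | tr '\t' '\n' | grep '^SM:' | cut -c4- | sort -u
}

# Example usage (file names are placeholders):
#   count_vcf_samples freebayes_region_var.vcf
#   samtools view -H filtered_bam_dedup_sorted_named.bam | list_rg_samples
```

If the second command prints more than one name, the BAM header carries several SM values, which would be consistent with a VCF that has more than one sample column.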

yilevine commented 1 year ago

Hi Jon,

Thanks for your reply. I will check it out.

Best, Yile

yilevine commented 1 year ago

Hi Jon,

I have solved the RG tag problem. Below are my new BAM file and VCF file.

[Images: bam, vcf]

So I moved forward. However, I have another problem when running scSplit count:

singularity exec --bind $SINGULARITY/heart $SINGULARITY/Demuxafy.sif scSplit count -c $VCF/common_snvs_hg38_scsplit -v $SCSPLIT_OUTDIR/freebayes_region_var_single_qual30.recode.vcf -i $SCSPLIT_OUTDIR/filtered_bam_dedup_sorted_named_RG.bam -b $BARCODES/barcodes_merged_IDCM_20pc.tsv -r $SCSPLIT_OUTDIR/ref_filtered.csv -a $SCSPLIT_OUTDIR/alt_filtered.csv -o $SCSPLIT_OUTDIR

I allocated 150 GB of memory and 30 hours, but I didn't get any results. Below is the scSplit.log.

[Image: scSplit.log]

I don't know what the problem with my input files is. Or should I let it run longer?

Thanks. Yile

jon-xu commented 1 year ago

Sorry for the late reply, Yile. I’m not sure about Demuxafy, but your numbers of barcodes and positions both look quite big, which might lead to the performance issue.

Please:
1) check the freebayes parameters and filter criteria mentioned in our documentation - in most cases we see around 100,000 informative SNVs.
2) use the most up-to-date version of scSplit directly from GitHub, as we solved some performance issues in the newest release.

Hope it helps, but it’s true that we didn’t test that many cells in our development.
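Before launching a long run, the size of the inputs can be checked with something like the sketch below. It assumes a barcode file with one barcode per line and an uncompressed VCF; the function names are made up for illustration. The VCF record count is only a rough indicator, since it is not the same thing as the number of informative SNVs that scSplit count ends up using.

```shell
# Count non-empty lines in a barcode file (one barcode per line assumed).
count_barcodes() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Count variant records in an uncompressed VCF, skipping header lines
# (those starting with '#') and blank lines.
count_variants() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Example usage (file names are placeholders):
#   count_barcodes barcodes_merged_IDCM_20pc.tsv
#   count_variants freebayes_region_var_single_qual30.recode.vcf
```

If either number is far above what the documentation describes, stricter filtering of cells or variants before scSplit count is probably worth trying.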

yilevine commented 1 year ago

Hi Jon,

Thanks very much for your reply. The problem was indeed the number of cells. After filtering, scSplit ran successfully.

Best, Yile