jon-xu / scSplit

Genotype-free demultiplexing of pooled single-cell RNA-Seq, using a hidden state model for identifying genetically distinct samples within a mixed population.
MIT License

Issue with scSplit count function... #19

Closed SathiyaNManivannan closed 2 years ago

SathiyaNManivannan commented 2 years ago

Hello!

I am trying to use scSplit on single-cell data with pooled samples from mouse embryos. The data was processed using 10x Cell Ranger. The output bam file was processed per instructions on the GitHub page and the vcf file was generated using freebayes. I then used bcftools to filter the vcf files based on quality.

When I try to use scSplit on the vcf file, I keep running into an error!

Traceback (most recent call last):
  File "scSplit/scSplit", line 699, in <module>
    scSplit()
  File "scSplit/scSplit", line 357, in __init__
    getattr(self, args.command)()
  File "scSplit/scSplit", line 460, in count
    raise ValueError('Empty matrices!')
ValueError: Empty matrices!

Python --version
Python 3.8.2

Please help with troubleshooting this issue!

Thanks,

Sincerely,

Sathiya


jon-xu commented 2 years ago

Sathiya,

According to the guideline: g) If this step fails, please check:
1) Is the barcode tag in your BAM files "CB"? If not, you need to specify it using -t/--tag.
2) Are you working on a mixed-sample VCF rather than a simple merge of individual genotypes?
3) Is the correct whitelist barcode file being used? The whitelist should be the trusted barcodes from your sequencing result, not the whole barcode library of the sequencing protocol.

BTW, our tool has only been tested with 8 or fewer mixed samples. We also recommend removing doublets with a doublet-detection tool before running scSplit, to improve accuracy.
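Check 3 above can be approximated with a quick overlap count between the barcodes seen in the BAM and the whitelist. The following is a minimal sketch, not part of scSplit: the function name `whitelist_overlap` and its inputs are hypothetical, and it assumes the barcodes have already been pulled out of the BAM (for example from the CB tag) into a Python list. It also reports the overlap after dropping a trailing "-1"-style suffix, since a whitelist written without that suffix would not match barcodes such as "TGGCGCACACGAAACG-1" under exact string comparison.

```python
def whitelist_overlap(bam_barcodes, whitelist_lines):
    """Compare barcodes observed in a BAM with a whitelist (hypothetical helper).

    bam_barcodes: iterable of barcode strings taken from the BAM tag.
    whitelist_lines: iterable of lines from the whitelist file.
    """
    bam = set(bam_barcodes)
    # Strip newlines/whitespace and ignore empty lines in the whitelist.
    whitelist = {line.strip() for line in whitelist_lines if line.strip()}

    def strip_suffix(barcode):
        # "AAAC-1" -> "AAAC"; a barcode without "-" is returned unchanged.
        return barcode.rsplit("-", 1)[0]

    exact = bam & whitelist
    loose = {strip_suffix(b) for b in bam} & {strip_suffix(b) for b in whitelist}
    return {"exact": len(exact), "ignoring_suffix": len(loose)}
```

If "exact" is zero while "ignoring_suffix" is not, the two files most likely differ only in the barcode suffix, which would be one way to end up with empty matrices.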

SathiyaNManivannan commented 2 years ago

Jon-Xu,

Thank you for your response. I went through my BAM file and found that the barcodes are indicated using "CB:Z:". Is ":Z:" a problem? I also went through the other troubleshooting steps, including checking the barcodes in the filtered 10x output for the specific samples. We expect about 6 samples to be mixed together in the data. We removed doublets from the barcode whitelist and filtered the BAM file for the retained barcodes.

Example line from the BAM file:

K00400:128:HC2Y3BBXY:1:1115:21156:5939 16 1 3063419 255 111M30S * 0 0 CAGATCTCATTATGGGTAGTTGTGAGCTACCATGTGGTTGCTGGGATTTGAACTCAGGACCTTCGGAAGAGCAGTCGGATGCTCTTACCCACTGAGCCATTTCACCAGCCCCCCATGTACTCTGCGTTGATACCACTGCTT FAJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJFJJJJJJJJJJJFFFAA NH:i:1 HI:i:1 AS:i:109 nM:i:0 RE:A:I xf:i:0 li:i:0 BC:Z:CCAGGAGC QT:Z:AAFFFJJJ CR:Z:TGGCGCACACGAAACG CY:Z:AAFFFJJJJJJJJJJJ CB:Z:TGGCGCACACGAAACG-1 UR:Z:CATTTTAAAT UY:Z:JJJJJJJJJJ UB:Z:CATTTTAAAT RG:Z:mat_conE9_5:0:1:HC2Y3BBXY:1
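On the ":Z:" question: in the SAM format, optional fields are written as TAG:TYPE:VALUE, and Z is the type code for a printable string, so "CB:Z:TGGCGCACACGAAACG-1" is a CB tag holding a string value. A minimal sketch of how such a record splits into tags is below; `parse_sam_tags` is a hypothetical helper, and the record is abbreviated from the line above (CIGAR, SEQ and QUAL are shortened placeholders, and only some tags are kept).

```python
def parse_sam_tags(record):
    """Return {TAG: (TYPE, VALUE)} for the optional fields of one SAM record."""
    # SAM records are tab-separated; splitting on any whitespace also handles
    # a line pasted with spaces, as long as no tag value contains a space.
    fields = record.split()
    tags = {}
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        parts = field.split(":", 2)  # the value itself may contain ":"
        if len(parts) == 3:
            name, type_code, value = parts
            tags[name] = (type_code, value)
    return tags


# Abbreviated version of the record shown above (placeholder CIGAR/SEQ/QUAL).
record = (
    "K00400:128:HC2Y3BBXY:1:1115:21156:5939 16 1 3063419 255 4M * 0 0 "
    "CAGA FAJJ NH:i:1 CR:Z:TGGCGCACACGAAACG CB:Z:TGGCGCACACGAAACG-1 "
    "UB:Z:CATTTTAAAT RG:Z:mat_conE9_5:0:1:HC2Y3BBXY:1"
)
tags = parse_sam_tags(record)
print(tags["CB"])  # ('Z', 'TGGCGCACACGAAACG-1')
```

Read this way, the record does carry the "CB" tag, so the first check in the guideline appears to be satisfied for this read.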

Kindly let me know if there is anything else I need to do to overcome this empty matrix issue.

Thanks,

Sincerely,

Sathiya

jon-xu commented 2 years ago

Hi Sathiya,

Sorry for the late reply. Could you please show me a few lines of your VCF file?

Thanks, Jon
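One way to pull out those lines is sketched below. It is an illustrative helper, not part of scSplit: `peek_vcf` is a hypothetical name, and it assumes a plain-text or gzip/bgzip-compressed VCF. It returns the sample names from the `#CHROM` header line along with the first few records, which also helps with check 2 of the guideline, since a VCF called on the pooled BAM would normally list a single sample column rather than one column per individual.

```python
import gzip


def peek_vcf(path, n_records=5):
    """Return (sample_names, first n_records data lines) from a VCF file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    samples, records = [], []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # skip meta-information lines
            if line.startswith("#CHROM"):
                # Columns after the 9 fixed ones (CHROM..FORMAT) are sample names.
                samples = line.rstrip("\n").split("\t")[9:]
                continue
            records.append(line.rstrip("\n"))
            if len(records) >= n_records:
                break
    return samples, records
```

Calling `peek_vcf("filtered.vcf")` (the file name here is a placeholder) and pasting the returned sample list and records into the thread would give the information requested above.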