jon-xu / scSplit

Genotype-free demultiplexing of pooled single-cell RNA-Seq, using a hidden state model for identifying genetically distinct samples within a mixed population.
MIT License
41 stars 9 forks source link

syntax error scSplit count #10

Closed dwkarjosukarso closed 4 years ago

dwkarjosukarso commented 4 years ago

Hi, I am trying out the tool to use in my research. I am using the following command: python /home/karjosukarso/scSplit/scSplit scSplit count -v CTRL1_merged_filtered.recode.vcf -i CTRL1_merged_deduplicated.bam -b whitelist.txt -c /home/karjosukarso/genome_ref/common_snvs_hg38 -r ref_filetered.csv -a alt_filtered.csv -o results

The following error appear: File "/home/karjosukarso/scSplit/scSplit", line 459 base_calls_mtx[1].to_csv(os.path.join(args.out, args.alt) ^ SyntaxError: invalid syntax

I checked the version of python in the conda env I created for scSplit (python --version) and it is python 3.6.

Do you have any idea what I could do to resolve this error? Thank you, Dyah

jon-xu commented 4 years ago

Hi Dyah,

Sorry for the bug, please try the most updated v1.0.7.

Thanks for pointing it out! Jon

dwkarjosukarso commented 4 years ago

Hi Jon, thanks for your quick reply!

I have tried v1.0.7 and now have different error: Traceback (most recent call last): File "/home/karjosukarso/scSplit-1.0.7/scSplit", line 695, in scSplit() File "/home/karjosukarso/scSplit-1.0.7/scSplit", line 357, in init getattr(self, args.command)() File "/home/karjosukarso/scSplit-1.0.7/scSplit", line 456, in count raise ValueError('Empty matrices!') ValueError: Empty matrices!

Do you think this is also a bug or something wrong from my input files?

Thanks, Dyah

jon-xu commented 4 years ago

Hi Dyah,

This is not a bug.

For a situation of empty matrixes, please check: 1) is your barcode tag in the BAM files "CB" - if not, you need to specify it using -t/--tag; 2) are you working on a mixed sample VCF rather than a simple merge of individual genotypes? 3) is the correct whitelist barcode file being used?

Cheers,

Jon

dwkarjosukarso commented 4 years ago

Hi Jon, thanks for the help.

I think it is also good to mention that I am working with single cell data generated from Celseq2 (plate-based format). This is what was performed before scSplit

  1. demultiplexing and mapping using Celseq2 pipeline
  2. merge the sam files of each cell generated from Celseq2 pipeline
  3. sam to bam, sort and umi dedup
  4. generate vcf file (the sample is a mixed from 5-8 donors, we have to check the exact number)

I checked the BAM file I used for scSplit count and it does not really have any barcode tag, I think because of the demultiplexing already performed by Celseq2 pipeline. NS500173:399:HJ2MCBGX5:1:11106:7631:4324:UMI:GTGGAAAC: 0 chr11 65211863 23 50M * 0 0 ATGTCTTAGTGATTTTAAAGGGGACTCTTCAGGGACTTGTGTACTGGTTA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE AS:i:-15 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:0T1A1A45 YT:Z:UU NS500173:399:HJ2MCBGX5:1:11106:4610:8169:UMI:TCCTCAGG: 0 chr9 6428994 24 50M * 0 0 GACTCTGCTAAAAATATAAAAGGTTAGACACTATGGTGGTGGATACCTGT AAAAAEEEA/EE/EEEE/EEEEEEEEEEE/EEEEEEAEEEEEEEE/EEEE AS:i:-10 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:0A0T48 YT:Z:UU

I am aware that scSplit is created generally for droplet-based sequencing, but any chance it can be used for a plate based data? Maybe with a different processing before scSplit?

Thanks a lot!

Cheers, Dyah

jon-xu commented 4 years ago

Hi Dyah,

If you have one SAM per cell, when you merge them into one BAM, you need to add the cell barcode in your script (for example with XC tag, which you need to specify later in scSplit count).

I’ve tested that and it should work.

Cheers,

Jon

dwkarjosukarso commented 4 years ago

Hi Jon, sorry for asking again. I managed to have CB tag in the BAM file, but it is still giving the same error.

My current BAM:

NS500173:399:HJ2MCBGX5:1:11312:3277:11864 0 chr1 629268 12 63M * 0 0 ACTTCTGGCCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACT AAAAAEE/EEAEAEEEEEEEEEAEE/EEEEEEEEEAEEAA/A<EEE<EAEAAAAEEEAA//E< CB:Z:GCAAGAAG UB:Z:TTGCAACC MD:Z:7A55 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-3 XS:i:-8 YT:Z:UU NS500173:399:HJ2MCBGX5:4:11508:19261:10847 0 chr1 629268 30 63M * 0 0 ACTTCTGACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACT AAAAAEEEEAEAE<EE<EA6EEAE/E/EEEAEEEEEEAEE/A/EEAAEEEEEEEEEAAEEAEE CB:Z:TCGGTTGA UB:Z:GACGCAAT MD:Z:63 XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 XS:i:-5 YT:Z:UU NS500173:399:HJ2MCBGX5:4:11607:24910:14366 0 chr1 629268 30 63M * 0 0 ACTTCTGACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACT AAAAAEEAEEEEEEEEEEEEEEEEEEEEEEAEE<EEAEEE<</AE/AEAE/EEEEEEA//<E/ CB:Z:GAGGGTAG UB:Z:GAGAGAGA MD:Z:63 XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 XS:i:-5 YT:Z:UU

The vcf file:

fileformat=VCFv4.2

FILTER=

fileDate=20200407

source=freeBayes v1.3.2-dirty

reference=/home/karjosukarso/genome_ref/GRCh38.primary_assembly.genome.fa

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

cell barcode whitelist CGTCTAAT AGACTCGT GCACGTCA TCAACGAC ATTTAGCG ATACAGAC TGCGTAGG TGGAGCTC TGAATACC TCTCACAC TACTGGTA

command: /home/karjosukarso/scSplit-1.0.7/scSplit count -v CTRL1_merged.vcf -i INV_JPGIII195-CTRL1_tag_sorted_dedup.bam -b whitelist.txt -c /home/karjosukarso/genome_ref/common_snvs_hg38 -r ref_filetered.csv -a alt_filtered.csv -o results

Error:

Traceback (most recent call last): File "/home/karjosukarso/scSplit-1.0.7/scSplit", line 695, in scSplit() File "/home/karjosukarso/scSplit-1.0.7/scSplit", line 357, in init getattr(self, args.command)() File "/home/karjosukarso/scSplit-1.0.7/scSplit", line 456, in count raise ValueError('Empty matrices!') ValueError: Empty matrices!

Anything I could change to make it work? Thank you again.

Cheers, Dyah

jon-xu commented 4 years ago

Could you please send me your vcf file?

dwkarjosukarso commented 4 years ago

I zipped it for sending purposes. CTRL1_merged.zip

jon-xu commented 4 years ago

Hi Dyah,

I tested on the data you sent me, it ran well. Although I manipulated some data to match reads and vcf, the file format (CB tag and vcf format) should be fine.

So here's some possible points you might want to further check: 1) are the barcode white list correctly covering the BAM file? Can you find some reads for any barcode in your list? 2) it's very unusual but do you have most GL(RA)<-0.004, which is GP(RA)<0.99 in your vcf rows? It should be the second item in the GL field, because if such it means your SNVs called from the mixed samples are not heterozygous, and thus we cannot use any to distinguish between different samples;

Please let me know. Thanks, Jon

dwkarjosukarso commented 4 years ago

Hi Jon, thanks for testing it.

  1. I checked the BAM files and I can find back the sequences behind CB tag in the whitelist file (tab separated)
  2. I am not completely sure of what you meant since I am not experienced with VCF files, but I tried filtering for heterozygous SNVs using bcftools (command: bcftools view -g het CTRL1_merged.vcf) and based on it I have heterozygous variants.

Thanks again, Dyah

jon-xu commented 4 years ago

ok, so basically "scSplit count" does below steps:

  1. read in the vcf file
  2. filter it with the common SNVs
  3. keep only heterozygous SNVs
  4. for each of the filtered SNVs, get all the reads overlapping the SNV
  5. for each of the reads covering the SNV, check if the read flag is smaller than 256
  6. get the barcode from the read and check if it exists in the barcode list
  7. if yes, check whether the SNV position in the read is REF or ALT and put into the count matrices
  8. check if the count matrices are empty

We have checked 3, 6 while 1, 4, 7, 8 was tested by nature. Although I still don't know why but you can further check on 2 and 5:

for 2: whether the common SNVs have intersection with the SNVs in your vcf for 5: how many of your read having read flag < 256?

If you still cannot find out the reason after that, I would suggest you send me all the input files for check, and if the BAM is too big for transfer, you can send me a proportion of it. And for data security, you can send me the link to download at jun.xu@uq.edu.au directly.

Thanks, Jon

jon-xu commented 4 years ago

Hi Dyah,

as we discussed in mail, the reason of your getting empty matrices is due to the inconsistent format of common SNV list and that in your data.

I have added a reminder into the software instructions - thanks for your help to locating the issue.

This issue is now closed. Jon