jon-xu / scSplit

Genotype-free demultiplexing of pooled single-cell RNA-Seq, using a hidden state model for identifying genetically distinct samples within a mixed population.
MIT License
39 stars 9 forks source link

VCF from gatk #6

Closed yh154 closed 4 years ago

yh154 commented 4 years ago

Hi,

I am trying to test scSplit on my data composed by 4 samples, while as I have SNPs already called using GATK4 haplotypecaller, I would like to skip the SNP calling step. I got an error message when trying to run "scSplit count" because I don't have 'GL' value in my VCF file. Values in my VCF under "FORMAT" are "GT:AD:DP:GQ:PL". I am wondering would scSplit be able to use other values instead of GL for analysis or do you have plan to support various VCF format in the future?

Example of other values available in my VCF file under "INFO": "ExcessHet=3.0103;FS=0;MQ=60;QD=22.44;SOR=1.609;DP=1;AF=1;MLEAC=1;MLEAF=0.5;AN=2;AC=2"

Error mesaage I got Traceback (most recent call last): File "/usr/local/bin/scSplit", line 402, in count item = mixed_vcf.iloc[0].loc['FORMAT'].split(':').index('GL') ValueError: 'GL' is not in list

Thanks!

jon-xu commented 4 years ago

Hello yh154,

Thanks for raising up this issue!

I have extended the functionality so that the script checked for GL/GP/PL/GT in order to filter out homozygous SNVs when building the matrices.

Please download the updated version of scSplit from Github. I'll put it in next release once it's fully tested.

Thanks, Jon

yh154 commented 4 years ago

Hi Jon, Thank you very much for your prompt reply! I've got a new error at demultiplexing step as followings:

scSplit run -r ref_filtered.csv -a alt_filtered.csv -n 4

Traceback (most recent call last): File "/usr/local/bin/scSplit", line 668, in scSplit() File "/usr/local/bin/scSplit", line 354, in init getattr(self, args.command)() File "/usr/local/bin/scSplit", line 534, in run model = core(num) File "/usr/local/bin/scSplit", line 468, in core model.assigned, model.initial, model.model_af, model.P_s_c = assigned, initial, af, p_s_c UnboundLocalError: local variable 'assigned' referenced before assignment

Thanks a million in advance! Cheers, yh154

jon-xu commented 4 years ago

Hi yh154,

Thanks for your feedback! Glad to see you moved on to the next step.

For the new error, could you please confirm whether there's some non-zero values in your newly built ref_filtered.csv and alt_filtered.csv, please?

Jon

jon-xu commented 4 years ago

It will be also helpful if you could send me your scSplit.log for a check.

Thanks!

yh154 commented 4 years ago

Hi Jon,

Enclosed by the end of this message is the scSplit.log content. I confirmed that ref/alt_filtered.csv contain non-zero values. While, I suspect the fact that I have a very unique testing library composing only 4 valid cells. The corresponding bam came from a 10x run with ~80 valid barcodes in total, but most of the cells are artifacts.

============ The scSplitlog:

/usr/local/bin/scSplit run -r ref_filtered.csv -a alt_filtered.csv -n 4 Loading allele count matrices: 2019-11-11 12:08:57.900791 Allele counts matrices loaded: 2019-11-11 12:08:58.090615 Repeat 1 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 2 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 3 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 4 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 5 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 6 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 7 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 8 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 9 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 10 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 11 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 12 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 13 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 14 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 15 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 16 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 17 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 18 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 19 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 20 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 21 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 22 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 23 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 24 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 25 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 26 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 27 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 28 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 29 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization. Repeat 30 in demultiplexing 5 subpopulations (including doublets if expected) Building model... Not enough informative cells to support model initialization.

====================== top lines from ref_filtered.csv

SNV,AGATCGCA,AGCAGGAA,AGTCACTA,ATCCTGTA 10:1000772,0,0,1,0 10:100986474,6,3,5,0 10:101601331,3,4,6,0 10:101608897,11,3,6,0 10:102067980,1,1,3,0 10:102476724,0,0,1,0 10:102476785,0,0,4,0 10:102476795,0,0,3,0 10:102476883,0,1,5,1

====================== top lines from alt_filtered.csv

SNV,AGATCGCA,AGCAGGAA,AGTCACTA,ATCCTGTA 10:1000772,1,0,1,0 10:100986474,2,4,0,7 10:101601331,0,2,0,5 10:101608897,2,0,0,5 10:102067980,6,2,0,0 10:102476724,1,0,3,0 10:102476785,0,0,5,0 10:102476795,0,0,5,0 10:102476883,3,0,7,0

jon-xu commented 4 years ago

Hi yh154,

Thanks for providing the information!

If you have only 4 cells, why do you need to demultiplex them, may I try to understand?

The problem here is that if not specified, scSplit will assume there’s a doublet cluster, so when you ask for demultiplexing into 4 groups, it’s actually trying to split into 5 groups with one additional doublet cluster. But since you have only 4 cells, scSplit doesn’t have enough cells to be split into 5 groups.

If you specify “-d 0”, scSplit will not add doublet cluster, but again if you have only 4 cells and ask for 4 groups, it’ll be just one cell for each group.

Cheers

yh154 commented 4 years ago

Hi Jon,

This small library is a test of concept experiment we performed earlier for some other purpose. When I ran into problem in my very first message to you (vcf from gatk), in the mean time, I quickly switched to select this small old library to submit to freebayes for SNP calling in order to move on with scSplit. Thanks for your quick response and explanation and I think the error message was overcome, while I indeed have another issue which I think attribute to different Python version (I have python is 3.7.4)

/usr/local/bin/scSplit:280: FutureWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing

See the documentation here: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated alt_or_ref = alt_or_ref.ix[[x for x in alt_or_ref.index if x[0] not in ['X','Y','MT']]]

On the separate note, the updated scSplit threw out new error messages on my gatk vcf input, which I may bring it up to you later.

Cheers!

jon-xu commented 4 years ago

Thanks for pointing out that .ix is deprecated in future versions. Will make a correction for that.

Feel free to let me know what the new issue about GATK is.

Cheers,

Jon

yh154 commented 4 years ago

Hi Jon,

Thanks a million for your response!

Enclosed here is the error message from using gatk vcf format. The VCF file is merged from 4 vcfs called separately for each library/sample.

=========== scSplit count error message:

Traceback (most recent call last): File "/usr/local/bin/scSplit", line 668, in scSplit() File "/usr/local/bin/scSplit", line 354, in init getattr(self, args.command)() File "/usr/local/bin/scSplit", line 430, in count snv_idx = [indexes for indexes, RA in enumerate([float(x.split(':')[pl_index].split(',')[1]) for x in filtered_vcf.iloc[:,-1]]) if RA == 0] File "/usr/local/bin/scSplit", line 430, in snv_idx = [indexes for indexes, RA in enumerate([float(x.split(':')[pl_index].split(',')[1]) for x in filtered_vcf.iloc[:,-1]]) if RA == 0] IndexError: list index out of range

================= Two lines of VCF

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4

1 184991 . G A 48.6 PASS BaseQRankSum=-1.707;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=3.74;ReadPosRankSum=0.085;SOR=0.124;DP=13;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL 0/1:10,3:13:56:56,0,327 ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. 1 1014228 . G A 136.93 PASS ExcessHet=3.0103;FS=0;MQ=60;QD=22.82;SOR=3.912;BaseQRankSum=-1.282;MQRankSum=0;ReadPosRankSum=-1.036;DP=14;AF=0.5;MLEAC=1;MLEAF=0.5;AN=6;AC=5 GT:AD:DP:GQ:PL 1/1:0,6:6:18:151,18,0 1/1:0,3:3:9:81,9,0 ./.:.:.:.:. 0/1:3,2:5:42:42,0,84

Cheers,

jon-xu commented 4 years ago

No problem! Thanks for the great questions.

Here I can see you merged 4 vcfs so that each SNV contain genotype from each sample.

But scSplit expect a vcf called on the mixed sample thus with only one mixed genotype for the mixed population.

This is why scSplit is “genotype-free” because you don’t need to know individual sample genotypes.

So you might need to run GATK on the mixed sample BAM and use that VCF as input for scSplit count.

Cheers!

jon-xu commented 4 years ago

Hi yh154,

How's things going? Did you get the results? Shall we close this issue?

Cheers.

yh154 commented 4 years ago

Hi Jon, Thanks for your following-up! I have a large experiment containing ~25000 cells, so it took a while to get vcf file from it. I am currently in the middle of the last step "scSplit run". I will let you know if I could successfully finish the job.

Cheers

yh154 commented 4 years ago

Hi Jon, scSplit finished successfully, while it took awhile on my large dataset. You may close this thread. Thank you very much!

jon-xu commented 4 years ago

Great to know!

Thanks for helping scSplit get improved as well!