paulhager / smart-phase

A comprehensive and intelligent clinical phasing tool
GNU General Public License v3.0
13 stars 2 forks source link

VCF output question #7

Closed mattjmeier closed 3 years ago

mattjmeier commented 3 years ago

Hello,

I am glad I found your software for the advantages it offers over other phasing methods.

I am using it to phase de novo variants on a human cohort in multiple families. As far as I can tell, the software is running as expected, and outputs a TSV file successfully for each of my trios.

However - the main thing I'm interested in doing is determining which de novo variants can be assigned to their maternal or paternal origin.

My first question is - in the VCF output, should I expect it to conform to the convention where the first allele is paternal and the second allele is maternal?

Second... I'm not sure exactly what is happening but the VCF output only gets the SPGT format tag assigned to chromosome 1. Here is the code I'm running:

java -jar smartPhase.jar \ -a ${vcf} \ -p ${kid} \ -g validation_positions.500bp.up_and_downstream.bed \ -r ${bam1},${bam2} \ -m 30,30 \ -d ${family}.ped \ -o ${kid}.tsv \ -x \ -t \ -vcf \ -c 0.1 \ --physical-phasing 2>&1 | tee ${kid}.smartphase.log

A sample ped file: 1004_21 1004021 0 0 1 0 1004_21 1004022 0 0 2 0 1004_21 1004023 1004021 1004022 2 0

I have confirmed that the BAM files have coverage in all regions of the genome, and that the VCF listing all genotype variants for the trio also includes an even distribution... I'm quite puzzled by all this.

Anyway I'm not sure that anything can be done at your end because there are no errors thrown... the VCF gets written as expected, but just doesn't have the phasing information anywhere except chromosome 1.

The VCF is from GATK HC, and the BAM files are produced according to GATK best practices.

Let me know if you have any idea what might be happening!

Thanks, Matt

timjeske commented 3 years ago

Dear Matt,

thank you for your interest in SmartPhase!

Yes, we stick to the convention that the first allele is paternal and the second allele is maternal when the genotype is phased.

Without having a look at the files, it's hard to tell why you can find phased GTs only for chromosome 1. Does your BED file also contain regions of other chromosomes? If you can provide us some minimal example files, we could try to find out what causes your observation.

Kind regards, Tim

mattjmeier commented 3 years ago

Hi Tim,

Thanks for the clarification on allele ordering.

The BED file does indeed contain loci across the whole genome. I've attached it here (wouldn't let me use the bed extension so I zipped it).

For the BAM and VCF minimal examples... not sure about our policy on this because it's human data. I'll have to ask my supervisor about how to approach it.

How many reads would you need in a "minimal" BAM file? Downsampled using Picard? And for the VCF, I guess you'd need all samples in the trio, but downsampled somehow as well? Do you happen to know the size limit for github attachments?

[edited to delete wrong bed file]

mattjmeier commented 3 years ago

That was the wrong bed file... sorry.

Correct one attached as txt.

Thanks again Matt validation_positions.500bp.up_and_downstream.txt

timjeske commented 3 years ago

Hi Matt,

the BED file looks fine. Maybe you could also send us the log file of SmartPhase? Maybe we can find the issue although no error was thrown.

Best regards Tim

mattjmeier commented 3 years ago

Sure, thanks for taking a look. I've attached an example log.

I was curious about another thing in the log - it seems that the "trio size" varies for each interval, sometimes greater than 3, what does that represent? I couldn't find it in the documentation or the paper.

Thank you again Matt 2018040.smartphase.log

paulhager commented 3 years ago

The trio size lets you know how many of the variants ended up being trio phased.

It seems that SmartPhase is also really phasing in all of the other contigs. Weird that it isn't outputting its results in the final vcf. Are you receiving the results as expected for all contigs in the tsv? Or is SmartPhase only providing output for Chr 1 in all output files?

mattjmeier commented 3 years ago

I see - that makes sense.

Yes, the TSV looks perfectly fine. Is there a way to take the TSV and the VCF after it's done and retry just that part? I'm not sure what is happening in the back end to produce the VCFs - I don't know java well unfortunately.

Thank you Matt

paulhager commented 3 years ago

I changed the logging for the message because it admittedly didn't make much sense without knowing the code, thanks for bringing it up.

I'm taking a look at the code for the writing of the vcf right now. Does SmartPhase give you any output for the other contigs? Is it the original input just without the SPGT tag or does the vcf end at the last variant of chr 1?

mattjmeier commented 3 years ago

I believe it contains all the original input, yes (but without the SPGT tag). I can confirm later today, I don't have access to the server at the moment.

I wonder if it has anything to do with chr prefixes? Just a shot in the dark but I did notice that the TSV sometimes has the chr prefix but the first column doesn't.

paulhager commented 3 years ago

I think it was a problem in the writing of the vcf. There were some edge cases by the switch from one interval to the next that I didn't think of. I've created a new branch with a fix that I've tested here locally but I can't quickly create all possible scenarios. Can you try switching to the write-vcf branch (https://github.com/paulhager/smart-phase/tree/write-vcf) and re-running your setup? If runtime is too high, feel free to cut your data to just the transitions from one contig to the other, but it seems from your log it only took a minute, so it should be fine.

mattjmeier commented 3 years ago

Ah, thank you so much!! That did indeed write the SPGT/SPID tags to the whole file.

One more question for you if you have a sec... Is the "denovo" output from the log meant to denote variants that weren't in the VCF or is it meant to indicate a mendelian violation?

I really appreciate you guys taking the time to help. Thanks again!

Matt

paulhager commented 3 years ago

Denovo indicates positions where neither father nor mother had one of the alleles found in the patient i.e. child has A/T in VCF and mother is A/A and father is A/A (or C/C or w/e).

paulhager commented 3 years ago

I just saw that the last line of the vcf wasn't being printed correctly. I fixed and pushed to the write-vcf branch again. Please pull it again.

Thanks for bringing this to our attention and if you find anything else feel free to open another issue!

mattjmeier commented 3 years ago

Oh, thank you for telling me. I pulled it again as you suggested. And thanks for the info about the de novo mutations.

No problem, I will let you know if anything else comes up!