Closed Phillip-a-richmond closed 1 year ago
Hi @Phillip-a-richmond
Thank you for the report. You are correct, that the behavior out-of-the box for DeepTrio can currently be sub-optimal for the sex chromosomes in male samples. We've benchmarked some strategies for dealing with this.
In the short term, our recommendation is to run separate calling on the non-PAR regions of ChrX and ChrY, where only the mother sample is provided as the parent for calling of the son, and (less importantly as it is unclear whether this is an issue with chrY) only the father sample is provided for calling chrY on the son.
In the documentation, this is expressed in the following statement: "Also please note: for the non-PAR regions of the sex chromosomes (X and Y), we recommend running these providing only the parent who contributed the child's chromosome (e.g. for chromosomeX, only the mother and son samples and for chromosomeY only the father and son samples)."
The DeepTrio manuscript has benchmarks for this strategy in the following section (near the end of results):
The Genome in a Bottle truth sets do not contain chromosomeX or chromosomeY variants in a male individual. As a result, DeepTrio has never been trained with hemizygous sites. Because we train DeepTrio to perform duo calling, it is likely that DeepTrio would call variants on chromosomeX similar to how it would call a duo sample. To assess this, we ran DeepVariant and DeepTrio on chromosomeX of the son (HG002) and measured the number of heterozygous variant calls in the non-PAR regions of chromosomeX.
For DeepVariant, 4.45% (455/101866) of calls in non-PAR regions of chromosomeX are heterozygous. In DeepTrio, 24.5% (21633/88314) are heterozygous. This substantial difference suggests that applying DeepTrio directly to chromosomeX in male samples is problematic. Since chromosomeX in males is inherited from the mother, we performed calling on chromosomeX with only the mother provided as the parent. This reduced heterozygous calls to 3.37% (3518/104427), which is better than in the DeepVariant case. For male samples, this recommends that variant calling should be run with both parents on the autosomal and PAR regions using a BED file to restrict location, and additional variant calling should be performed using only the mother’s file provided as parent for the non-PAR regions of chromosomeX, and only the father’s provided for the non-PAR regions of chromosomeY.
This experiment indicates that allowing the model to infer a hemizygous chromosome through coverage and explicitly training for hemizygous variants is an opportunity for improvement, both for DeepVariant and DeepTrio.
Over the long term:
We anticipate that the T2T consortium will be shortly be making available complete assemblies of ChrX and ChrY for the HG002 sample. This should allow us to make training labels for ChrX and ChrY in the hemizygous case, and we have a few internal ideas around how to use this to make models that will generally take this into account.
I would be curious in your feedback about whether the short term solution is acceptable, whether you find it insufficient, or if you would recommend other approaches for us to address the issue.
Thank you, Andrew
Hi @Phillip-a-richmond
As an additional follow-up, I am wondering if my answer is off the mark, that I have discussed this with respect of accuracy of calling, while you would prefer fixes to the VCF representation. Is your main request that we have the ability to output a hemizygous representation of the VCF? This is something we are starting to think about for VCF and gVCF as we prepare to work with the T2T assemblies.
It's our anticipation that the sex-aware model would call these variants as 1/1 (since we will use that output class for hemizygous variants). It's also our anticipation that providing only the mother ChrX will result in a 1/1 call. However, we won't represent this specifically as hemizygous.
In order to resolve your issue, do we need to solve this hemizygous representation problem specifically, or will producing more 1/1 calls in the manner described in my prior comment be sufficient?
Hi Andrew,
Thanks for your response. There are a few items here which I'll address.
Yes proper representation on X chromosome is important, and it's caused this problem for downstream analyses of pathogenic hemizygous variants. Likely would be true for de novo hemizygous variants on male non-PAR chromosome X. If this were fixed sooner than later that would increase confidence in DeepTrio. Until that's fixed, I'll be shifting back to DeepVariant GVCF->GLNexus for all rare disease trio samples.
If a variant is present in 100% of the reads, representing as 1/1 would be ideal as this is what's expected by downstream tools. The expectation for a male is to have majority variants on X as 1/1. It would be dangerous if a female had majority 1/1.
Sounds like this involves a lot of mangling of samples and VCF cutting/manipulation...and no I'm not thrilled about having this pushed on the user. You're recommendation would be:
Essentially you're asking the user to run DeepTrio >2 times for it to work on a trio with a male proband, including manual VCF mangling on the user side. It would be much appreciated if this process can be internalized and parameterized within the run deepvariant command. Alternatively, if this is your best practice then providing code chunks to this effect would be appreciated.
At the end of the day, me, as a user, would prefer to run 1-2 commands to get a trio merged VCF. If you provided that code chunk to run the existing tool in the configuration you describe, then I'd be happy to insert it and test on my cases.
Thanks, Phil
Hi @Phillip-a-richmond
I understand agree that this workflow is cumbersome and far from ideal. Here is the course of action that we'll propose to take:
Within the next 2 weeks, we anticipate that we'll likely to have GIAB labels for X and Y. The first course of action that we'll take is to incorporate those and attempt to train a new model with this. Based on whether this looks promising, we may ask if you are interested to test a Docker image with this model and provide feedback on it.
I can't guarantee that this will work, but I think it has reasonable odds. If it does not, we do have other ideas for X and Y calling, but they would take a bit more time.
I will plan to reach back out in 2 weeks with an update about the labels and a refined estimate for when we might have the model.
Thank you, Andrew
Yes this sounds like a great plan. Happy to try out your docker beta version too see if it fixes this hemizygous issue.
Have a good weekend, Phil
Hi @Phillip-a-richmond
Just as an update on this issue, we have the truth sets and have trained some experimental models. We're still in the process of refining these models, as well as how to use the T2T truth sets. So we do have progress, but it's still going to take time before we're ready with something for you to look at.
Thanks @AndrewCarroll! Happy to test beta version when you produce them.
Hi @Phillip-a-richmond , I want to give an update on this issue:
As part of working on v1.4.0, we did some experiments on this, which is still work in progress.
If you can reach out to me and @AndrewCarroll (you can email me at pichuan@google.com), we can follow up on an experimental model for you to test, if you're still interested.
Hi @Phillip-a-richmond and everyone else who might be following this thread, One update:
one post-processing trick that you can do now is: take the probability distribution, and ignore the 0/1 one, and just renormalize the other two. From there, you can decide whether it should be a 0/0 or 1/1. (This idea came from our collaborator @doron-st at Ultima Genomics. They actually verified this on a dataset and showed good precision. They did this on DeepVariant, not DeepTrio. But I think the same trick can be applied)
Our team is currently considering building this into an option in postprocess_variants
, so that we can do this re-normalization ourselves.
(Another more principled approach will be to build this knowledge into the modeling part. We're also considering that, but that will be an even longer-term solution.)
Thanks @Phillip-a-richmond again for reporting this. We have filed an internal issue to track this work. I'll close this bug for now.
Follow up from my previous comment, @doron-st has shared his script here: https://github.com/Ultimagen/VariantCalling/blob/c2634d8c08db4473e61b786eed06afc2bb8fccd1/ugvc/pipelines/convert_haploid_regions.py (Thanks Doron!)
Any updates on this? The link for the script is broken now.
@nurmians Still looking through the code - and looked at the release notes - though still reasoning through it in case it might have been implemented via some alternate logic. In any case, here's a link to the script for converting genotypes of regions to haploid calls:
https://github.com/Ultimagen/VariantCalling/blob/master/ugvc/pipelines/convert_haploid_regions.py
Hi @nurmians , The link the @pgrosu link to seems like the right one from Ultima!
Our team is working on incorporating this as part of postprocess_variants. The official version will be out in the next release (before end of 2023). You can get a preview of these 2 flags that we'll be adding: https://github.com/google/deepvariant/blob/dev/deepvariant/postprocess_variants.py#L170-L188
(Note that we don't officially support the code in "dev" branch. If you want an official documentation and usage of this, please wait for the release later. But please feel free to take a look at the code if you like)
Hi Pi-Chuan,
I was thinking about the nice upcoming changes a bit. It will probably be easier to consolidate the two proposed flags into only one parameter such as --non_recombinant_regions
(or --hemizygous_regions
), since all the other regions go through recombination. The PAR regions is already the default behavior of DeepVariant as it assumes diploid recombinant behavior, so you don't need the --par_regions_bed
flag.
~p
Hi @pgrosu
Thank you for the suggestion. The exact way to manifest the flags came up in internal discussion. I actually proposed basically the approach you detail here (I wanted to call the parameter --haploid_regions).
The counter-argument was that people who have BED files for what they want to call more often have these files already in the format of the chromosome and a separate file of the PAR regions. So it is possible to generate the --haploid_regions or --hemizygous_regions using a bedtools subtract command, but that does require one other step and then users who wanted to document or publish what they have done have a non-standard file.
I was persuaded to drop my suggestion and endorse the solution you currently see in L170-L188 by this argument. However, I also did not see a large relative advantage with either approach from a user perspective, so I did not advocate particularly forcefully. I do like that the current arrangement will be able to specify some pretty standard community files.
If you think the difference for a user will be more strongly in preference of the single flag, we'd be open to the discussion, but hopefully the rationale for the current flag choice makes a bit more sense.
Hi Andrew,
I think we are in perfect agreement. Supporting what the community wants is my preferred approach as well.
I was only suggesting it after having performed a reduction upon the functional set of DeepVariant's design.
Just in case the community requires more variety of labeled inputs, having a parameter processor might help in mapping between DeepVariant's core operation and different parameter sets. This would create flexibility for both the goals of the design team and community.
For example, the design team might want flexibility in playing around with different transformations of read signals and parameter search spaces - which might change over time or with different data sets - without worrying that the parameter list might require updating.
Let's take the following simplified model optimization:
The team would let this model optimize itself to determine the parameter set under which it performs ideally. Then metadata about the training would be saved along with the optimized model. Then users can query the model's metadata via some program, which would provide information about the training data and parameters limits for different model types. Such metadata would be automatically populated through the model training step (where I'm including the validation and tuning steps). Based on that, a default set of parameters can be auto-generated for general users, and advanced users can then map their custom parameters however they want. It basically makes the training turn-key and usage automatic. Such training could even be remotely initiated through a browser, and different users can make some of their models available to others if they want, thus helping out the community and growing it at the same time.
This would also make adapting new models much faster, including where a large set of specialized ensemble models might be required to check against.
Best regards, Paul
Hello,
Noticed this issue with your tool DeepTrio regarding the representation of hemizygous variants in the non-pseudoautosomal (PAR) X-chromosome. This may be fixed now in 1.3? If so ignore this, but if not this is what I noticed.
Note, this is a simulated pathogenic variant from bamsurgeon, but the VCF representation is the focus of this problem.
Let's start with an IGV snapshot of the variant:
Now, I'll go into the representation from the DeepVariant --> GVCF --> GLnexus pipeline:
DeepVariant Pipeline:
Variant:
VCF header:
Variant line
DeepTrio
Now, with the DeepTrio -> GVCF -> GLNexus pipeline: Pipeline
Variant
Header
Variant line
Here the male proband, fully hemizygous for the variant, is represented as 0/1: 0/1:18:0,18:4:33,0,1:..
Why it matters
Now, this is a BIG problem, because downstream tools like Exomiser will treat the output here for variant prioritization in rare disease cases. So take a look at this.
Exomiser HTML for DeepVariant, where variant is represented as 1/1. This gene is the top ranked hit for this simulated rare disease case. Good:
Exomiser HTML output for DeepTrio, where variant is represented as 0/1. Notice that now the gene is not even in the top 10 candidates, as Exomiser has interpreted this variant incorrectly (likely assuming it was a PAR region on X). The variant score dropped to 0.00 because the assumption on the GT string was violated:
I can send the HTML outputs, VCFs, or even CRAMs if you are interested by email or other file transfer.
Anyways, going to try with 1.3 while I wait for an answer here.
Cheers, Phil