Hi @SHuang-Broad
This is a good question, and the best answer I can give is both complicated and not conclusive.
TL;DR - for long reads, retraining will (probably) not change things much, but there might be other future opportunities to use T2T truth in training strategies regardless of the reference used.
Generally, I think DeepVariant will give good results on T2T without retraining specifically for it, and given the greater completeness of the T2T reference, simply using it will probably give better results overall. In the past, we have trained with both GRCh37 and GRCh38, and we don't see the model behaving very differently with either reference.
It's possible that re-training with T2T could lead to marginally better accuracy in some areas, especially in segmental duplications, which are better resolved in T2T. At present, GRCh38 has some segmental duplications that are collapsed, or where a copy is missing. DeepVariant seems to have learned some of the patterns for this, and can sometimes reject variants in regions that look like segdups. This behavior is not necessarily bad, but both using and training with the T2T reference might help it adjust its priors for how likely this is in the T2T reference (we had a poster discussing this phenomenon and how DeepVariant reacts to it).
However, I think the effects of retraining here will be fairly minimal and restricted to either segmental duplication regions or structural variants. In addition, it's unclear to me whether this would matter only for short reads. Right now, the quality of the truth sets is limited by long-read mappability to the reference. With good coverage of HiFi reads, we expect SNP F1 of more than 0.999. It seems likely that the model already knows enough to accurately call variants where the reference can be resolved, and though T2T may help the mapping resolve the remainder, it's unclear whether there is more to learn from further training.
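(For reference, the F1 quoted here is the harmonic mean of benchmarking precision and recall, computed from hap.py-style TP/FP/FN counts:)

$$\mathrm{F1}=\frac{2PR}{P+R},\qquad P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN}$$

so an F1 above 0.999 corresponds to roughly one combined false positive or false negative per thousand true variants.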
This highlights one key point where T2T may help with training - that the current training is limited by the truth set. Training with v4.2.1 truth sets is still constrained by the confident regions of the genome. If we can get fully complete, 100% accurate truth sets covering the genome, this will provide more training examples of difficult regions in the training process, and I think this could further improve a model (whether it's on the T2T reference or on GRCh38). I think there will be an opportunity for this as complete T2T assemblies become available for more samples.
Finally, from a practical perspective, the current v4.2.1 truth sets are relative to GRCh38, so in order to train we'd first need to be able to generate truth variants and confident regions for some sample on T2T. That's certainly doable, but it is tricky to do correctly.
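For concreteness, one plausible route to such a truth set (an assumption on my part, not a description of any existing pipeline) is assembly-based truth generation with dipcall, which emits both a truth VCF and a confident-regions BED against an arbitrary reference; all file names below are placeholders:

```bash
# Hypothetical sketch: derive truth variants and confident regions on T2T
# from a phased diploid assembly with dipcall (file names are placeholders).
run-dipcall HG002_vs_T2T chm13v2.0.fa HG002.pat.fa HG002.mat.fa > HG002_vs_T2T.mak
make -j2 -f HG002_vs_T2T.mak
# Outputs: HG002_vs_T2T.dip.vcf.gz (truth variants)
#          HG002_vs_T2T.dip.bed    (confident regions)
```

The tricky part is less the mechanics than validating that the resulting confident regions are actually trustworthy enough for training labels.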
Hopefully this has answered more questions than it has opened. If this is an area you have ideas about or are interested in collaborating on, we'd certainly be happy to explore those together.
Thank you, Andrew
Hi @SHuang-Broad , I'll close this issue now. Feel free to reopen or ask another question, or reach out directly via email if you'd like to discuss further.
I'm working on a project related to T2T variant calling. @AndrewCarroll, has your group examined this question in more detail? I am currently documenting the impact of T2T vs GRCh38 alignment on variant calling. I can tell you already that it has a large effect on alignment quality.
Documenting the impact is one thing, but retraining DeepVariant is another. I could do it, but I don't look forward to it.
If you're interested, I can provide my CRAM files. I have your published HG002-HG007 Illumina reads at 20X and 30X depth aligned to:
- GRCh38 (w/ BWA-MEM)
- T2Tv2.0 (w/ BWA-MEM)
- GRCh38 (aligned to HPRCv1.1 w/ vg giraffe, surjected to GRCh38)
- T2T (aligned to HPRCv1.1 w/ vg giraffe, surjected to T2T)
- GRCh38 (aligned to a personalized graph* created from HPRCv1.1, surjected to GRCh38)
- T2T (aligned to a personalized graph* created from HPRCv1.1, surjected to T2T)
per the protocol outlined in this paper: https://www.biorxiv.org/content/10.1101/2023.12.13.571553v2 (rough command sketches of the two alignment routes follow below).
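For anyone wanting to reproduce these, here is a rough sketch of the two alignment routes; the flags are approximate and all file names are placeholders, so consult the paper's protocol for exact settings:

```bash
# (1) Linear alignment with BWA-MEM, e.g. against T2T-CHM13v2.0:
bwa index chm13v2.0.fa
bwa mem -t 16 chm13v2.0.fa reads_R1.fq.gz reads_R2.fq.gz \
  | samtools sort -@ 8 -o HG002.t2t.bwa.bam -
samtools index HG002.t2t.bwa.bam

# (2) Graph alignment with vg giraffe against an HPRC graph, then
#     surjection onto a linear reference path embedded in the graph:
vg giraffe -Z hprc-v1.1.gbz -f reads_R1.fq.gz -f reads_R2.fq.gz > HG002.gam
vg surject -x hprc-v1.1.gbz -b -i HG002.gam > HG002.t2t.giraffe.bam
```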
@JosephLalli ,
The current schema of DeepVariant training depends on having GIAB calls, which we use as truth, available against the reference. The GIAB truth set against T2T has not yet been released, so we are currently not using T2T to train our models. Lifting the calls over to the T2T reference would not add much value, as it doesn't extend the truth set; it merely transfers it from one reference to the other.
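(For concreteness, such a liftover would typically be done with a tool like Picard's LiftoverVcf; the chain file and paths below are placeholders:)

```bash
picard LiftoverVcf \
  I=truth.GRCh38.vcf.gz \
  O=truth.T2T.lifted.vcf.gz \
  CHAIN=grch38-to-chm13v2.chain \
  REJECT=liftover.rejected.vcf.gz \
  R=chm13v2.0.fa
```

Every record that fails to lift ends up in the REJECT file, which is exactly the hard, newly resolved portion of the genome that a lifted truth set cannot cover.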
We are connected with the GIAB and T2T team. Once the resources are available, we will add those to our training scheme. Let us know if you have any further questions.
Thanks! I don't know if I agree that a lifted-over variant set would not provide value - after all, even if the variants themselves are the same, we'd expect there to be fewer pileup regions due to differences in how off-target reads align. Even if you're right, there's only one way to know for sure - try it out and see!
That being said, your reasoning makes sense. I'll keep an eye out for any updates.
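(For anyone who does want to try it: generating labeled training examples from a truth set on a new reference would, I believe, look roughly like the sketch below; the flags follow DeepVariant's public training tutorial, and all paths are placeholders.)

```bash
# Hedged sketch of producing labeled examples for retraining; see the
# official DeepVariant training case study for sharding, channel, and
# downsampling options.
make_examples \
  --mode training \
  --ref chm13v2.0.fa \
  --reads HG002.t2t.bam \
  --truth_variants truth.T2T.vcf.gz \
  --confident_regions truth.T2T.bed \
  --examples training_set.with_label.tfrecord.gz
```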
PS - I wonder if someone has begun working with the HG002 Q100 assembly? That can be aligned to GRCh38 or T2T reference and, in theory, used as a ground truth variant set for HG002 reads. (Although, I believe that Zook is still evaluating the viability of that option.)
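(If that pans out, evaluating against a Q100-derived truth set would presumably be a standard hap.py comparison; the file names here are hypothetical:)

```bash
# Hypothetical benchmarking sketch: Q100-derived calls as truth,
# a DeepVariant callset as query, restricted to confident regions.
hap.py \
  hg002_q100_truth.vcf.gz \
  hg002_deepvariant.vcf.gz \
  -f hg002_q100_confident.bed \
  -r chm13v2.0.fa \
  -o hg002_t2t_benchmark
```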
Hi @JosephLalli
It's a good question. Overall, I agree with what Kishwar said about liftover. With historical truth sets (e.g. v3.3.2, which had GRCh38 variants lifted over from GRCh37), we observed artifacts from the liftover process. One factor to keep in mind is that the truth sets have such high label quality that even a few errors make a big difference.
We've been talking with Justin Zook about the T2T Q100 assembly. My expectation is that this will represent the highest quality mechanism to get labels on the T2T assembly. My understanding is that a GRCh38 investigation of this assembly will come first.
So we haven't worked on it yet, and I don't think it's imminent, but I believe it is something we will eventually investigate as the resources become more available.
What timeframe do you think is required for your purposes?
Thank you, Andrew
I agree re: lifting over. I think waiting for HG002-Q100 makes sense.
As far as timeframe goes, that's really up to Justin Zook. I hope to submit for publication by the end of the year, but we'll see...
Thanks for reaching out, Joe
Describe the issue:
I apologize, as this is a question rather than a problem report, so this ticket doesn't use any predefined template.
Here's my question: given the newly released human T2T reference (v2.0), should DV be re-trained against that reference? I must admit I don't understand DV deeply enough to gauge what the potential benefits would be, so I am curious about your thoughts.
Thanks!
Steve
P.S. The data types most relevant for us are CCS and ONT.
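(For what it's worth, the pre-trained long-read models can already be run against T2T without any retraining; a minimal sketch, with placeholder paths, assuming the standard run_deepvariant entry point:)

```bash
# --model_type=PACBIO for CCS/HiFi; ONT_R104 for recent ONT data.
run_deepvariant \
  --model_type=PACBIO \
  --ref=chm13v2.0.fa \
  --reads=HG002.hifi.t2t.bam \
  --output_vcf=HG002.t2t.vcf.gz \
  --num_shards=16
```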