google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

CLR PacBio data #347

Closed jdmontenegro closed 3 years ago

jdmontenegro commented 3 years ago

Hi team,

I am very interested in using DeepVariant on my dataset to identify novel SNVs and indels in a diploid plant. However, I will probably need the ability to identify non-reference heterozygous SNVs (1/2 variants). Is DeepVariant able to discover these variants, or does it need one of the alleles to be the same as the reference? Second, I read on the main page that you support Illumina, HiFi PacBio and ONT, but I did not find any information on CLR PacBio. Do you no longer support this kind of read?

Kind regards,

Juan D. Montenegro

AndrewCarroll commented 3 years ago

Hi @jdmontenegro

For the question about multi-allelic heterozygous calls: yes, DeepVariant is able to call 1/2 events, and will represent these in one line as a GT 1/2 call in the VCF.
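For readers who want to pick such calls out of a VCF afterwards, here is a minimal sketch in plain Python. It assumes a standard tab-separated VCF data line with a `GT` key in the FORMAT column. The function name `is_nonref_het` and the example record are hypothetical illustrations, not DeepVariant code or real DeepVariant output.

```python
def is_nonref_het(vcf_line: str, sample_index: int = 0) -> bool:
    """Return True if the sample genotype has two different non-reference alleles (e.g. 1/2).

    Hypothetical helper for illustration; assumes a single tab-separated VCF data line.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    alts = fields[4].split(",")             # ALT column: comma-separated alternate alleles
    format_keys = fields[8].split(":")      # FORMAT column, e.g. GT:GQ
    sample_values = fields[9 + sample_index].split(":")
    gt = sample_values[format_keys.index("GT")]

    # Treat phased ("|") and unphased ("/") genotypes the same way.
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return False                        # not diploid, or a missing call

    a, b = (int(x) for x in alleles)
    # Both alleles must be ALT (index > 0), must differ, and must exist in ALT.
    return a != b and a > 0 and b > 0 and max(a, b) <= len(alts)


# Made-up example record: REF=A, ALT=C,G, genotype 1/2 on a single line.
example = "chr1\t1000\t.\tA\tC,G\t50\tPASS\t.\tGT:GQ\t1/2:40"
print(is_nonref_het(example))
```

A 0/1 or 1/1 genotype makes the function return `False`, so it selects only sites where neither allele matches the reference.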

As for CLR calling in DeepVariant: it is theoretically possible for us to make a model for DeepVariant that can call CLR data. However, this would require us to write special candidate generation logic to deal with the higher error rate. Based on what we perceive as the direction of future use in the genomics community, we think that the data generated will increasingly be HiFi, so we have not been able to give CLR models high priority. Feedback from users like yourself will be useful to us in evaluating whether that prioritization makes sense. For now, I can't commit to a timeframe in which we would support a PacBio CLR model.

jdmontenegro commented 3 years ago

Dear Andrew,

Thank you for your quick reply. I agree with you that most sequencing and resequencing projects will move towards HiFi reads rather than CLR reads. However, a lot of CLR sequencing data has been generated in the past couple of years and continues to be produced, and it could still be useful for groups without the means to resequence using the newer HiFi reads. So I definitely see a niche in the large part of the bioinformatics community that does a lot of data reuse (nowadays called "data parasites"). If there is anything we can do to help you in development, please feel free to let me know how we can collaborate.

Kind regards,

Juan D. Montenegro


AndrewCarroll commented 3 years ago

Hi @jdmontenegro

I am going to close this issue for now. I will make a note to send you a message if/when we can revisit the CLR model. Thank you for your perspective on what data the community has and what will be valuable to them.

starskyzheng commented 2 months ago

Hi @AndrewCarroll, any news about CLR PacBio reads? We recently ran into the same situation while processing published data that contains many CLR PacBio reads. Thanks a lot.