google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

How do you see the future of CNN outside of human genomics? #872

Open Axze-rgb opened 2 weeks ago

Axze-rgb commented 2 weeks ago

Hello,

some of you might remember me. I know DeepVariant works well in human and in some other species like rice, if I recall correctly. In short, all species with (very) low heterozygosity. I wonder if you see a use for DeepVariant in other species. There are marine species that are so ancient, diverse, and widespread that you can have 5% heterozygosity, in short, SNPs everywhere. In such cases, DeepVariant has a tendency to "ditch" calls apparently at random (Sample1 Chrom3:20456 called, Sample2 same position not called, despite obvious evidence from mapping and support from long reads). Probably because it didn't learn what to do with so many SNPs. You know the issue because of your mosquito blog post. And I have seen other issues (including mine) talking about that.

The issue is that to have a gold standard like in human, or trio data like in the mosquito, you need specific conditions. It seems difficult to imagine this being doable with, let's say, a deep-sea coral (just a random example, I don't actually know what their genome is like).

Could a synthetic dataset help here? What if we feed DeepVariant a genome we made up based on what we can observe visually? I am aware that if we make an error it will learn errors, but I wanted your opinion, because the lack of high-quality reference datasets for many species seems to be a serious limitation for this kind of program.
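To make the idea concrete, here is a minimal sketch of what a made-up truth set could look like. It is only an illustration under loud assumptions: `simulate_diploid` is a hypothetical helper, not part of DeepVariant; the 5% rate and the "each site is independently a SNP" model are my own simplifications; and it models no sequencing or mapping errors at all.

```python
import random

BASES = "ACGT"


def simulate_diploid(reference, het_rate=0.05, seed=0):
    """Return (hap1, hap2, truth) for a toy diploid genome.

    Hypothetical helper (an assumption, not a DeepVariant API): every
    reference position independently becomes a heterozygous SNP with
    probability het_rate. hap1 keeps the reference base, hap2 carries
    a different base at each SNP site.
    """
    rng = random.Random(seed)
    hap2 = []
    truth = []  # list of (1-based position, ref base, alt base)
    for i, ref_base in enumerate(reference):
        if rng.random() < het_rate:
            # Pick an alternate base that differs from the reference.
            alt = rng.choice([b for b in BASES if b != ref_base])
            hap2.append(alt)
            truth.append((i + 1, ref_base, alt))
        else:
            hap2.append(ref_base)
    return reference, "".join(hap2), truth


# Toy 10 kb random reference; a real test would start from assembled sequence.
ref_rng = random.Random(42)
reference = "".join(ref_rng.choice(BASES) for _ in range(10_000))

hap1, hap2, truth = simulate_diploid(reference, het_rate=0.05, seed=1)
print(len(truth), "heterozygous SNPs in", len(reference), "bp")
```

The `truth` list plays the role of the gold standard: reads would still have to be simulated from `hap1` and `hap2`, and that step is where any unrealistic error model would leak into training.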

Thanks a lot. Since it's not the first time I have brought this up, I understand if you would simply close this.

Have a good week everyone.

AndrewCarroll commented 2 weeks ago

Hi @Axze-rgb

It's a reasonable question. We're in the process of training some non-human models using trio data, and hopefully this experiment will both be positive and result in releasable models. It's still going to take some time, but it remains an important area and one we think about.

Axze-rgb commented 2 weeks ago

Thanks for that answer. I was also curious whether you thought synthetic data could help (as in, let's simulate a whole genome from fragments we manually analysed in a specific species with high SNP counts). I have no intention of doing such a thing, I am just curious to have your opinion on that kind of synthetic data for training. Sorry, it's really pure curiosity and not really a DeepVariant issue, but I thought it could be interesting to know. Thanks a lot!

EDIT: there is still quite a lot of skepticism around those technologies, at least where I am, and it's a bit tiring to always answer the question "yes, but are you sure your AI can count?". This is the origin of my question.

EDIT 2: and yes I always have to re-specify it's not MY AI

AndrewCarroll commented 1 week ago

Hi @Axze-rgb

To date, I've been hesitant to include simulated data, as I'm very confident that we don't understand all of the diverse and complex ways that errors happen well enough to model them. To a limited extent it might be possible to use simulated data to supplement training, but I'm fairly pessimistic about the approach.

For the skepticism component, one of the things we try to work on is investigating explainability - cases where the ML model has learned something that might not have been immediately obvious to a human. Hopefully we'll have some things to share there soon.