How do you see the future of CNN outside of human genomics?

ghost commented 3 months ago

Hello,

some of you might remember me. I know DeepVariant works well in human and in some other species like rice, if I recall correctly. In short, all species with (very) low heterozygosity. I wonder if you see a use for DeepVariant in other species. For example, there are marine species that are so ancient, diverse, and widespread that you can have 5% heterozygosity; in short, SNPs everywhere. In such cases, DeepVariant has a tendency to "ditch" calls, apparently at random (Sample1 Chrom3:20456 called, Sample2 same position not called, despite obvious evidence from mapping and support from long reads). Probably because it didn't learn what to do with so many SNPs. You know the issue because of your mosquito blog post. And I have seen other issues (including mine) talking about that.

The issue is that to have a gold standard like in human, or trio data like in the mosquito, you need specific conditions. It seems difficult to imagine this being doable with, let's say, a deep sea coral (just a random example, I don't actually know what their genome is like).

Could a synthetic dataset help here? What if we feed DeepVariant a genome we made up based on what we can observe visually? I am aware that if we make an error it will learn errors, but I wanted your opinion, because the lack of high-quality reference datasets for many species seems to be a serious limitation for this kind of program.

Thanks a lot. Since it's not the first time I bring this up, I understand if you would simply close this.

Have a good week everyone.

AndrewCarroll commented 3 months ago

Hi @Axze-rgb

It's a reasonable question. We're in the process of training some non-human models using trio data, and hopefully this experiment will both be positive and result in releasable models. It's still going to take some time, but it remains an important area and one we think about.

ghost commented 3 months ago

Thanks for that answer. I was also curious whether you thought synthetic data could help (as in, let's simulate a whole genome from fragments we manually analysed in a specific species with high SNP counts). I have no intent to do such a thing, I am just curious to have your opinion on that kind of synthetic data for training. Sorry, it's really pure curiosity and not really a DeepVariant issue, but I thought it could be interesting to know. Thanks a lot!

EDIT: there is still quite a lot of skepticism around those technologies, at least where I am, and it's a bit tiring to always answer the question "yes but are you sure your AI can count?". This is the origin of my question.

EDIT 2: and yes I always have to re-specify it's not MY AI

AndrewCarroll commented 2 months ago

Hi @Axze-rgb

To date, I've been hesitant to include simulated data, as I'm very confident that we don't understand all of the diverse and complex ways that errors happen well enough to model them. To a limited extent it might be possible to supplement training with it, but I'm fairly pessimistic about the approach.

For the skepticism component, one of the things we try to work on is investigating explainability - cases where the ML model has learned something that might not have been immediately obvious to a human. Hopefully we'll have some things to share there soon.

ghost commented 2 months ago

Sorry to answer so late. I am extremely happy to read this, including what you say about the complexity of the problem. How many times have I heard "just lower the gap penalty" or "just increase the score for a mismatch". While well-meaning, those suggestions didn't help me. I have arrived at a point where I am even doubting the validity of the whole field outside of human. Let's make an MA of Daphnia. Then let's ask another team to repeat exactly the same experiment. Are you confident you will get the same kind of value? I realise I am absolutely not sure... Sorry, I digress. But I am happy because you are a caller developer, which I am not, and you describe exactly the issues I have with the field. I think the complexity of variant calling is vastly underestimated. And many forget that just because a program keeps giving you the same answer doesn't mean it's correct. Anyway, on my side I also have a totally different approach in the pipeline and I would be very excited to compare it with yours. Cheers

google / deepvariant

How do you see the future of CNN outside of human genomics? #872