google / deepsomatic

DeepSomatic is an analysis pipeline that uses a deep neural network to call somatic variants from tumor-normal sequencing data.
BSD 3-Clause "New" or "Revised" License

Depth of coverage for WGS #6

Closed gevro closed 8 months ago

gevro commented 8 months ago

Hi, does performance depend on the depth of coverage of the WGS data? Is the model trained on a specific depth of coverage, and does deviating from it in test samples affect or bias performance?

pichuan commented 8 months ago

Hi @gevro, you're right: just like germline variant calling, performance will certainly depend on the coverage of the data.

Similar to DeepVariant, when making training examples with make_examples_somatic, we can set different downsample_fraction values to create additional examples at different coverages, which makes the model more robust.
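To make the idea of downsampling by a fraction concrete, here is a minimal Python sketch. It is not DeepSomatic's implementation. The helper names `downsample_reads` and `expected_coverage`, the 80x starting depth, and the list of fractions are all hypothetical and chosen only for illustration. It assumes each read is kept independently with probability equal to the fraction, so expected coverage scales linearly with it.

```python
import random


def downsample_reads(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`.

    Hypothetical helper for illustration only; not part of DeepSomatic.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # seeded so the subset is reproducible
    return [read for read in reads if rng.random() < fraction]


def expected_coverage(original_coverage, fraction):
    """Expected mean coverage after keeping `fraction` of the reads."""
    return original_coverage * fraction


if __name__ == "__main__":
    # Assumed example: one 80x sample reused at several simulated depths.
    for fraction in (1.0, 0.75, 0.5, 0.25):
        print(f"fraction={fraction:.2f} -> ~{expected_coverage(80, fraction):.0f}x")
```

Under these assumptions, a single deeply sequenced sample can stand in for several shallower ones when generating training examples.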

This is our first release, so our model may still have room to improve in robustness. If you notice something that doesn't work as well as expected, please let us know.

I'll close this issue for now. Feel free to follow up if you have some observations to share or things to add.

gevro commented 8 months ago

Thanks. What is the maximum depth you trained on?

pichuan commented 8 months ago

If you're asking about Illumina WGS data, our training data comes from https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/data/WGS/

You can read more about the data here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8532138/

AndrewCarroll commented 8 months ago

Hi @gevro

The model is trained on samples that have a range of coverages. The maximum pileup height for the tumor is 100, so average coverage above 100x probably won't help further.

We don't have a huge amount of titration data for high coverages yet. Based on prior experience, it might be a bit better to downsample the tumor to a maximum of ~90x coverage. I suspect this will make only a minor difference to accuracy, but it will make running the model quite a bit faster.

For the normal, I would downsample to at most ~50x. This is mostly just for speed; I think the model is less sensitive to coverage of the normal.
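As a rough illustration of the caps suggested above (~90x tumor, ~50x normal), the sketch below computes the fraction of reads to keep from a sample's mean coverage. The function name `downsample_fraction` and the example coverages (150x tumor, 40x normal) are assumptions made for illustration, not values from this thread.

```python
TUMOR_CAP_X = 90.0   # suggested tumor cap from the comment above
NORMAL_CAP_X = 50.0  # suggested normal cap from the comment above


def downsample_fraction(mean_coverage, cap):
    """Fraction of reads to keep so mean coverage does not exceed `cap`.

    Returns 1.0 when the sample is already at or below the cap.
    Hypothetical helper for illustration only.
    """
    if mean_coverage <= 0 or cap <= 0:
        raise ValueError("coverage values must be positive")
    return min(1.0, cap / mean_coverage)


if __name__ == "__main__":
    # Assumed example inputs: a 150x tumor and a 40x normal.
    tumor_fraction = downsample_fraction(150.0, TUMOR_CAP_X)
    normal_fraction = downsample_fraction(40.0, NORMAL_CAP_X)
    print(f"tumor: keep {tumor_fraction:.2f} of reads")
    print(f"normal: keep {normal_fraction:.2f} of reads")
```

The resulting fraction can then be passed to whichever read-subsampling tool you already use, for example as the fractional part of the `samtools view -s` argument.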

gevro commented 8 months ago

Thank you!