Suggestions for SV control database

pdiakumis commented 6 years ago

Hi Ryan,

We’ve been thinking about using STIX to generate a control database for querying somatic SV calls output by Manta (our preferred SV caller). Basic idea is to use ~100 germline BAM samples, put them through Excord, Giggle the output BEDPEs, create STIX db, then do a STIX query for the SVs of interest. This way we could see how many germline hits our SV query would get.
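To make the idea concrete, here is a minimal sketch of the "count germline hits" step. It is a toy model of the concept only, not how STIX or GIGGLE are implemented: the `Evidence` record, the function names, the sample IDs and the coordinates are all made up for illustration. It assumes each germline sample has already been reduced to BEDPE-like evidence records (0-based, half-open intervals).

```python
from typing import Dict, List, NamedTuple, Tuple

Window = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


class Evidence(NamedTuple):
    """One discordant pair or split read as a BEDPE-like record (hypothetical layout)."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int


def overlaps(chrom_a: str, start_a: int, end_a: int,
             chrom_b: str, start_b: int, end_b: int) -> bool:
    """True if two half-open intervals on the same chromosome share at least one base."""
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def germline_hits(panel: Dict[str, List[Evidence]],
                  left: Window, right: Window) -> Dict[str, int]:
    """Per sample, count records whose first end overlaps the left breakpoint
    window and whose second end overlaps the right breakpoint window."""
    return {
        sample: sum(
            1 for r in records
            if overlaps(r.chrom1, r.start1, r.end1, *left)
            and overlaps(r.chrom2, r.start2, r.end2, *right)
        )
        for sample, records in panel.items()
    }


def samples_with_support(hits: Dict[str, int], min_reads: int = 1) -> List[str]:
    """Samples with at least `min_reads` supporting records, sorted by name."""
    return sorted(s for s, n in hits.items() if n >= min_reads)


# Toy panel of three germline samples (invented data).
panel = {
    "normal_A": [Evidence("1", 1000, 1100, "1", 5000, 5100),
                 Evidence("1", 1020, 1120, "1", 5010, 5110)],
    "normal_B": [Evidence("1", 1050, 1150, "1", 9000, 9100)],  # right end is elsewhere
    "normal_C": [],
}
hits = germline_hits(panel, left=("1", 900, 1200), right=("1", 4900, 5200))
print(hits, samples_with_support(hits))
```

A somatic call supported by evidence in many germline samples would then be a candidate for filtering; where to set that threshold is something we'd have to work out empirically.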

However we’re wondering what sort of impact the Excord split/discordant read extraction algorithm would have with the Manta calls i.e. Manta might not use the same definition of a split/discordant read as Excord to infer a SV, so a Manta germline SV region might not make it into the final STIX db.

As an alternative, we thought of extracting split/discordant reads around the Manta germline SVs, creating the required BEDPE (in some way) and then going through the same steps as above.

Open to any suggestions. Also paging @ohofmann.

ryanlayer commented 6 years ago

On Mon, Jan 8, 2018 at 10:46 PM, Peter Diakumis notifications@github.com wrote:

Hi Ryan,

We’ve been thinking about using STIX to generate a control database for querying somatic SV calls output by Manta (our preferred SV caller). Basic idea is to use ~100 germline BAM samples, put them through Excord, Giggle the output BEDPEs, create STIX db, then do a STIX query for the SVs of interest. This way we could see how many germline hits our SV query would get.

Sounds good. This is the intended use of STIX.

However we’re wondering what sort of impact the Excord split/discordant read extraction algorithm would have with the Manta calls i.e. Manta might not use the same definition of a split/discordant read as Excord to infer a SV, so a Manta germline SV region might not make it into the final STIX db.

There are command line parameters for both excord and STIX that you can adjust, but these are intended to reflect the properties of the library and not the opinions of a caller. Maybe you can hack in your changes with these options.

I will say that Manta and all other paired-end-based callers have the same basic opinion of what constitutes paired-end evidence, and STIX follows these same principles. There are some small differences when it comes to split-read evidence, so STIX is conservative here.

My suggestion would be to try STIX as is.

As an alternative, we thought of extracting split/discordant reads around the Manta germline SVs, creating the required BEDPE (in some way) and then going through the same steps as above.

This would work as long as the file formats are the same.
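As a rough sketch of the coordinate side of that alternative: the snippet below turns one intra-chromosomal VCF record into a pair of BEDPE-style breakpoint windows from which reads could then be pulled. It assumes the record carries `END` and `SVTYPE` in INFO; the function names, the `slop` padding and the example record are hypothetical. It does not reproduce excord's output columns, which are not shown in this thread, so the result would still need to be mapped onto whatever layout the index expects.

```python
from typing import Dict, Tuple


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO string into a dict; flag-only keys map to an empty string."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def vcf_to_bedpe(line: str, slop: int = 100) -> Tuple[str, int, int, str, int, int, str]:
    """Build padded windows around both breakpoints of one SV record.

    VCF POS/END are 1-based; the returned starts are 0-based and the ends
    are exclusive, as in BEDPE.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    info = parse_info(fields[7])
    end = int(info["END"])
    start1, end1 = max(0, pos - 1 - slop), pos + slop
    start2, end2 = max(0, end - 1 - slop), end + slop
    return (chrom, start1, end1, chrom, start2, end2, info.get("SVTYPE", "."))


# Invented example record, not taken from real Manta output.
record = "1\t10000\tsv1\tN\t<DEL>\t.\tPASS\tEND=12000;SVTYPE=DEL;SVLEN=-2000"
print(vcf_to_bedpe(record))
```

Records without an `END` field (for example breakends) would need separate handling, which this sketch leaves out.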

Open to any suggestions. Also paging @ohofmann https://github.com/ohofmann.

-- Ryan Layer

ohofmann commented 6 years ago

Thanks Ryan. I was concerned that different definitions of split/discordant reads between Excord and the various bcbio SV callers (Manta / Lumpy / Gridss and Co) might result in false positive calls slipping through. We can test this when we load the germline samples and revisit.

Also paging @eldrid01 and @chapmanb to this thread. Matt, only took us two years to re-visit the 'panel of normals for SV' discussion. Ryan's work seemed like a perfect fit.

ryanlayer commented 6 years ago

On Jan 9, 2018, at 3:02 PM, Oliver Hofmann notifications@github.com wrote:

Thanks Ryan. I was concerned that different definitions of split/discordant reads between Excord and the various bcbio SV callers (Manta / Lumpy / Gridss and Co) might result in false positive calls slipping through. We can test this when we load the germline samples and revisit.

If we are missing some signal in the BAMs (besides read depth), please let us know. We may be able to modify excord to extract it.

Also paging @eldrid01 and @chapmanb to this thread. Matt, only took us two years to re-visit the 'panel of normals for SV' discussion. Ryan's work seemed like a perfect fit.

A panel of normals (or non-normals) is exactly what STIX was built to support.

chapmanb commented 6 years ago

Thanks for this great discussion. It would be great to have more formal sets of panels of normals, both for SV noise subtraction and for feeding into callers as background in tumor-only or single-germline cases. One thing I've wanted to have is a default panel you can use for both WGS and exome/capture when you don't have an existing background set. I realize this would be imperfect, especially for capture, where you'd ideally have the exact same prep/tech methods, but having this in place as a default could at least help the many cases we support where users don't have a big corpus of existing data and want to improve over a flat background/no filtering. I'm not sure what the best resources to get started would be, or how many samples we could try to put into such a background set, but I wanted to propose this as a way to build a community dataset on top of the great functionality in STIX/GIGGLE.

ohofmann commented 6 years ago

Could use the 1000G SV database for that; I am not sure we'd be able to disseminate that though - the summary information might be sufficiently removed from the genome sequence to allow for that. I don't think you'd have to subset the database itself as you'd only use it as a filter, and additional calls present that are not detectable in the exome/capture method wouldn't matter.

eldrid01 commented 6 years ago

Hi Oliver,

I've not tried to compile a set of blacklisted SVs from the 1000 Genomes variants but this would definitely be worth trying. The only downside would be if the 1000 Genomes project did a really good job of cleaning up SV calls that are basically common false positives due to alignment and assembly issues. Those are exactly the SVs we'd want to filter.

Not sure about dissemination. I had a quick look on their site and didn't come across anything that explicitly prevents this, and what we'd end up with would of course be derived from many individuals rather than calls linked specifically to any particular individual. Would need looking into.

Matt

On 10 January 2018 at 02:17, Oliver Hofmann notifications@github.com wrote:

Could use the 1000G SV database for that; I am not sure we'd be able to disseminate that though - the summary information might be sufficiently removed from the genome sequence to allow for that. I don't think you'd have to subset the database itself as you'd only use it as a filter, and additional calls present that are not detectable in the exome/capture method wouldn't matter.

ryanlayer / stix

Suggestions for SV control database #2