ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

Insertion of CNVs #62

Closed lukatop closed 1 month ago

lukatop commented 1 year ago

Hi, thanks for the great tool!

I saw the issue here about inserting CNVs, which says they should be provided in a VCF passed with the -v option. I tried that, but these variants are skipped because their ALT field contains non-ACGT characters.

reading input VCF...

found 247 valid variants in input vcf.
 * 0 variants skipped: (qual filtered / ref genotypes / invalid syntax)
 * 1 variants skipped due to multiple variants found per position
--------------------------------
reading EB0001...
20.439 (sec)
found 219 valid variants for EB0001 in input VCF...
7 variants skipped...
 - [0] ref allele does not match reference
 - [0] attempting to insert into N-region
 - [7] alt allele contains non-ACGT characters

I tried two different lines, with format of CNV:

EB0001 1087750 . C < CNV > . PASS SVTYPE=CNV;END=1090600;SVLEN=2850;Ploidy=1;DP=117 GT:DP:CN:AF 1:83:2:1
EB0001 1087750 . C CNV . PASS SVTYPE=CNV;END=1090600;SVLEN=2850;Ploidy=1;DP=117 GT:DP:CN:AF 1:83:2:1

In both cases it complains about non-ACGT characters in the ALT allele. Could you provide an example of a CNV line in VCF format that worked for you?
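For anyone hitting the same message, a quick pre-check of the input VCF can show which records are likely to be skipped. The sketch below is only an approximation based on the log text above: it assumes an ALT allele is accepted when it consists solely of A, C, G and T, and the function name `non_acgt_records` is made up for illustration. It is not taken from NEAT's source.

```python
import re

# Assumption: an ALT allele is usable only if it consists of A, C, G, T.
# This mirrors the wording of the log message, not NEAT's actual code.
ACGT_ONLY = re.compile(r"^[ACGT]+$")


def non_acgt_records(vcf_lines):
    """Return (chrom, pos, alt) for records whose ALT has non-ACGT characters."""
    flagged = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Skip blank lines and header lines (## metadata and #CHROM).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF data line
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        # Multi-allelic records list ALT alleles separated by commas.
        if any(not ACGT_ONLY.match(allele) for allele in alt.split(",")):
            flagged.append((chrom, pos, alt))
    return flagged
```

Symbolic alleles such as `<CNV>` are flagged by this check, which is consistent with the seven skipped variants reported in the log.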

joshfactorial commented 1 year ago

I think the only way to do this at this point would be to manually write out the CNV, like

EB0001 1087750 . C CCCCCCCCCCCCCCCCCCCCCCC . PASS ... etc

Is what you have the standard way of notating CNVs? If so, I can see if we can incorporate that into a future update.
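As a rough illustration of writing the CNV out by hand, the sketch below builds a VCF data line with explicit REF and ALT bases from a reference sequence. It rests on several assumptions that are not stated in this thread: the CNV is a copy-number gain modeled as a tandem duplication, POS is the anchor base immediately before the duplicated segment, END is the last duplicated base, and a minimal eight-column line is enough. The function name `expand_cnv_record` and its parameters are hypothetical.

```python
def expand_cnv_record(chrom, pos, end, extra_copies, ref_seq):
    """Spell a tandem duplication out as an explicit insertion.

    pos and end are 1-based and inclusive, as in VCF. ref_seq is the full
    contig sequence as a string. Assumption: the duplicated segment is
    bases pos+1..end, inserted extra_copies times after the anchor base.
    """
    if not (1 <= pos < end <= len(ref_seq)):
        raise ValueError("pos/end out of range for the reference sequence")
    if extra_copies < 1:
        raise ValueError("extra_copies must be at least 1")

    anchor = ref_seq[pos - 1].upper()   # base at POS, used as REF
    segment = ref_seq[pos:end].upper()  # bases POS+1..END
    if set(anchor + segment) - set("ACGT"):
        raise ValueError("region contains non-ACGT characters")

    # ALT = anchor base followed by the inserted copies of the segment.
    alt = anchor + segment * extra_copies
    return "\t".join([chrom, str(pos), ".", anchor, alt, ".", "PASS", "."])


# Toy example on a 10-base contig: duplicate bases 3..5 once.
print(expand_cnv_record("chrT", 2, 5, 1, "ACGTACGTAC"))
```

The ALT string grows with segment length times the number of extra copies, so the resulting lines become very long for large events.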

lukatop commented 1 year ago

Hi @joshfactorial, thanks for the fast response. I don't think this is a standard notation, since there are many ways to represent CNVs; I tried this one because it was the easiest for me to test. GATK uses a slightly different representation: image

Also, some tools like CNVkit or Control-FREEC output TSV files. Any way you add it will be fine. The current approach of writing out the sequence could be troublesome for larger CNVs (2 kbp or more). Best, Luka

joshfactorial commented 1 year ago

Sounds good. We’ll look into this more. We might be able to implement this as part of NEAT 4.1, but probably not in the 4.0 release.

-Josh

joshfactorial commented 1 year ago

We're investigating reading in CNVs from an input VCF for NEAT version 4.0

joshfactorial commented 1 month ago

Added to the backlog