ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
37 stars 12 forks source link

Skipped variants in VCF file #74

Closed mattbird567 closed 1 year ago

mattbird567 commented 1 year ago

I am trying to introduce variance via a VCF file but when i try and do so NEAT will skip a large proportion of these variants and i'm not quite sure why (although i think it might be syntax that NEAT isn't used to). I have attached the VCF in question to this post to help but in principle i assume that NEAT is skipping all those variants that include any GT field that isn't 1/1. This may not be the problem but just though i would ask incase its being filtered out for another reason.

`Warning: Found variants without a GT field, assuming heterozygous.. found 3614 valid variants in input vcf. 3997 variants skipped: (qual filtered / ref genotypes / invalid syntax) 0 variants skipped due to multiple variants found per position

reading AL123456.3 Mycobacterium tuberculosis H37Rv complete genome.. 3.196 (sec)

sampling reads..`

N0004.txt

joshfactorial commented 1 year ago

I can investigate further if you want, but I'm pretty sure at a glance that NEAT would just skip any variants where there were multiple options in the ALT column. I'm less sure why it didn't just label them as "variants skipped due to mulitple variants found per position" I think that NEAT does not recognize comma separated ALTs as multiple variants but is instead only looking for chr/pos/ref duplicate lines.

joshfactorial commented 1 year ago

That may be an item for the backlog, to at the very least correctly catalog what it's skipping.

joshfactorial commented 1 year ago

Okay, I've looked a little more at the code and a couple of things are confusing me about the output you're seeing so I want to dive it a bit more. First, I'd need to know which version of NEAT were you are using. The GT comment doesn't jive and I can't see a solid logical reason in the code why those should have been skipped (it has code to handle multiple alt alleles). I'll need to look at it closer.

mattbird567 commented 1 year ago

Thanks for the reply! I worked out that if I replaced the GT field with 1/1 it will run fine regardless of comma separated ALTs (it even detects the variants of the GT field is ./. ) However doing this means that the correct ALT isn't chosen. The GT field of for example e.g. 60/60 will choose the 60th ALT allele which won't occur if they are all changed to 1/1. I have since created a script that will alter the VCF file so comply with NEAT.

I am using NEAT version v3.3 as well.

joshfactorial commented 1 year ago

Is the reference of pretty reasonable size that you could email it to me?

From: Matthew Bird @.> Sent: Thursday, June 8, 2023 3:00 PM To: ncsa/NEAT @.> Cc: Allen, Josh @.>; Comment @.> Subject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74)

Thanks for the reply! I worked out that if I replaced the GT field with 1/1 it will run fine regardless of comma separated ALTs (it even detects the variants of the GT field is ./. ) However doing this means that the correct ALT isn't chosen. The GT field of for example e.g. 60/60 will choose the 60th ALT allele which won't occur if they are all changed to 1/1. I have since created a script that will alter the VCF file so comply with NEAT.

I am using NEAT version v3.3 as well.

— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https:/github.com/ncsa/NEAT/issues/74*issuecomment-1583260880__;Iw!!DZ3fjg!_SchjIxJ9eXmHYgTU_C-XdU5kiwW-B50_OJTZaVZqmRMsKllEeKlr7a2nA1Va3n_33PXBKPQSwnrHlAvfkmNh_G4YAvcGQ$, or unsubscribehttps://urldefense.com/v3/__https:/github.com/notifications/unsubscribe-auth/AGMI727ZI57YBWBZZAAAPNTXKIVKXANCNFSM6AAAAAAY67DCD4__;!!DZ3fjg!_SchjIxJ9eXmHYgTU_C-XdU5kiwW-B50_OJTZaVZqmRMsKllEeKlr7a2nA1Va3n_33PXBKPQSwnrHlAvfkmNh_HlTCDk-g$. You are receiving this because you commented.Message ID: @.***>

mattbird567 commented 1 year ago
I have zipped and attached the ref here. Sent from Mail for Windows From: Joshua AllenSent: 08 June 2023 21:06To: ncsa/NEATCc: Matthew Bird; AuthorSubject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74) Is the reference of pretty reasonable size that you could email it to me? From: Matthew Bird ***@***.***> Sent: Thursday, June 8, 2023 3:00 PM To: ncsa/NEAT ***@***.***> Cc: Allen, Josh ***@***.***>; Comment ***@***.***> Subject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74) Thanks for the reply! I worked out that if I replaced the GT field with 1/1 it will run fine regardless of comma separated ALTs (it even detects the variants of the GT field is ./. ) However doing this means that the correct ALT isn't chosen. The GT field of for example e.g. 60/60 will choose the 60th ALT allele which won't occur if they are all changed to 1/1. I have since created a script that will alter the VCF file so comply with NEAT. I am using NEAT version v3.3 as well. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: ***@***.***> —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: ***@***.***> 
joshfactorial commented 1 year ago

That seems to have failed. Try sending it directly to me: @.***

From: Matthew Bird @.> Sent: Thursday, June 8, 2023 3:15 PM To: ncsa/NEAT @.> Cc: Allen, Josh @.>; Comment @.> Subject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74)

I have zipped and attached the ref here. Sent from Mail for Windows From: Joshua AllenSent: 08 June 2023 21:06To: ncsa/NEATCc: Matthew Bird; AuthorSubject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74) Is the reference of pretty reasonable size that you could email it to me? From: Matthew Bird ***@***.***> Sent: Thursday, June 8, 2023 3:00 PM To: ncsa/NEAT ***@***.***> Cc: Allen, Josh ***@***.***>; Comment ***@***.***> Subject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74) Thanks for the reply! I worked out that if I replaced the GT field with 1/1 it will run fine regardless of comma separated ALTs (it even detects the variants of the GT field is ./. ) However doing this means that the correct ALT isn't chosen. The GT field of for example e.g. 60/60 will choose the 60th ALT allele which won't occur if they are all changed to 1/1. I have since created a script that will alter the VCF file so comply with NEAT. I am using NEAT version v3.3 as well. — Reply to this email directly, view it on GitHub —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: ***@***.***> — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: ***@***.***>
joshfactorial commented 1 year ago

It stripped my email, but it's on my profile.

joshfactorial commented 1 year ago

I added some subcategories to try to nail down what was getting filtered.

* 3997 variants skipped: (qual filtered / ref genotypes / invalid syntax)
    * 3629 variants filtered out by FILTER column
    * 0 variants lacked a genotype field
    * 368 variants had no associated allele
    * 0 variants had an invalid position

It looks to me like most of them because the FILTER column was anything other than PASS. A few were because there was no associated allele in the genotype. So like if the genotype was 0/0, or if there were 2 alt alleles and the genotype was 1/1, the second alt allele gets filtered out. There's an additional thing that can happen in rare cases, where when NEAT is writing the reads out it will sometimes skip a variant if it is too close to another.

joshfactorial commented 1 year ago

Note that the first one is one of several checks, it also looks to see if the REF == ALT and if there is missing data in a line. The filter is the most likely, but we can get finer resolution if that doesn't solve the problem.

joshfactorial commented 1 year ago

Just for giggles I did one further breakdown:

* 3997 variants skipped: (qual filtered / ref genotypes / invalid syntax)
    * 0 variants had malformed vcf lines
    * 0 variants had a missing allele or ref == alt
    * 3629 variants filtered by FILTER column
    * 0 variants lacked a genotype field
    * 368 variants had no associated allele
    * 0 variants had an invalid position
 * 0 variants skipped due to multiple variants found per position
mattbird567 commented 1 year ago

Fab! Thanks for taking a look at this, i ended up writing a script that would convert the VCF to a format that would work with NEAT v3.3 but will look at the changes you've made.