Closed mattbird567 closed 1 year ago
I can investigate further if you want, but I'm pretty sure at a glance that NEAT would just skip any variants where there were multiple options in the ALT column. I'm less sure why it didn't just label them as "variants skipped due to mulitple variants found per position" I think that NEAT does not recognize comma separated ALTs as multiple variants but is instead only looking for chr/pos/ref duplicate lines.
That may be an item for the backlog, to at the very least correctly catalog what it's skipping.
Okay, I've looked a little more at the code and a couple of things are confusing me about the output you're seeing so I want to dive it a bit more. First, I'd need to know which version of NEAT were you are using. The GT comment doesn't jive and I can't see a solid logical reason in the code why those should have been skipped (it has code to handle multiple alt alleles). I'll need to look at it closer.
Thanks for the reply! I worked out that if I replaced the GT field with 1/1 it will run fine regardless of comma separated ALTs (it even detects the variants of the GT field is ./. ) However doing this means that the correct ALT isn't chosen. The GT field of for example e.g. 60/60 will choose the 60th ALT allele which won't occur if they are all changed to 1/1. I have since created a script that will alter the VCF file so comply with NEAT.
I am using NEAT version v3.3 as well.
Is the reference of pretty reasonable size that you could email it to me?
From: Matthew Bird @.> Sent: Thursday, June 8, 2023 3:00 PM To: ncsa/NEAT @.> Cc: Allen, Josh @.>; Comment @.> Subject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74)
Thanks for the reply! I worked out that if I replaced the GT field with 1/1 it will run fine regardless of comma separated ALTs (it even detects the variants of the GT field is ./. ) However doing this means that the correct ALT isn't chosen. The GT field of for example e.g. 60/60 will choose the 60th ALT allele which won't occur if they are all changed to 1/1. I have since created a script that will alter the VCF file so comply with NEAT.
I am using NEAT version v3.3 as well.
— Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https:/github.com/ncsa/NEAT/issues/74*issuecomment-1583260880__;Iw!!DZ3fjg!_SchjIxJ9eXmHYgTU_C-XdU5kiwW-B50_OJTZaVZqmRMsKllEeKlr7a2nA1Va3n_33PXBKPQSwnrHlAvfkmNh_G4YAvcGQ$, or unsubscribehttps://urldefense.com/v3/__https:/github.com/notifications/unsubscribe-auth/AGMI727ZI57YBWBZZAAAPNTXKIVKXANCNFSM6AAAAAAY67DCD4__;!!DZ3fjg!_SchjIxJ9eXmHYgTU_C-XdU5kiwW-B50_OJTZaVZqmRMsKllEeKlr7a2nA1Va3n_33PXBKPQSwnrHlAvfkmNh_HlTCDk-g$. You are receiving this because you commented.Message ID: @.***>
That seems to have failed. Try sending it directly to me: @.***
From: Matthew Bird @.> Sent: Thursday, June 8, 2023 3:15 PM To: ncsa/NEAT @.> Cc: Allen, Josh @.>; Comment @.> Subject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74)
I have zipped and attached the ref here. Sent from Mail for Windows From: Joshua AllenSent: 08 June 2023 21:06To: ncsa/NEATCc: Matthew Bird; AuthorSubject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74) Is the reference of pretty reasonable size that you could email it to me? From: Matthew Bird ***@***.***> Sent: Thursday, June 8, 2023 3:00 PM To: ncsa/NEAT ***@***.***> Cc: Allen, Josh ***@***.***>; Comment ***@***.***> Subject: Re: [ncsa/NEAT] Skipped variants in VCF file (Issue #74) Thanks for the reply! I worked out that if I replaced the GT field with 1/1 it will run fine regardless of comma separated ALTs (it even detects the variants of the GT field is ./. ) However doing this means that the correct ALT isn't chosen. The GT field of for example e.g. 60/60 will choose the 60th ALT allele which won't occur if they are all changed to 1/1. I have since created a script that will alter the VCF file so comply with NEAT. I am using NEAT version v3.3 as well. — Reply to this email directly, view it on GitHubIt stripped my email, but it's on my profile.
I added some subcategories to try to nail down what was getting filtered.
* 3997 variants skipped: (qual filtered / ref genotypes / invalid syntax)
* 3629 variants filtered out by FILTER column
* 0 variants lacked a genotype field
* 368 variants had no associated allele
* 0 variants had an invalid position
It looks to me like most of them because the FILTER column was anything other than PASS. A few were because there was no associated allele in the genotype. So like if the genotype was 0/0, or if there were 2 alt alleles and the genotype was 1/1, the second alt allele gets filtered out. There's an additional thing that can happen in rare cases, where when NEAT is writing the reads out it will sometimes skip a variant if it is too close to another.
Note that the first one is one of several checks, it also looks to see if the REF == ALT and if there is missing data in a line. The filter is the most likely, but we can get finer resolution if that doesn't solve the problem.
Just for giggles I did one further breakdown:
* 3997 variants skipped: (qual filtered / ref genotypes / invalid syntax)
* 0 variants had malformed vcf lines
* 0 variants had a missing allele or ref == alt
* 3629 variants filtered by FILTER column
* 0 variants lacked a genotype field
* 368 variants had no associated allele
* 0 variants had an invalid position
* 0 variants skipped due to multiple variants found per position
Fab! Thanks for taking a look at this, i ended up writing a script that would convert the VCF to a format that would work with NEAT v3.3 but will look at the changes you've made.
I am trying to introduce variance via a VCF file but when i try and do so NEAT will skip a large proportion of these variants and i'm not quite sure why (although i think it might be syntax that NEAT isn't used to). I have attached the VCF in question to this post to help but in principle i assume that NEAT is skipping all those variants that include any GT field that isn't 1/1. This may not be the problem but just though i would ask incase its being filtered out for another reason.
`Warning: Found variants without a GT field, assuming heterozygous.. found 3614 valid variants in input vcf. 3997 variants skipped: (qual filtered / ref genotypes / invalid syntax) 0 variants skipped due to multiple variants found per position
reading AL123456.3 Mycobacterium tuberculosis H37Rv complete genome.. 3.196 (sec)
sampling reads..`
N0004.txt