ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

Add support for multi-allelic variants within input VCF #58

Closed blajoie closed 1 year ago

blajoie commented 2 years ago

Hi!

We've noticed that multi-allelic VCF entries are not supported within NEAT.

NEAT seems to always choose the first ALT in the VCF entry: https://github.com/ncsa/NEAT/blob/21f0f917540d73cf2d1d9eee964e8918ea680dcf/source/SequenceContainer.py#L448

Curious about your thoughts on whether this could ultimately be supported?

For example, here only a C will be inserted, whereas ideally (with ploidy=2) a 1/2 (C/G) genotype would be inserted at chr1:156736767, wholly replacing the A ref.

chr1 156736767 . A C,G 50 PASS platforms=5;platformnames=Illumina,PacBio,CG,10X,Solid;datasets=7;datasetnames=HiSeqPE300x,CCS15kb_20kb,CGnormal,HiSeq250x250,10XChromiumLR,HiSeqMatePair,SolidSE75bp;callsets=11;callsetnames=HiSeqPE300xSentieon,CCS15kb_20kbDV,CCS15kb_20kbGATK4,CGnormal,HiSeqPE300xfreebayes,HiSeq250x250Sentieon,10XLRGATK,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes,SolidSE75GATKHC;datasetsmissingcall=IonExome;callable=CS_HiSeqPE300xSentieon_callable,CS_CCS15kb_20kbDV_callable,CS_10XLRGATK_callable,CS_CCS15kb_20kbGATK4_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable,CS_HiSeq250x250Sentieon_callable GT:PS:DP:ADALL:AD:GQ 1/2:.:1144:0,181,158:51,294,258:546

Granted, this may become a bit tricky given that users can also choose a ploidy.
Assuming ploidy matches max(#alts) in a VCF, could multi-allelic support be formally added?
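
To make the request concrete, here is a minimal sketch of the desired behavior. It is not NEAT's code: the function name `alleles_per_haplotype` is hypothetical, the record is a shortened copy of the entry above (INFO replaced with `.`), and it assumes a single sample column whose GT has one allele index per haplotype.

```python
from typing import List, Optional


def alleles_per_haplotype(vcf_line: str, ploidy: int = 2) -> List[Optional[str]]:
    """Return the allele carried by each haplotype for one VCF record.

    Hypothetical helper for illustration only; reads the first sample column.
    A missing call (".") is returned as None.
    """
    fields = vcf_line.split()
    ref, alt_field = fields[3], fields[4]
    # Allele index 0 is REF; indices 1..n are the comma-separated ALTs.
    alleles = [ref] + alt_field.split(",")

    # Locate GT inside the FORMAT column, then read it from the sample column.
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    genotype = sample_values[format_keys.index("GT")]

    # Accept both unphased ("/") and phased ("|") separators.
    indices = genotype.replace("|", "/").split("/")
    if len(indices) != ploidy:
        raise ValueError(
            f"GT {genotype!r} has {len(indices)} alleles but ploidy is {ploidy}"
        )
    return [None if i == "." else alleles[int(i)] for i in indices]


record = "chr1 156736767 . A C,G 50 PASS . GT:PS:DP 1/2:.:1144"
print(alleles_per_haplotype(record))
```

With the 1/2 genotype, one haplotype would carry C and the other G. Taking only the first ALT would instead give C everywhere. The ploidy check reflects the concern above: a GT with a different number of alleles than the chosen ploidy needs an explicit policy, and this sketch simply rejects it.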

Thoughts? cc @ajaltomare

joshfactorial commented 2 years ago

Correct, that was the old behavior. We're currently working on version 4 that will support multiallelic variants.

blajoie commented 2 years ago

Thanks @joshfactorial - any insight into when v4 may be released? Is there something we could look at now (and/or help with)? cc @ajaltomare

joshfactorial commented 2 years ago

We're shooting for the end of the summer. Most of my work is on the feature/parallelization branch, which at this point is misnamed, since I'm delaying actual parallelization to finish improving the core functions. The current status is that it will successfully produce a VCF and a FASTA file. We're adding BAM creation, and the main work now is the actual generation of reads. I'm hoping to have something up and running by the end of the month. We could probably use some user testing at that point.

blajoie commented 2 years ago

Thanks @joshfactorial - we are indeed happy to test and push PRs/fixes once the v4 code has stabilized. We have added a few fixes/improvements to v3.2.xxx, but perhaps it is best to wait and refocus on v4.

Will check back in at the end of this month! Also feel free to email me at bryan.lajoie at elembio.com. Would be happy to meet sometime to better understand the v4 vision and/or discuss how we could be of help to the effort!

blajoie commented 1 year ago

Hi @joshfactorial - checking in re. v4 update/status. Anything to share on that front? Happy to help with testing and/or dev work to get this across the line!

joshfactorial commented 1 year ago

We released a 4.0 beta under releases. There are instructions on how to use it on the develop branch. We're still finalizing the BAM creation section and some of the utilities. Feel free to take a look and offer any suggestions.