ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
37 stars 12 forks source link

Inserting New variants using VCF file not inserting the SNP/INDELS #68

Closed vinaydeep26 closed 1 year ago

vinaydeep26 commented 1 year ago

Thank you for this amazing tool. I'm trying to simulate a Whole exome Sequencing data from a Mouse fasta file. I want to insert certain SNPs/Indels using the -v filename.vcf option In order to check the working of the code I created a toy Fasta file (littleFasta.fa) which just contains 300 bases. just looks like that in its entirety.

chr11 CCCTGCCAGGGCTGCTGGTGATTCTCCACATCCTTAGGCTCCGCGGTGCTTACCTTCAGG ACTCTCCAGTTGTAACCCCTTTGTTGGGATGCCTGGGAGCCAGACAAGGTCACCCCATTT TTTAAGAGAGGACGAAGGTGAGAGGGAGACTACAATGAAAAGGTTGGGAGGGGCCCCAGG CATGGCCCCTGTGTGTGGAAAACACAGGTGACCACCGGCACCCAGACTGTCTACACTATG CCTCCAGAAGGCACTTTGCCTAGCAACAGGCCTGACCATGCAGCGCTGGTCCAATCTCTC I use a toy BED file (babyBed.bed) in order to specify the exonic regions for targeted sequencing.

chr11 0000000 0000100 NR_003517.2 0.000000 + mm39_ncbiRefSeq exon . gene_id "NR_003517.2"; transcript_id "NR_003517.2"; chr11 0000200 0000300 NR_003517.2 0.000000 + mm39_ncbiRefSeq exon . gene_id "NR_003517.2"; transcript_id "NR_003517.2";

the VCF file for variant specifying is called spec4.vcf and looks like:

fileformat=VCFv4.2

fileDate=20090805

source=myImputationProgramV3.1

reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

phasing=partial

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

FILTER=

FILTER=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001

11 1 . C A 100 PASS . GT 0/1 11 30 . A T 100 PASS . GT 1/1 11 60 . G T 100 PASS . GT 1/1 11 240 . G A,T 100 PASS . GT 1/2

The variants get passed but when I see the reads the bases that are supposed to change. For example C > A the Reads don't actually have the substitution.

I used this command to run the simulation.: python gen_reads.py -r littleFasta.fa -R 20 -c 1 --discard-offtarget -o outputfile -v spec.vcf -M 0 -tr babyBed.bed

Screenshot (138)

as you can see the read Spanning the Third variant G > T which is the last base in the first line of littleFasta.fa I have no reads that have that variant inserted into the read. What am I doing wrong here? Can anyone please help me out? thank you

vinaydeep26 commented 1 year ago

@joshfactorial

joshfactorial commented 1 year ago

I will take a look at this.

vinaydeep26 commented 1 year ago

Screenshot (147) Screenshot (148)

I try to increase the coverage too. So out of the output reads I get, I try to find the short sequence before it the corresponding VCF variant (EG: on position 60 there should be a T instead of a G) to see if the program did the job of inserting variants at the specific position but it doesn't seem to be the case. what am I doing differently? @joshfactorial

vinaydeep26 commented 1 year ago

Thank you for the Solution @joshfactorial . The problem was that in my vcf file the chromosome names are just "11" whereas in my VCF file its >chr11. so by changing the 11s in my VCf file to "chr11"s , I accurately see the indels as well as the SNPS. Thank you for the Amazing support and quick response.