nh13 / DWGSIM

Whole Genome Simulator for Next-Generation Sequencing
GNU General Public License v2.0

Issues creating variants with a VCF #44

Closed nh13 closed 7 years ago

nh13 commented 7 years ago

Hi, Thank you very much for the tool you created.

I'm trying to use the option of supplying a VCF file. I used a standard VCF file, and I also simplified it to match your example. Unfortunately, when I give it a VCF file (in my case about 70k variants), for some of the variants I get a weird error: "Error: contig not found [48997]". After looking into it, I found the problem was in one of the lines: "chr19 15848997 rs140063881 G GT 50 PASS AF=1.0;pl=3", which seems like a very standard line/variant. I deleted it and then got a similar error for the next variant; I deleted that one as well, and again... I suspect it might be a memory issue, since I tried taking these specific variants and putting them in a smaller file, and it worked.

Attached is the problematic file, and this is the command I use:

./dwgsim -v exmp2.vcf ../hg19/chr19.fa chr19_sim

Thanks, Yaron

nh13 commented 7 years ago

Look at the ##contig lines in your VCF; they make no mention of chromosome 19. What was the command you used to create the VCF (presumably from dwgsim)? Did you also use ../hg19/chr19.fa as your reference?
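
The check suggested here, comparing the contig names in the VCF against the sequence names in the reference, can be sketched in Python. This is only a rough sketch and not part of dwgsim: the function names are hypothetical, and it assumes an uncompressed VCF and a FASTA whose sequence name is the first whitespace-delimited token after `>`.

```python
import re


def fasta_names(fasta_lines):
    """Return the set of sequence names found in FASTA header lines."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def vcf_contigs(vcf_lines):
    """Return (names declared in ##contig headers, names used in the CHROM column)."""
    declared, used = set(), set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##contig="):
            # Pull the ID=... value out of a line such as ##contig=<ID=name,...>
            match = re.search(r"[<,]ID=([^,>]+)", line)
            if match:
                declared.add(match.group(1))
        elif line and not line.startswith("#"):
            # Data records: the first tab-separated field is the contig name.
            used.add(line.split("\t", 1)[0])
    return declared, used


def report_mismatches(fasta_lines, vcf_lines):
    """Summarise contig names that do not line up between the two files."""
    reference = fasta_names(fasta_lines)
    declared, used = vcf_contigs(vcf_lines)
    return {
        "declared_not_in_reference": sorted(declared - reference),
        "used_not_declared": sorted(used - declared),
        "used_not_in_reference": sorted(used - reference),
    }


# Example use with the files named in this thread:
#   with open("../hg19/chr19.fa") as fa, open("exmp2.vcf") as vcf:
#       print(report_mismatches(fa, vcf))
```

Three empty lists mean the names agree; any non-empty list points at a naming mismatch worth fixing before looking further.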

yeinhorn commented 7 years ago

I added the "chr19" contig to the header, and now I get the error "Error: contig not found [PASS]". Yes, I gave it only the chr19 reference, as I'm testing only against chr19 variants. I also tried it with the whole reference genome, with all the chromosomes, but I still get the same error (and again, if I take a smaller number of variants, it does work). I used a standard VCF file taken from GIAB and extracted only the chr19 variants, but I got an error similar to what's described here. So I removed the INFO columns and substituted the INFO column and the header supplied in the dwgsim example files, to make sure the error wasn't caused by some wrong info. The header now looks like this:

fileformat=VCFv4.1

contig=

INFO=

INFO=

Attached is the file exmp.vcf, which doesn't work. The same file without the last 500 variants, exmp_short.vcf, does work. The original file without the first ~500 variants (but with the last 500 variants), exmp_no_first_lines.vcf, also works.

exmp_no_first_lines.vcf.zip exmp_short.vcf.zip

exmp.vcf.zip
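
The manual trimming described above (dropping the first or the last ~500 variants) amounts to bisecting the record list while keeping the header intact. A minimal Python sketch of that step follows; the function names are hypothetical, and it assumes an uncompressed VCF read line by line.

```python
def split_vcf(vcf_lines):
    """Separate header lines (starting with '#') from variant records."""
    header, records = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
        elif line.strip():
            records.append(line)
    return header, records


def halves(vcf_lines):
    """Return two VCFs (as lists of lines), each with the full header
    and one half of the records."""
    header, records = split_vcf(vcf_lines)
    middle = len(records) // 2
    return header + records[:middle], header + records[middle:]


# Example use: write both halves, run each through the simulator,
# and repeat on whichever half still fails.
#   with open("exmp.vcf") as handle:
#       first, second = halves(handle)
#   with open("first_half.vcf", "w") as out:
#       out.writelines(first)
#   with open("second_half.vcf", "w") as out:
#       out.writelines(second)
```

If both halves succeed while the full file fails, as reported here, the failure depends on the number of records rather than on any single variant.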

nh13 commented 7 years ago

Rather than trying to modify the VCF to make it work, can you tell me how you created the VCF in the first place? I suspect with dwgsim using some custom reference? The contig lines in the example VCF do not match chr19. Let me know how you created the VCF.

yeinhorn commented 7 years ago

I took it from GIAB for NA12878 and added "chr" at the beginning of the chromosome name (#CHROM column) using awk.
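
The awk command itself was not posted. As a rough Python equivalent of the renaming step, the sketch below adds the prefix to the record lines and to the `##contig` header lines together, so the two stay in sync. The function name is hypothetical, and it assumes `ID=` is the first key inside each `##contig=<...>` line.

```python
def add_chr_prefix(vcf_lines, prefix="chr"):
    """Yield VCF lines with `prefix` added to contig names that lack it."""
    contig_tag = "##contig=<ID="
    for line in vcf_lines:
        if line.startswith(contig_tag):
            # Header: rename the declared contig unless already prefixed.
            name_start = len(contig_tag)
            if not line.startswith(prefix, name_start):
                line = line[:name_start] + prefix + line[name_start:]
        elif line.strip() and not line.startswith("#"):
            # Record: the contig name is the first field on the line.
            if not line.startswith(prefix):
                line = prefix + line
        yield line


# Example use (input and output file names are placeholders):
#   with open("input.vcf") as src, open("output.vcf", "w") as dst:
#       dst.writelines(add_chr_prefix(src))
```

Renaming only the record lines leaves the header declaring contigs that no record uses, which is the kind of mismatch discussed earlier in this thread.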

nh13 commented 7 years ago

You definitely need the proper contig lines. Try simulating a few reads without an input VCF and look at the format of the output VCF.

yeinhorn commented 7 years ago

Thanks, but yes, I changed it to have proper contig lines. I'm attaching the VCF header of the exmp.vcf file again:

fileformat=VCFv4.1

contig=

INFO=

INFO=

CHROM POS ID REF ALT QUAL FILTER INFO

And this is the header of the output VCF when running without an input variant file:

fileformat=VCFv4.1

contig=

INFO=

INFO=

INFO=

They look identical, so I don't think this is the issue. I also tried using the output VCF as an input, and again I got a similar error: "Warning: strand of the mutation not found; please use the 'pl' tag. Error: contig not found [RT]"

I think there's some memory allocation problem or something similar when supplying a variant file as an input.
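
Before settling on a memory problem, it may be worth ruling out malformed records in the input. The sketch below is not a diagnosis of this issue. It only checks two basics of the VCF layout: each record has at least eight tab-separated fields, and POS is an integer. The function name is hypothetical, and it assumes an uncompressed file.

```python
def malformed_records(vcf_lines, min_fields=8):
    """Return (line_number, reason) pairs for records that break basic VCF layout."""
    problems = []
    for number, line in enumerate(vcf_lines, start=1):
        text = line.rstrip("\r\n")
        if not text or text.startswith("#"):
            continue  # skip blank lines and header lines
        fields = text.split("\t")
        if len(fields) < min_fields:
            problems.append(
                (number,
                 f"expected at least {min_fields} tab-separated fields, "
                 f"found {len(fields)}")
            )
        elif not fields[1].isdigit():
            problems.append((number, f"POS is not an integer: {fields[1]!r}"))
    return problems


# Example use with the file from this thread:
#   with open("exmp.vcf") as handle:
#       for number, reason in malformed_records(handle):
#           print(number, reason)
```

An empty result means the records pass these two checks, which would leave the parser rather than the input as the more likely place to look.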

nh13 commented 7 years ago

Ok, I'll take a look, but it won't be for a few weeks as I am taking time off for the birth of my second kid. My apologies for the delay.

yeinhorn commented 7 years ago

Okay, thanks for the help and the update, and congratulations :)