pcingola / SnpSift

Unable to pass SnpEff output to SnpSift #75

Closed (rhdolin closed 2 years ago)

rhdolin commented 2 years ago

Hi, I'm not sure what I'm doing wrong. Starting with a 1000 Genomes VCF file, I run

java -jar snpEff.jar -v GRCh37.p13 .\HG00403.MNV.WES.GRCh37.vcf.gz > HG00403.MNV.snfEff.vcf

I then try to run the output through SnpSift

java -jar SnpSift.jar annotate .\gnomad.exomes.r2.1.1.sites.vcf.bgz -info "AF" .\HG00403.MNV.snfEff.vcf > .\HG00403.MNV.snfEff.gnomAD.vcf

but I get these errors:

VcfFileIterator.parseVcfLine(133): Fatal error reading file '.\HG00403.MNV.snfEff.vcf' (line: 1):
ÿ?##fileformat=VCFv4.1
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Impropper VCF entry: Not enough fields (missing tab separators?).
ÿ?##fileformat=VCFv4.1
    at org.snpeff.fileIterator.VcfFileIterator.parseVcfLine(VcfFileIterator.java:134)
    at org.snpeff.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:185)
    at org.snpeff.fileIterator.VcfFileIterator.readNext(VcfFileIterator.java:58)
    at org.snpeff.fileIterator.FileIterator.hasNext(FileIterator.java:123)
    at org.snpsift.SnpSiftCmdAnnotate.annotate(SnpSiftCmdAnnotate.java:77)
    at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:410)
    at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:397)
    at org.snpsift.SnpSift.run(SnpSift.java:580)
    at org.snpsift.SnpSift.main(SnpSift.java:76)
Caused by: java.lang.RuntimeException: Impropper VCF entry: Not enough fields (missing tab separators?).
ÿ?##fileformat=VCFv4.1
    at org.snpeff.vcf.VcfEntry.parse(VcfEntry.java:1035)
    at org.snpeff.vcf.VcfEntry.<init>(VcfEntry.java:247)
    at org.snpeff.fileIterator.VcfFileIterator.parseVcfLine(VcfFileIterator.java:131)
    ... 8 more

I've also tried to bgzip and tabix index the output from snpEff, and I get these errors:

[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: "#"
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: ""
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: ""
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: ""
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: ""
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: ""
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used? The offending line was: ""
[E::hts_idx_push] Unsorted positions on sequence #1: 9 followed by 1
tbx_index_build failed: HG00403.MNV.snpEff.vcf.gz

Can anyone see what the issue is? Thanks!

rhdolin commented 2 years ago

See the Biostars discussion (https://www.biostars.org/p/9530170/). Apparently the VCF output of snpEff was UTF-16 encoded, and it needs to be converted to ASCII or UTF-8 for further processing. (I was running SnpEff version 5.1d, build 2022-04-19 15:49, on a Windows machine.)
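
For anyone hitting the same error, here is a minimal Python sketch of the conversion described above. It is not part of SnpEff or SnpSift; the function names `is_utf16` and `convert_utf16_to_utf8` are made up for illustration. It assumes the file begins with a UTF-16 byte-order mark, which is what the stray `ÿ?` characters in front of `##fileformat` suggest. One possible cause, which the thread does not confirm, is a shell redirect such as Windows PowerShell's `>`, which writes UTF-16 LE by default.

```python
import codecs

# Byte-order marks that identify UTF-16 text.
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def is_utf16(path):
    """Return True if the file starts with a UTF-16 byte-order mark."""
    with open(path, "rb") as handle:
        return handle.read(2) in UTF16_BOMS


def convert_utf16_to_utf8(src, dst):
    """Re-encode a UTF-16 text file as UTF-8 without a BOM, using LF line endings."""
    # The "utf-16" codec uses the BOM to pick the byte order, then drops it.
    # newline=None on the reader turns \r\n into \n; newline="\n" on the
    # writer keeps it that way, so the result has Unix line endings.
    with open(src, "r", encoding="utf-16", newline=None) as fin, \
         open(dst, "w", encoding="utf-8", newline="\n") as fout:
        for line in fin:  # stream line by line; VCF files can be large
            fout.write(line)
```

A possible use would be to call `is_utf16("HG00403.MNV.snfEff.vcf")` and, if it returns True, `convert_utf16_to_utf8("HG00403.MNV.snfEff.vcf", "HG00403.MNV.snfEff.utf8.vcf")` (the output name is just an example), then pass the converted file to SnpSift or bgzip/tabix.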

pcingola commented 2 years ago

Thank you for solving the error. Yes, the VCF input file (for SnpEff) should not be UTF-16.