Closed marissa97 closed 1 year ago
The two file formats are not worlds apart, but it depends. Don't know if your GTF contains all the necessary information, see the documentation to check the required GFF structure here:
The gtf structure is quite different, there is no Parent-structure in ucsc-gtf. I tried to used ncbi gff3 now, with a slightly different format, it still does not work. I used ncbi gff3 from following link: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.25/ (Download > Refseq only > gff)
Please check if you are working with the latest version of bcftools and if you think there is a problem, please provide a small VCF test case to reproduce the problem. I believe the program should be able to parse NCBI's GFF.
I am using bcftools 1.10.2 Here is a sub-set of a test file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 9440
chr1 977330 rs2799066 T C 5202.06 . AC=2;AF=1;AN=2;BaseQRankSum=-1.607;DB;DP=167;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=31.34;ReadPosRankSum=-0.523;SOR=0.609 GT:AD:DP:GQ:PL:VF:GQX 1/1:1,165:166:99:5216,455,0:0.993976:99
chr1 981931 rs2465128 A G 8099.06 . AC=2;AF=1;AN=2;DB;DP=247;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=33.61;SOR=0.843 GT:AD:DP:GQ:PL:VF:GQX 1/1:0,241:241:99:8113,715,0:1:99
chr1 982994 rs10267 T C 4726.06 . AC=2;AF=1;AN=2;DB;DP=140;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=33.76;SOR=0.844 GT:AD:DP:GQ:PL:VF:GQX 1/1:0,140:140:99:4740,420,0:1:99
chr1 984302 rs9442391 T C 4414.06 . AC=2;AF=1;AN=2;DB;DP=157;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=30.23;SOR=0.693 GT:AD:DP:GQ:PL:VF:GQX 1/1:0,146:152:99:4428,437,0:0.960526:99
chr1 987200 rs9803031 C T 4439.06 . AC=2;AF=1;AN=2;DB;DP=131;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=34.15;SOR=0.966 GT:AD:DP:GQ:PL:VF:GQX 1/1:0,130:130:99:4453,390,0:1:99
chr1 990280 rs4275402 C T 7236.06 . AC=2;AF=1;AN=2;DB;DP=193;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=25.36;SOR=1.931 GT:AD:DP:GQ:PL:VF:GQX 1/1:0,190:190:99:7250,570,0:1:99
chr1 1269554 rs307377 T C 9079.06 . AC=2;AF=1;AN=2;BaseQRankSum=1.19;DB;DP=302;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=32.2;ReadPosRankSum=0.948;SOR=0.878 GT:AD:DP:GQ:PL:VF:GQX 1/1:2,280:282:99:9093,797,0:0.992908:99
chr1 1635004 rs874516 T C 2781.64 . AC=1;AF=0.5;AN=2;BaseQRankSum=-0.193;DB;DP=205;ExcessHet=3.0103;FS=1.714;MLEAC=1;MLEAF=0.5;MQ=55.75;MQRankSum=-6.805;QD=13.64;ReadPosRankSum=-0.231;SOR=0.849 GT:AD:DP:GQ:PL:VF:GQX 0/1:96,108:204:99:2789,0,2663:0.529412:99
chr1 1635749 rs1136997 C A 3878.64 . AC=1;AF=0.5;AN=2;BaseQRankSum=2.34;DB;DP=234;ExcessHet=3.0103;FS=6.52;MLEAC=1;MLEAF=0.5;MQ=54.07;MQRankSum=-4.721;QD=17.24;ReadPosRankSum=0.361;SOR=1.086 GT:AD:DP:GQ:PL:VF:GQX 0/1:95,130:225:99:3886,0,2369:0.577778:99
chr1 1636044 rs1059822 G A 3076.64 . AC=1;AF=0.5;AN=2;BaseQRankSum=0.93;DB;DP=221;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=50.77;MQRankSum=-8.235;QD=14.18;ReadPosRankSum=-0.035;SOR=0.71 GT:AD:DP:GQ:PL:VF:GQX 0/1:108,109:217:99:3084,0,2912:0.502304:99
I used following code:
bcftools csq -f hg19GenRef.fa -g genomic_chr.gff3 test.vcf -Ov -o testInfo.vcf -pR
This is the logs:
Parsing genomic_chr.gff3 ...
ignored: chr1 RefSeq region 1 249250621 . + . ID=NC_000001.10:1..249250621;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
ignored: chr1 BestRefSeq pseudogene 11874 14409 . + .ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1 (pseudogene);gbkey=Gene;gene=DDX11L1;gene_biotype=transcribed_pseudogene;pseudo=true
ignored: chr1 BestRefSeq transcript 11874 14409 . + .ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
[csq.c:729 gff_id_parse] Could not parse the line, "Parent=transcript:" not present: chr1 BestRefSeq exon 11874 12227 . + . ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
and no output ..
You'll need to update to the latest version 1.17. There was an update to GFF parsing code which, I tried, works with https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.25/
I updated the bcftools to 1.17 now, but still has not work. I also used the same gff, but I added "chr" in the annotation. Could it be the problem? How did you do it with the test data?
It still has not result any outputs. This is now the logs:
(base) marissa@moldiag:~$ bcftools csq -f /home/marissa/NGA_Pipeline/parameterFile/hg19GenRef.fa -g /home/marissa/Downloads/genomic_chr.gff3 /home/marissa/66111_S1/66111_S1.noGI.vcf -Ov -o /home/marissa/66111_S1/66111_S1.testInfo.vcf -pR
Parsing /home/marissa/Downloads/genomic_chr.gff3 ...
ignored: chr1 RefSeq region 1 249250621 . + . ID=NC_000001.10:1..249250621;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
ignored: chr1 BestRefSeq pseudogene 11874 14409 . + .ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1 (pseudogene);gbkey=Gene;gene=DDX11L1;gene_biotype=transcribed_pseudogene;pseudo=true
ignored transcript, unknown biotype: chr1 BestRefSeq transcript 11874 14409 . + . ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
Warning: non-standard gene Parent notation in the GFF, expected "Parent=transcript:XXX", found chr1 BestRefSeq exon 11874 12227 . + .ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
ignored: chr1 BestRefSeq pseudogene 14362 29370 . - .ID=gene-WASH7P;Dbxref=GeneID:653635,HGNC:HGNC:38034;Name=WASH7P;description=WASP family homolog 7%2C pseudogene;gbkey=Gene;gene=WASH7P;gene_biotype=transcribed_pseudogene;gene_synonym=FAM39F,WASH5P;pseudo=true
ignored transcript, unknown biotype: chr1 BestRefSeq transcript 14362 29370 . - . ID=rna-NR_024540.1;Parent=gene-WASH7P;Dbxref=GeneID:653635,Genbank:NR_024540.1,HGNC:HGNC:38034;Name=NR_024540.1;gbkey=misc_RNA;gene=WASH7P;product=WASP family homolog 7%2C pseudogene;pseudo=true;transcript_id=NR_024540.1
Warning: non-standard gene ID notation in the GFF, expected "ID=gene:XXX", found chr1 BestRefSeq gene 17369 17436 . - . ID=gene-MIR6859-1;Dbxref=GeneID:102466751,HGNC:HGNC:50039,miRBase:MI0022705;Name=MIR6859-1;description=microRNA 6859-1;gbkey=Gene;gene=MIR6859-1;gene_biotype=miRNA;gene_synonym=hsa-mir-6859-1
ignored: chr1 BestRefSeq primary_transcript 17369 17436 . -. ID=rna-NR_106918.1;Parent=gene-MIR6859-1;Dbxref=GeneID:102466751,Genbank:NR_106918.1,HGNC:HGNC:50039,miRBase:MI0022705;Name=NR_106918.1;gbkey=precursor_RNA;gene=MIR6859-1;product=microRNA 6859-1;transcript_id=NR_106918.1
ignored: chr1 BestRefSeq miRNA 17369 17391 . - . ID=rna-MIR6859-1;Parent=rna-NR_106918.1;Dbxref=GeneID:102466751,miRBase:MIMAT0027619,HGNC:HGNC:50039,miRBase:MI0022705;gbkey=ncRNA;gene=MIR6859-1;product=hsa-miR-6859-3p
ignored: chr1 BestRefSeq miRNA 17409 17431 . - . ID=rna-MIR6859-1-2;Parent=rna-NR_106918.1;Dbxref=GeneID:102466751,miRBase:MIMAT0027618,HGNC:HGNC:50039,miRBase:MI0022705;gbkey=ncRNA;gene=MIR6859-1;product=hsa-miR-6859-5p
ignored: chr1 BestRefSeq primary_transcript 30366 30503 . +. ID=rna-NR_036051.1;Parent=gene-MIR1302-2;Dbxref=GeneID:100302278,Genbank:NR_036051.1,HGNC:HGNC:35294,miRBase:MI0006363;Name=NR_036051.1;gbkey=precursor_RNA;gene=MIR1302-2;product=microRNA 1302-2;transcript_id=NR_036051.1
ignored: chr1 BestRefSeq miRNA 30438 30458 . + . ID=rna-MIR1302-2;Parent=rna-NR_036051.1;Dbxref=GeneID:100302278,miRBase:MIMAT0005890,HGNC:HGNC:35294,miRBase:MI0006363;gbkey=ncRNA;gene=MIR1302-2;product=hsa-miR-1302
ignored: chr1 BestRefSeq lnc_RNA 34611 36081 . - . ID=rna-NR_026818.1;Parent=gene-FAM138A;Dbxref=GeneID:645520,Genbank:NR_026818.1,HGNC:HGNC:32334;Name=NR_026818.1;gbkey=ncRNA;gene=FAM138A;product=family with sequence similarity 138 member A;transcript_id=NR_026818.1
ignored: chr1 Curated Genomic pseudogene 52453 53396 . + .ID=gene-OR4G4P;Dbxref=GeneID:79504,HGNC:HGNC:14822;Name=OR4G4P;description=olfactory receptor family 4 subfamily G member 4 pseudogene;gbkey=Gene;gene=OR4G4P;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 63016 63885 . + .ID=gene-OR4G11P;Dbxref=GeneID:403263,HGNC:HGNC:31276;Name=OR4G11P;description=olfactory receptor family 4 subfamily G member 11 pseudogene;gbkey=Gene;gene=OR4G11P;gene_biotype=pseudogene;pseudo=true
ignored: chr1 BestRefSeq mRNA 65419 71585 . + . ID=rna-NM_001005484.2;Parent=gene-OR4F5;Dbxref=GeneID:79501,Genbank:NM_001005484.2,HGNC:HGNC:14825;Name=NM_001005484.2;gbkey=mRNA;gene=OR4F5;product=olfactory receptor family 4 subfamily F member 5;tag=RefSeq Select;transcript_id=NM_001005484.2
ignored: chr1 Curated Genomic pseudogene 126642 129225 . - .ID=gene-SEPTIN14P18;Dbxref=GeneID:107126284,HGNC:HGNC:51705;Name=SEPTIN14P18;description=septin 14 pseudogene 18;gbkey=Gene;gene=SEPTIN14P18;gene_biotype=pseudogene;gene_synonym=SEPT14P18;pseudo=true
ignored: chr1 Curated Genomic pseudogene 131068 134836 . + .ID=gene-CICP27;Dbxref=GeneID:100420257,HGNC:HGNC:48835;Name=CICP27;description=capicua transcriptional repressor pseudogene 27;gbkey=Gene;gene=CICP27;gene_biotype=pseudogene;pseudo=true
ignored: chr1 BestRefSeq lnc_RNA 134773 140566 . - . ID=rna-NR_039983.2;Parent=gene-LOC729737;Dbxref=GeneID:729737,Genbank:NR_039983.2;Name=NR_039983.2;gbkey=ncRNA;gene=LOC729737;product=uncharacterized LOC729737;transcript_id=NR_039983.2
ignored: chr1 Curated Genomic pseudogene 157784 157887 . - .ID=gene-RNU6-1100P;Dbxref=GeneID:106480049,HGNC:HGNC:48063;Name=RNU6-1100P;description=RNA%2C U6 small nuclear 1100%2C pseudogene;gbkey=Gene;gene=RNU6-1100P;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 164689 164792 . + .ID=gene-GTF2IP10;Dbxref=GeneID:107403151,HGNC:HGNC:51722;Name=GTF2IP10;description=general transcription factor IIi pseudogene 10;gbkey=Gene;gene=GTF2IP10;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 228262 228787 . - .ID=gene-RPL23AP21;Dbxref=GeneID:728481,HGNC:HGNC:35827;Name=RPL23AP21;description=ribosomal protein L23a pseudogene 21;gbkey=Gene;gene=RPL23AP21;gene_biotype=pseudogene;gene_synonym=RPL23A_1_1;pseudo=true
ignored: chr1 BestRefSeq lnc_RNA 323892 328581 . + . ID=rna-NR_028322.1;Parent=gene-LOC100132287;Dbxref=GeneID:100132287,Genbank:NR_028322.1;Name=NR_028322.1;gbkey=ncRNA;gene=LOC100132287;product=uncharacterized LOC100132287;transcript_id=NR_028322.1
ignored: chr1 Curated Genomic pseudogene 328518 332282 . - .ID=gene-CICP7;Dbxref=GeneID:100288667,HGNC:HGNC:37756;Name=CICP7;description=capicua transcriptional repressor pseudogene 7;gbkey=Gene;gene=CICP7;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 334127 334504 . + .ID=gene-SEPTIN14P13;Dbxref=GeneID:106481733,HGNC:HGNC:51700;Name=SEPTIN14P13;description=septin 14 pseudogene 13;gbkey=Gene;gene=SEPTIN14P13;gene_biotype=pseudogene;gene_synonym=SEPT14P13;pseudo=true
ignored: chr1 BestRefSeq mRNA 367659 368597 . + . ID=rna-NM_001005221.2;Parent=gene-OR4F29;Dbxref=GeneID:729759,Genbank:NM_001005221.2,HGNC:HGNC:31275;Name=NM_001005221.2;gbkey=mRNA;gene=OR4F29;product=olfactory receptor family 4 subfamily F member 29;tag=RefSeq Select;transcript_id=NM_001005221.2
ignored: chr1 Curated Genomic pseudogene 379067 379573 . - .ID=gene-WBP1LP7;Dbxref=GeneID:106479030,HGNC:HGNC:43955;Name=WBP1LP7;description=WBP1L pseudogene 7;gbkey=Gene;gene=WBP1LP7;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 470896 471368 . + .ID=gene-RPL23AP24;Dbxref=GeneID:728517,HGNC:HGNC:36176;Name=RPL23AP24;description=ribosomal protein L23a pseudogene 24;gbkey=Gene;gene=RPL23AP24;gene_biotype=pseudogene;gene_synonym=RPL23A_1_2;pseudo=true
ignored: chr1 BestRefSeq lnc_RNA 562760 564389 . - . ID=rna-NR_125957.1;Parent=gene-LOC101928626;Dbxref=GeneID:101928626,Genbank:NR_125957.1;Name=NR_125957.1;gbkey=ncRNA;gene=LOC101928626;product=uncharacterized LOC101928626;transcript_id=NR_125957.1
ignored: chr1 Curated Genomic pseudogene 564442 564820 . + .ID=gene-MTND1P23;Dbxref=GeneID:100887749,HGNC:HGNC:42092;Name=MTND1P23;description=MT-ND1 pseudogene 23;gbkey=Gene;gene=MTND1P23;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 565020 566066 . + .ID=gene-MTND2P28;Dbxref=GeneID:100652939,HGNC:HGNC:42129;Name=MTND2P28;description=MT-ND2 pseudogene 28;gbkey=Gene;gene=MTND2P28;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 566454 567996 . + .ID=gene-MTCO1P12;Dbxref=GeneID:107075141,HGNC:HGNC:52014;Name=MTCO1P12;description=MT-CO1 pseudogene 12;gbkey=Gene;gene=MTCO1P12;gene_biotype=pseudogene;pseudo=true
ignored: chr1 RefSeqFE silencer 567489 567783 . . .ID=id-GeneID:121967041;Dbxref=GeneID:121967041;Note=tiled region #5852%3B K562 Repressive DNase matched - State 24:Quies;experiment=EXISTENCE:reporter gene assay evidence [ECO:0000049][PMID:27701403];function=represses a minimal TATA promoter and a strong SV40 promoter by Sharpr-MPRA in K562 cells {active_cell/tissue: K562};gbkey=regulatory;regulatory_class=silencer
ignored: chr1 RefSeqFE biological_region 567489 567783 . +. ID=id-GeneID:121967041-2;Dbxref=GeneID:121967041;Name=biological region;gbkey=Region;standard_name=Sharpr-MPRA regulatory region 5852
ignored: chr1 BestRefSeq primary_transcript 567995 568065 . -. ID=rna-NR_162149.1;Parent=gene-MIR12136;Dbxref=GeneID:113219467,Genbank:NR_162149.1,HGNC:HGNC:54043,miRBase:MI0039740;Name=NR_162149.1;gbkey=precursor_RNA;gene=MIR12136;product=microRNA 12136;transcript_id=NR_162149.1
ignored: chr1 BestRefSeq miRNA 568048 568065 . - . ID=rna-MIR12136;Parent=rna-NR_162149.1;Dbxref=GeneID:113219467,miRBase:MIMAT0049032,HGNC:HGNC:54043,miRBase:MI0039740;gbkey=ncRNA;gene=MIR12136;product=hsa-miR-12136
ignored: chr1 Curated Genomic pseudogene 568137 568818 . + .ID=gene-MTCO2P12;Dbxref=GeneID:107075310,HGNC:HGNC:52028;Name=MTCO2P12;description=MT-CO2 pseudogene 12;gbkey=Gene;gene=MTCO2P12;gene_biotype=pseudogene;gene_synonym=COII,COX2,COXII,MT-CO2,MTCO2;pseudo=true
ignored: chr1 Curated Genomic pseudogene 568915 569121 . + .ID=gene-MTATP8P1;Dbxref=GeneID:106480795,HGNC:HGNC:44571;Name=MTATP8P1;description=MT-ATP8 pseudogene 1;gbkey=Gene;gene=MTATP8P1;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 569076 569756 . + .ID=gene-MTATP6P1;Dbxref=GeneID:106480796,HGNC:HGNC:44575;Name=MTATP6P1;description=MT-ATP6 pseudogene 1;gbkey=Gene;gene=MTATP6P1;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 569756 570302 . + .ID=gene-MTCO3P12;Dbxref=GeneID:107075270,HGNC:HGNC:52042;Name=MTCO3P12;description=MT-CO3 pseudogene 12;gbkey=Gene;gene=MTCO3P12;gene_biotype=pseudogene;pseudo=true
ignored: chr1 Curated Genomic pseudogene 610123 610626 . + .ID=gene-WBP1LP6;Dbxref=GeneID:106481797,HGNC:HGNC:43956;Name=WBP1LP6;description=WBP1L pseudogene 6;gbkey=Gene;gene=WBP1LP6;gene_biotype=pseudogene;pseudo=true
ignored: chr1 BestRefSeq mRNA 621096 622034 . - . ID=rna-NM_001005277.1;Parent=gene-OR4F16;Dbxref=GeneID:81399,Genbank:NM_001005277.1,HGNC:HGNC:15079;Name=NM_001005277.1;gbkey=mRNA;gene=OR4F16;product=olfactory receptor family 4 subfamily F member 16;tag=RefSeq Select;transcript_id=NM_001005277.1
ignored: chr1 Curated Genomic pseudogene 653001 655582 . - .ID=gene-SEPTIN14P14;Dbxref=GeneID:107105354,HGNC:HGNC:51701;Name=SEPTIN14P14;description=septin 14 pseudogene 14;gbkey=Gene;gene=SEPTIN14P14;gene_biotype=pseudogene;gene_synonym=SEPT14P14;pseudo=true
ignored: chr1 Curated Genomic pseudogene 657426 661202 . + .ID=gene-CICP3;Dbxref=GeneID:100132630,HGNC:HGNC:37742;Name=CICP3;description=capicua transcriptional repressor pseudogene 3;gbkey=Gene;gene=CICP3;gene_biotype=pseudogene;pseudo=true
ignored: chr1 BestRefSeq lnc_RNA 661139 714014 . - . ID=rna-NR_168328.1;Parent=gene-LOC100288069;Dbxref=GeneID:100288069,Genbank:NR_168328.1;Name=NR_168328.1;gbkey=ncRNA;gene=LOC100288069;product=uncharacterized LOC100288069;transcript_id=NR_168328.1
ignored: chr1 Curated Genomic pseudogene 693613 693716 . - .ID=gene-RNU6-1199P;Dbxref=GeneID:106480091,HGNC:HGNC:48162;Name=RNU6-1199P;description=RNA%2C U6 small nuclear 1199%2C pseudogene;gbkey=Gene;gene=RNU6-1199P;gene_biotype=pseudogene;pseudo=true
ignored: chr1 BestRefSeq lnc_RNA 752751 755214 . + . ID=rna-NR_103536.1;Parent=gene-FAM87B;Dbxref=GeneID:400728,Genbank:NR_103536.1,HGNC:HGNC:32236;Name=NR_103536.1;gbkey=ncRNA;gene=FAM87B;product=family with sequence similarity 87 member B;transcript_id=NR_103536.1
ignored: chr1 BestRefSeq lnc_RNA 761586 762902 . - . ID=rna-NR_024321.1;Parent=gene-LINC00115;Dbxref=GeneID:79854,Genbank:NR_024321.1,HGNC:HGNC:26211;Name=NR_024321.1;gbkey=ncRNA;gene=LINC00115;product=long intergenic non-protein coding RNA 115;transcript_id=NR_024321.1
ignored: chr1 BestRefSeq lnc_RNA 762971 794826 . + . ID=rna-NR_047519.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047519.1,HGNC:HGNC:49377;Name=NR_047519.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 1;transcript_id=NR_047519.1
ignored: chr1 BestRefSeq lnc_RNA 762971 794826 . + . ID=rna-NR_047521.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047521.1,HGNC:HGNC:49377;Name=NR_047521.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 3;transcript_id=NR_047521.1
ignored: chr1 BestRefSeq lnc_RNA 762971 794826 . + . ID=rna-NR_047523.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047523.1,HGNC:HGNC:49377;Name=NR_047523.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 5;transcript_id=NR_047523.1
ignored: chr1 BestRefSeq lnc_RNA 762971 794826 . + . ID=rna-NR_047524.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047524.1,HGNC:HGNC:49377;Name=NR_047524.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 7;transcript_id=NR_047524.1
ignored: chr1 BestRefSeq lnc_RNA 762971 778984 . + . ID=rna-NR_047526.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047526.1,HGNC:HGNC:49377;Name=NR_047526.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 9;transcript_id=NR_047526.1
ignored: chr1 BestRefSeq lnc_RNA 763178 794826 . + . ID=rna-NR_047525.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047525.1,HGNC:HGNC:49377;Name=NR_047525.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 8;transcript_id=NR_047525.1
ignored: chr1 BestRefSeq lnc_RNA 803451 812182 . - . ID=rna-NR_027055.1;Parent=gene-FAM41C;Dbxref=GeneID:284593,Genbank:NR_027055.1,HGNC:HGNC:27635;Name=NR_027055.1;gbkey=ncRNA;gene=FAM41C;product=family with sequence similarity 41 member C;transcript_id=NR_027055.1
ignored: chr1 Curated Genomic pseudogene 808671 809894 . + .ID=gene-TUBB8P11;Dbxref=GeneID:388579,HGNC:HGNC:32337;Name=TUBB8P11;description=tubulin beta 8 class VIII pseudogene 11;gbkey=Gene;gene=TUBB8P11;gene_biotype=pseudogene;pseudo=true
ignored: chr1 BestRefSeq lnc_RNA 852198 855072 . - . ID=rna-NR_122045.1;Parent=gene-LINC02593;Dbxref=GeneID:100130417,Genbank:NR_122045.1,HGNC:HGNC:53933;Name=NR_122045.1;gbkey=ncRNA;gene=LINC02593;product=long intergenic non-protein coding RNA 2593%2C transcript variant 2;transcript_id=NR_122045.1
ignored: chr1 BestRefSeq lnc_RNA 852198 855072 . - . ID=rna-NR_026874.2;Parent=gene-LINC02593;Dbxref=GeneID:100130417,Genbank:NR_026874.2,HGNC:HGNC:53933;Name=NR_026874.2;gbkey=ncRNA;gene=LINC02593;product=long intergenic non-protein coding RNA 2593%2C transcript variant 1;transcript_id=NR_026874.2
ignored: chr1 BestRefSeq lnc_RNA 855623 860984 . - . ID=rna-NR_168405.1;Parent=gene-LOC107985728;Dbxref=GeneID:107985728,Genbank:NR_168405.1;Name=NR_168405.1;gbkey=ncRNA;gene=LOC107985728;product=collagen alpha-1(III) chain-like;transcript_id=NR_168405.1
ignored: chr1 BestRefSeq mRNA 859303 879954 . + . ID=rna-NM_001385640.1;Parent=gene-SAMD11;Dbxref=GeneID:148398,Genbank:NM_001385640.1,HGNC:HGNC:28706,MIM:616765;Name=NM_001385640.1;gbkey=mRNA;gene=SAMD11;product=sterile alpha motif domain containing 11%2C transcript variant 2;transcript_id=NM_001385640.1
ignored: chr1 BestRefSeq mRNA 859303 879954 . + . ID=rna-NM_001385641.1;Parent=gene-SAMD11;Dbxref=GeneID:148398,Genbank:NM_001385641.1,HGNC:HGNC:28706,MIM:616765;Name=NM_001385641.1;gbkey=mRNA;gene=SAMD11;product=sterile alpha motif domain containing 11%2C transcript variant 1;tag=RefSeq Select;transcript_id=NM_001385641.1
ignored: chr1 BestRefSeq mRNA 861111 879954 . + . ID=rna-NM_152486.4;Parent=gene-SAMD11;Dbxref=GeneID:148398,Genbank:NM_152486.4,HGNC:HGNC:28706,MIM:616765;Name=NM_152486.4;gbkey=mRNA;gene=SAMD11;product=sterile alpha motif domain containing 11%2C transcript variant 3;transcript_id=NM_152486.4
ignored: chr1 BestRefSeq mRNA 879583 894636 . - . ID=rna-NM_015658.4;Parent=gene-NOC2L;Dbxref=GeneID:26155,Genbank:NM_015658.4,HGNC:HGNC:24517,MIM:610770;Name=NM_015658.4;gbkey=mRNA;gene=NOC2L;product=NOC2 like nucleolar associated transcriptional repressor;tag=RefSeq Select;transcript_id=NM_015658.4
ignored: chr1 BestRefSeq mRNA 895964 901099 . + . ID=rna-NM_198317.3;Parent=gene-KLHL17;Dbxref=GeneID:339451,Genbank:NM_198317.3,HGNC:HGNC:24023,MIM:619262;Name=NM_198317.3;gbkey=mRNA;gene=KLHL17;product=kelch like family member 17;tag=RefSeq Select;transcript_id=NM_198317.3
ignored: chr1 BestRefSeq mRNA 901862 911245 . + . ID=rna-NM_001367552.1;Parent=gene-PLEKHN1;Dbxref=GeneID:84069,Genbank:NM_001367552.1,HGNC:HGNC:25284;Name=NM_001367552.1;gbkey=mRNA;gene=PLEKHN1;product=pleckstrin homology domain containing N1%2C transcript variant 3;transcript_id=NM_001367552.1
ignored: chr1 BestRefSeq mRNA 901862 911245 . + . ID=rna-NM_001160184.2;Parent=gene-PLEKHN1;Dbxref=GeneID:84069,Genbank:NM_001160184.2,HGNC:HGNC:25284;Name=NM_001160184.2;gbkey=mRNA;gene=PLEKHN1;product=pleckstrin homology domain containing N1%2C transcript variant 2;transcript_id=NM_001160184.2
ignored: chr1 BestRefSeq mRNA 901862 911245 . + . ID=rna-NM_032129.3;Parent=gene-PLEKHN1;Dbxref=GeneID:84069,Genbank:NM_032129.3,HGNC:HGNC:25284;Name=NM_032129.3;gbkey=mRNA;gene=PLEKHN1;product=pleckstrin homology domain containing N1%2C transcript variant 1;tag=RefSeq Select;transcript_id=NM_032129.3
ignored: chr1 BestRefSeq mRNA 910578 917473 . - . ID=rna-NM_001291366.2;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001291366.2,HGNC:HGNC:28208,MIM:615921;Name=NM_001291366.2;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 1;transcript_id=NM_001291366.2
ignored: chr1 BestRefSeq mRNA 910578 917473 . - . ID=rna-NM_001291367.2;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001291367.2,HGNC:HGNC:28208,MIM:615921;Name=NM_001291367.2;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 2;transcript_id=NM_001291367.2
ignored: chr1 BestRefSeq mRNA 910578 917473 . - . ID=rna-NM_001394713.1;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001394713.1,HGNC:HGNC:28208,MIM:615921;Name=NM_001394713.1;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 5;tag=RefSeq Select;transcript_id=NM_001394713.1
ignored: chr1 BestRefSeq mRNA 910578 917473 . - . ID=rna-NM_001369897.1;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001369897.1,HGNC:HGNC:28208,MIM:615921;Name=NM_001369897.1;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 3;transcript_id=NM_001369897.1
ignored: chr1 BestRefSeq mRNA 910578 916553 . - . ID=rna-NM_001369898.1;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001369898.1,HGNC:HGNC:28208,MIM:615921;Name=NM_001369898.1;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 4;transcript_id=NM_001369898.1
ignored: chr1 BestRefSeq mRNA 934344 935477 . - . ID=rna-NM_021170.4;Parent=gene-HES4;Dbxref=GeneID:57801,Genbank:NM_021170.4,HGNC:HGNC:24149,MIM:608060;Name=NM_021170.4;gbkey=mRNA;gene=HES4;product=hes family bHLH transcription factor 4%2C transcript variant 2;tag=RefSeq Select;transcript_id=NM_021170.4
ignored: chr1 BestRefSeq mRNA 934344 935477 . - . ID=rna-NM_001142467.2;Parent=gene-HES4;Dbxref=GeneID:57801,Genbank:NM_001142467.2,HGNC:HGNC:24149,MIM:608060;Name=NM_001142467.2;gbkey=mRNA;gene=HES4;product=hes family bHLH transcription factor 4%2C transcript variant 1;transcript_id=NM_001142467.2
ignored: chr1 Curated Genomic pseudogene 943440 943670 . - .ID=gene-RPL39P12;Dbxref=GeneID:100270829,HGNC:HGNC:36166;Name=RPL39P12;description=ribosomal protein L39 pseudogene 12;gbkey=Gene;gene=RPL39P12;gene_biotype=pseudogene;gene_synonym=RPL39_1_3;pseudo=true
ignored: chr1 BestRefSeq mRNA 948877 949920 . + . ID=rna-NM_005101.4;Parent=gene-ISG15;Dbxref=GeneID:9636,Genbank:NM_005101.4,HGNC:HGNC:4053,MIM:147571;Name=NM_005101.4;gbkey=mRNA;gene=ISG15;product=ISG15 ubiquitin like modifier;tag=RefSeq Select;transcript_id=NM_005101.4
ignored: chr1 BestRefSeq mRNA 955500 991496 . + . ID=rna-NM_001305275.2;Parent=gene-AGRN;Dbxref=GeneID:375790,Genbank:NM_001305275.2,HGNC:HGNC:329,MIM:103320;Name=NM_001305275.2;gbkey=mRNA;gene=AGRN;product=agrin%2C transcript variant 1;transcript_id=NM_001305275.2
ignored: chr1 BestRefSeq mRNA 955500 991496 . + . ID=rna-NM_198576.4;Parent=gene-AGRN;Dbxref=GeneID:375790,Genbank:NM_198576.4,HGNC:HGNC:329,MIM:103320;Name=NM_198576.4;gbkey=mRNA;gene=AGRN;product=agrin%2C transcript variant 2;tag=RefSeq Select;transcript_id=NM_198576.4
ignored: chr1 BestRefSeq mRNA 969373 991496 . + . ID=rna-NM_001364727.2;Parent=gene-AGRN;Dbxref=GeneID:375790,Genbank:NM_001364727.2,HGNC:HGNC:329,MIM:103320;Name=NM_001364727.2;gbkey=mRNA;gene=AGRN;product=agrin%2C transcript variant 3;transcript_id=NM_001364727.2
ignored: chr1 BestRefSeq lnc_RNA 995114 1001833 . + . ID=rna-NR_148960.1;Parent=gene-LOC100288175;Dbxref=GeneID:100288175,Genbank:NR_148960.1;Name=NR_148960.1;gbkey=ncRNA;gene=LOC100288175;product=uncharacterized LOC100288175;transcript_id=NR_148960.1
ignored: chr1 BestRefSeq lnc_RNA 1001015 1004717 . - . ID=rna-NR_168432.1;Parent=gene-LOC105378948;Dbxref=GeneID:105378948,Genbank:NR_168432.1;Name=NR_168432.1;gbkey=ncRNA;gene=LOC105378948;product=uncharacterized LOC105378948%2C transcript variant 1;transcript_id=NR_168432.1
ignored: chr1 BestRefSeq lnc_RNA 1001015 1004717 . - . ID=rna-NR_168433.1;Parent=gene-LOC105378948;Dbxref=GeneID:105378948,Genbank:NR_168433.1;Name=NR_168433.1;gbkey=ncRNA;gene=LOC105378948;product=uncharacterized LOC105378948%2C transcript variant 2;transcript_id=NR_168433.1
ignored: chr1 RefSeqFE enhancer 1004592 1004706 . . .ID=id-GeneID:106783496;Dbxref=GeneID:106783496;experiment=EXISTENCE:reporter gene assay evidence [ECO:0000049][PMID:17135569];function=enhancer in Jurkat T cells;gbkey=regulatory;regulatory_class=enhancer
ignored: chr1 RefSeqFE biological_region 1004592 1004706 . +. ID=id-GeneID:106783496-2;Dbxref=GeneID:106783496;Name=biological region;gbkey=Region;standard_name=conserved acetylation island sequence 30 enhancer
ignored: chr1 BestRefSeq mRNA 1006347 1009686 . - . ID=rna-NM_001205252.2;Parent=gene-RNF223;Dbxref=GeneID:401934,Genbank:NM_001205252.2,HGNC:HGNC:40020;Name=NM_001205252.2;gbkey=mRNA;gene=RNF223;product=ring finger protein 223;tag=RefSeq Select;transcript_id=NM_001205252.2
ignored: chr1 BestRefSeq mRNA 1017203 1051469 . - . ID=rna-NM_017891.5;Parent=gene-C1orf159;Dbxref=GeneID:54991,Genbank:NM_017891.5,HGNC:HGNC:26062;Name=NM_017891.5;gbkey=mRNA;gene=C1orf159;product=chromosome 1 open reading frame 159%2C transcript variant 2;tag=RefSeq Select;transcript_id=NM_017891.5
ignored: chr1 BestRefSeq mRNA 1017203 1051469 . - . ID=rna-NM_001363525.2;Parent=gene-C1orf159;Dbxref=GeneID:54991,Genbank:NM_001363525.2,HGNC:HGNC:26062;Name=NM_001363525.2;gbkey=mRNA;gene=C1orf159;product=chromosome 1 open reading frame 159%2C transcript variant 3;transcript_id=NM_001363525.2
ignored: chr1 BestRefSeq mRNA 1017203 1051469 . - . ID=rna-NM_001330306.2;Parent=gene-C1orf159;Dbxref=GeneID:54991,Genbank:NM_001330306.2,HGNC:HGNC:26062;Name=NM_001330306.2;gbkey=mRNA;gene=C1orf159;product=chromosome 1 open reading frame 159%2C transcript variant 1;transcript_id=NM_001330306.2
ignored: chr1 RefSeqFE enhancer 1047889 1048183 . . .ID=id-GeneID:112577469;Dbxref=GeneID:112577469;Note=tiled region #13538%3B HepG2 Activating non-DNase unmatched - State 18:Pol2%3B and K562 Activating DNase matched - State 14:Gen5';experiment=EXISTENCE:reporter gene assay evidence [ECO:0000049][PMID:27701403];function=activates a minimal TATA promoter and a strong SV40 promoter by Sharpr-MPRA in HepG2 and K562 cells;gbkey=regulatory;regulatory_class=enhancer
ignored: chr1 RefSeqFE biological_region 1047889 1048183 . +. ID=id-GeneID:112577469-2;Dbxref=GeneID:112577469;Name=biological region;gbkey=Region;standard_name=Sharpr-MPRA regulatory region 13538
Unknown strand: . .. . . ID=id-GeneID:115801415;Dbxref=GeneID:115801415;Note=candidate enhancer chr1.97 targeted for multiplex CRISPR interference;experiment=EXISTENCE:mutant phenotype evidence [ECO:0000015][PMID:30612741];function=positively regulates NADK gene expression in K562 cells;gbkey=regulatory;regulatory_class=transcriptional_cis_regulatory_region
Just to confirm, I was able to reproduce the problem now, previously most likely ended up testing a different GFF file.
I just pushed a commit which should make parsing of your GFF possible. Please try it out
Hey! Sorry for the late reply, i just have time to try it today. I installed bcftools through http://www.htslib.org/download/. How can I tested your commit? I have clone your repositories as written http://samtools.github.io/bcftools/howtos/install.html , however it still calls the bcftools from the first download.
If your clone is in /path/to/bcftools-git-clone
and the compilation was successful, there will be a bcftools binary in that directory. You can run it as /path/to/bcftools-git-clone/bcftools
.
Hello, now i am getting a new error message:
Parsing /home/marissa/Downloads/genomic_chr.gff3 ...
Warning: Ignoring transcript with unknown biotype .. chr1 BestRefSeq transcript 11874 14409 . + . ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
Warning: The GFF contains features with duplicate id .. chr1 RefSeq cDNA_match 1588825 1588948 . - . ID=f3da58d96790abfb3974185d5b339f6d;Target=NM_001787.3 92 215 +;consensus_splices=32;exon_identity=0.993858;for_remapping=2;gap_count=2;identity=0.907621;idty=1;matches=2751;num_ident=2751;num_mismatch=3;pct_coverage=90.8611;pct_coverage_hiqual=90.8611;pct_identity_gap=99.3858;pct_identity_ungap=99.8911;product_coverage=0.91323;rank=1;splices=32
Note: truncated transcript rna-NM_001306072.3 with incomplete CDS (this is very common)
Error: CDS overlap in the transcript rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994, is this intended (e.g. ribosomal slippage)?
Use the --force option to override (at your own risk).
When i use --force:
Parsing /home/marissa/Downloads/genomic_chr.gff3 ...
Warning: Ignoring transcript with unknown biotype .. chr1 BestRefSeq transcript 11874 14409 . + . ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
Warning: The GFF contains features with duplicate id .. chr1 RefSeq cDNA_match 1588825 1588948 . - . ID=f3da58d96790abfb3974185d5b339f6d;Target=NM_001787.3 92 215 +;consensus_splices=32;exon_identity=0.993858;for_remapping=2;gap_count=2;identity=0.907621;idty=1;matches=2751;num_ident=2751;num_mismatch=3;pct_coverage=90.8611;pct_coverage_hiqual=90.8611;pct_identity_gap=99.3858;pct_identity_ungap=99.8911;product_coverage=0.91323;rank=1;splices=32
Note: truncated transcript rna-NM_001306072.3 with incomplete CDS (this is very common)
Warning: GFF contains overlapping CDS rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994.
Indexed 67150 transcripts, 767984 exons, 691224 CDSs, 0 UTRs
Warning: 33328 warnings were supressed, run with `--verbose 2` to see them all
Calling...
Error: the fasta reference does not match the VCF REF allele at chr1:111957501 .. fasta=C vcf=*
I used fasta ref from ncbi as well hg19.
This is the position that cause the error (annotated using ensembl gff3):
chr1 111957501 rs201350653 CTCACAGACTGATGACTCACAGGGG C 3027.6 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.273;DB;DP=71;ExcessHet=3.0103;FS=2.269;MLEAC=1;MLEAF=0.5;MQ=60.0;MQRankSum=0.0;QD=30.56;ReadPosRankSum=2.7;SOR=1.104;BCSQ=inframe_deletion|OVGP1|ENST00000369732|protein_coding|-|533TPVSHQSVS>518S|111957501CTCACAGACTGATGACTCACAGGGG>C;GI=OVGP1;TI=NM_002557;FC=Deletion_S533S GT:AD:DP:GQ:PL:VF:GQX:BCSQ 0/1:42,21:63:99:3035,0,6479:0.333333:99:3
Unfortunately this does not give enough information, it also confuses me that you're showing an annotated line, so where the error comes from? Can you please provide a small test case, the reference, GFF and the VCF, command line, the exact bcftools version (bcftools --version
), best in a separate issue?
Hallo, I tried to upload it here, but the test size is too big. Can i share it via email or something else ? Yeah, I understand the confusion. I tried to take the example on how it works with gff3 from ensemble.
Best if you can provide a link for us to download? As a last resort you could use a temporary storage, for example somewhere like https://wetransfer.com/
Here is the link to the test.vcf, fasta refseq and gff3 we use: https://we.tl/t-plVJprFyWE
It works now with --force
, but without there is still the same error:
Error: CDS overlap in the transcript rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994, is this intended (e.g. ribosomal slippage)? Use the --force option to override (at your own risk).
gff3 files are gained from: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.25/ (download -> refseq only -> gff -> change the seqid to chr*)
bcftools used are the one from: /path/to/bcftools-git-clone/bcftools
(version: bcftools 1.17-43-g875c067b)
I've got the test data, thank you. Can you show also the exact command you are running and the error you are geting? For me this runs okay:
bcftools csq -g genomic_chr.gff3 -f hg19GenRef.fa test.vcf -o /dev/null -l
The code is: bcftools/bcftools csq test.vcf -g genomic_chr.gff3 -f hg19GenRef.fa -o new_test.vcf
(without -l)
I tested with -l
, it produces the same error message
I tried the exact code as yours, it produces the same error message as well
The error message:
Parsing Documents/test_file/genomic_chr.gff3 ... Warning: Ignoring transcript with unknown biotype .. chr1 BestRefSeq transcript 11874 14409 . + . ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2 Warning: The GFF contains features with duplicate id .. chr1 RefSeq cDNA_match 1588825 1588948 . - . ID=f3da58d96790abfb3974185d5b339f6d;Target=NM_001787.3 92 215 +;consensus_splices=32;exon_identity=0.993858;for_remapping=2;gap_count=2;identity=0.907621;idty=1;matches=2751;num_ident=2751;num_mismatch=3;pct_coverage=90.8611;pct_coverage_hiqual=90.8611;pct_identity_gap=99.3858;pct_identity_ungap=99.8911;product_coverage=0.91323;rank=1;splices=32 Note: truncated transcript rna-NM_001306072.3 with incomplete CDS (this is very common) Error: CDS overlap in the transcript rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994, is this intended (e.g. ribosomal slippage)? Use the --force option to override (at your own risk).
Okay, I misunderstood: you were getting an error only when the --force
option was not given but it worked otherwise.
That was the expected behavior because at the time of writing the code I did not have enough GFFs to test overlapping coding exons within the same transcript. Therefore the program would exit with an error and give a hint for the user to check manually if this was due to a ribosomal slippage. (This is usually described in the GFF annotations.)
There is a new commit now https://github.com/samtools/bcftools/commit/a8249495a6b8360bb8a0e447da0cc0a372a98a46 which makes overlapping CDS into a warning instead of an error.
I consider this problem fully resolved now, please open a new issue if anything else crops up.
No, I get an error with and without --force
as well
Then please show what the output looks like. I only see the ouput without the --force
option and I was not able to reproduce the error with your data myself.
Also above you say "It works now with --force, but without there is still the same error". I am very confused.
Is the new commit pushed? When i tried to pull, it said, that it is all up to date.
With --force
appears warnings and the output is empty.
Warnings:
Parsing Documents/test_file/genomic_chr.gff3 ...
Warning: Ignoring transcript with unknown biotype .. chr1 BestRefSeq transcript 11874 14409 . + . ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
Warning: The GFF contains features with duplicate id .. chr1 RefSeq cDNA_match 1588825 1588948 . - . ID=f3da58d96790abfb3974185d5b339f6d;Target=NM_001787.3 92 215 +;consensus_splices=32;exon_identity=0.993858;for_remapping=2;gap_count=2;identity=0.907621;idty=1;matches=2751;num_ident=2751;num_mismatch=3;pct_coverage=90.8611;pct_coverage_hiqual=90.8611;pct_identity_gap=99.3858;pct_identity_ungap=99.8911;product_coverage=0.91323;rank=1;splices=32
Note: truncated transcript rna-NM_001306072.3 with incomplete CDS (this is very common)
Warning: GFF contains overlapping CDS rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994.
Indexed 67150 transcripts, 767984 exons, 691224 CDSs, 0 UTRs
Warning: 33328 warnings were supressed, run with --verbose 2 to see them all
Calling...
Unphased heterozygous genotype at 1:111957501, sample test. See the --phase option.
The program is complaining about unphased genotypes and asks to specify what to do with them
Unphased heterozygous genotype at 1:111957501, sample test. See the --phase option.
Hence the -l
used in the command in my example above https://github.com/samtools/bcftools/issues/1898#issuecomment-1563962316.
Note I reformatted your comment so that the final line stands out, the github markdown ignores newlines otherwise.
It works now! Thank you for your cooperation and support!
Hallo, I want to find the functional consequences for the variants in vcf file. I have been using bcftools csq, however the problem is, that I want to use ucsc and have been using gff3 from Ensemble. I could not find the gff3 from ucsc, only gtf. Can you maybe give me some suggestions ? Thank you