samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
636 stars 242 forks source link

bcftools csq with ucsc gff3 #1898

Closed marissa97 closed 1 year ago

marissa97 commented 1 year ago

Hallo, I want to find the functional consequences for the variants in vcf file. I have been using bcftools csq, however the problem is, that I want to use ucsc and have been using gff3 from Ensemble. I could not find the gff3 from ucsc, only gtf. Can you maybe give me some suggestions ? Thank you

pd3 commented 1 year ago

The two file formats are not worlds apart, but it depends. Don't know if your GTF contains all the necessary information, see the documentation to check the required GFF structure here:

https://github.com/samtools/bcftools/blob/e3dda270c44ebd0781f1e825aeafa117bc885049/doc/bcftools.txt#L1314-L1342

marissa97 commented 1 year ago

The gtf structure is quite different, there is no Parent-structure in ucsc-gtf. I tried to used ncbi gff3 now, with a slightly different format, it still does not work. I used ncbi gff3 from following link: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.25/ (Download > Refseq only > gff)

pd3 commented 1 year ago

Please check if you are working with the latest version of bcftools and if you think there is a problem, please provide a small VCF test case to reproduce the problem. I believe the program should be able to parse NCBI's GFF.

marissa97 commented 1 year ago

I am using bcftools 1.10.2 Here is a sub-set of a test file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  9440
chr1    977330  rs2799066   T   C   5202.06 .   AC=2;AF=1;AN=2;BaseQRankSum=-1.607;DB;DP=167;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=31.34;ReadPosRankSum=-0.523;SOR=0.609   GT:AD:DP:GQ:PL:VF:GQX   1/1:1,165:166:99:5216,455,0:0.993976:99
chr1    981931  rs2465128   A   G   8099.06 .   AC=2;AF=1;AN=2;DB;DP=247;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=33.61;SOR=0.843 GT:AD:DP:GQ:PL:VF:GQX   1/1:0,241:241:99:8113,715,0:1:99
chr1    982994  rs10267 T   C   4726.06 .   AC=2;AF=1;AN=2;DB;DP=140;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=33.76;SOR=0.844 GT:AD:DP:GQ:PL:VF:GQX   1/1:0,140:140:99:4740,420,0:1:99
chr1    984302  rs9442391   T   C   4414.06 .   AC=2;AF=1;AN=2;DB;DP=157;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=30.23;SOR=0.693 GT:AD:DP:GQ:PL:VF:GQX   1/1:0,146:152:99:4428,437,0:0.960526:99
chr1    987200  rs9803031   C   T   4439.06 .   AC=2;AF=1;AN=2;DB;DP=131;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=34.15;SOR=0.966 GT:AD:DP:GQ:PL:VF:GQX   1/1:0,130:130:99:4453,390,0:1:99
chr1    990280  rs4275402   C   T   7236.06 .   AC=2;AF=1;AN=2;DB;DP=193;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=25.36;SOR=1.931 GT:AD:DP:GQ:PL:VF:GQX   1/1:0,190:190:99:7250,570,0:1:99
chr1    1269554 rs307377    T   C   9079.06 .   AC=2;AF=1;AN=2;BaseQRankSum=1.19;DB;DP=302;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;MQRankSum=0;QD=32.2;ReadPosRankSum=0.948;SOR=0.878   GT:AD:DP:GQ:PL:VF:GQX   1/1:2,280:282:99:9093,797,0:0.992908:99
chr1    1635004 rs874516    T   C   2781.64 .   AC=1;AF=0.5;AN=2;BaseQRankSum=-0.193;DB;DP=205;ExcessHet=3.0103;FS=1.714;MLEAC=1;MLEAF=0.5;MQ=55.75;MQRankSum=-6.805;QD=13.64;ReadPosRankSum=-0.231;SOR=0.849   GT:AD:DP:GQ:PL:VF:GQX   0/1:96,108:204:99:2789,0,2663:0.529412:99
chr1    1635749 rs1136997   C   A   3878.64 .   AC=1;AF=0.5;AN=2;BaseQRankSum=2.34;DB;DP=234;ExcessHet=3.0103;FS=6.52;MLEAC=1;MLEAF=0.5;MQ=54.07;MQRankSum=-4.721;QD=17.24;ReadPosRankSum=0.361;SOR=1.086   GT:AD:DP:GQ:PL:VF:GQX   0/1:95,130:225:99:3886,0,2369:0.577778:99
chr1    1636044 rs1059822   G   A   3076.64 .   AC=1;AF=0.5;AN=2;BaseQRankSum=0.93;DB;DP=221;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=50.77;MQRankSum=-8.235;QD=14.18;ReadPosRankSum=-0.035;SOR=0.71  GT:AD:DP:GQ:PL:VF:GQX   0/1:108,109:217:99:3084,0,2912:0.502304:99

I used following code:

bcftools csq -f hg19GenRef.fa -g genomic_chr.gff3 test.vcf -Ov -o testInfo.vcf -pR

This is the logs:

Parsing genomic_chr.gff3 ...
ignored: chr1   RefSeq  region  1   249250621   .   +   .   ID=NC_000001.10:1..249250621;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
ignored: chr1   BestRefSeq  pseudogene  11874   14409   .   +   .ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1 (pseudogene);gbkey=Gene;gene=DDX11L1;gene_biotype=transcribed_pseudogene;pseudo=true
ignored: chr1   BestRefSeq  transcript  11874   14409   .   +   .ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
[csq.c:729 gff_id_parse] Could not parse the line, "Parent=transcript:" not present: chr1   BestRefSeq  exon    11874   12227   .   +   .   ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2

and no output ..

pd3 commented 1 year ago

You'll need to update to the latest version 1.17. There was an update to GFF parsing code which, I tried, works with https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.25/

marissa97 commented 1 year ago

I updated the bcftools to 1.17 now, but still has not work. I also used the same gff, but I added "chr" in the annotation. Could it be the problem? How did you do it with the test data?

It still has not result any outputs. This is now the logs:

(base) marissa@moldiag:~$ bcftools csq -f /home/marissa/NGA_Pipeline/parameterFile/hg19GenRef.fa -g /home/marissa/Downloads/genomic_chr.gff3 /home/marissa/66111_S1/66111_S1.noGI.vcf -Ov -o /home/marissa/66111_S1/66111_S1.testInfo.vcf -pR
Parsing /home/marissa/Downloads/genomic_chr.gff3 ...
ignored: chr1   RefSeq  region  1   249250621   .   +   .   ID=NC_000001.10:1..249250621;Dbxref=taxon:9606;Name=1;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA
ignored: chr1   BestRefSeq  pseudogene  11874   14409   .   +   .ID=gene-DDX11L1;Dbxref=GeneID:100287102,HGNC:HGNC:37102;Name=DDX11L1;description=DEAD/H-box helicase 11 like 1 (pseudogene);gbkey=Gene;gene=DDX11L1;gene_biotype=transcribed_pseudogene;pseudo=true
ignored transcript, unknown biotype: chr1   BestRefSeq  transcript  11874   14409   .   +   .   ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
Warning: non-standard gene Parent notation in the GFF, expected "Parent=transcript:XXX", found chr1 BestRefSeq  exon    11874   12227   .   +   .ID=exon-NR_046018.2-1;Parent=rna-NR_046018.2;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
ignored: chr1   BestRefSeq  pseudogene  14362   29370   .   -   .ID=gene-WASH7P;Dbxref=GeneID:653635,HGNC:HGNC:38034;Name=WASH7P;description=WASP family homolog 7%2C pseudogene;gbkey=Gene;gene=WASH7P;gene_biotype=transcribed_pseudogene;gene_synonym=FAM39F,WASH5P;pseudo=true
ignored transcript, unknown biotype: chr1   BestRefSeq  transcript  14362   29370   .   -   .   ID=rna-NR_024540.1;Parent=gene-WASH7P;Dbxref=GeneID:653635,Genbank:NR_024540.1,HGNC:HGNC:38034;Name=NR_024540.1;gbkey=misc_RNA;gene=WASH7P;product=WASP family homolog 7%2C pseudogene;pseudo=true;transcript_id=NR_024540.1
Warning: non-standard gene ID notation in the GFF, expected "ID=gene:XXX", found chr1   BestRefSeq  gene    17369   17436   .   -   .   ID=gene-MIR6859-1;Dbxref=GeneID:102466751,HGNC:HGNC:50039,miRBase:MI0022705;Name=MIR6859-1;description=microRNA 6859-1;gbkey=Gene;gene=MIR6859-1;gene_biotype=miRNA;gene_synonym=hsa-mir-6859-1
ignored: chr1   BestRefSeq  primary_transcript  17369   17436   .   -.  ID=rna-NR_106918.1;Parent=gene-MIR6859-1;Dbxref=GeneID:102466751,Genbank:NR_106918.1,HGNC:HGNC:50039,miRBase:MI0022705;Name=NR_106918.1;gbkey=precursor_RNA;gene=MIR6859-1;product=microRNA 6859-1;transcript_id=NR_106918.1
ignored: chr1   BestRefSeq  miRNA   17369   17391   .   -   .   ID=rna-MIR6859-1;Parent=rna-NR_106918.1;Dbxref=GeneID:102466751,miRBase:MIMAT0027619,HGNC:HGNC:50039,miRBase:MI0022705;gbkey=ncRNA;gene=MIR6859-1;product=hsa-miR-6859-3p
ignored: chr1   BestRefSeq  miRNA   17409   17431   .   -   .   ID=rna-MIR6859-1-2;Parent=rna-NR_106918.1;Dbxref=GeneID:102466751,miRBase:MIMAT0027618,HGNC:HGNC:50039,miRBase:MI0022705;gbkey=ncRNA;gene=MIR6859-1;product=hsa-miR-6859-5p
ignored: chr1   BestRefSeq  primary_transcript  30366   30503   .   +.  ID=rna-NR_036051.1;Parent=gene-MIR1302-2;Dbxref=GeneID:100302278,Genbank:NR_036051.1,HGNC:HGNC:35294,miRBase:MI0006363;Name=NR_036051.1;gbkey=precursor_RNA;gene=MIR1302-2;product=microRNA 1302-2;transcript_id=NR_036051.1
ignored: chr1   BestRefSeq  miRNA   30438   30458   .   +   .   ID=rna-MIR1302-2;Parent=rna-NR_036051.1;Dbxref=GeneID:100302278,miRBase:MIMAT0005890,HGNC:HGNC:35294,miRBase:MI0006363;gbkey=ncRNA;gene=MIR1302-2;product=hsa-miR-1302
ignored: chr1   BestRefSeq  lnc_RNA 34611   36081   .   -   .   ID=rna-NR_026818.1;Parent=gene-FAM138A;Dbxref=GeneID:645520,Genbank:NR_026818.1,HGNC:HGNC:32334;Name=NR_026818.1;gbkey=ncRNA;gene=FAM138A;product=family with sequence similarity 138 member A;transcript_id=NR_026818.1
ignored: chr1   Curated Genomic pseudogene  52453   53396   .   +   .ID=gene-OR4G4P;Dbxref=GeneID:79504,HGNC:HGNC:14822;Name=OR4G4P;description=olfactory receptor family 4 subfamily G member 4 pseudogene;gbkey=Gene;gene=OR4G4P;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  63016   63885   .   +   .ID=gene-OR4G11P;Dbxref=GeneID:403263,HGNC:HGNC:31276;Name=OR4G11P;description=olfactory receptor family 4 subfamily G member 11 pseudogene;gbkey=Gene;gene=OR4G11P;gene_biotype=pseudogene;pseudo=true
ignored: chr1   BestRefSeq  mRNA    65419   71585   .   +   .   ID=rna-NM_001005484.2;Parent=gene-OR4F5;Dbxref=GeneID:79501,Genbank:NM_001005484.2,HGNC:HGNC:14825;Name=NM_001005484.2;gbkey=mRNA;gene=OR4F5;product=olfactory receptor family 4 subfamily F member 5;tag=RefSeq Select;transcript_id=NM_001005484.2
ignored: chr1   Curated Genomic pseudogene  126642  129225  .   -   .ID=gene-SEPTIN14P18;Dbxref=GeneID:107126284,HGNC:HGNC:51705;Name=SEPTIN14P18;description=septin 14 pseudogene 18;gbkey=Gene;gene=SEPTIN14P18;gene_biotype=pseudogene;gene_synonym=SEPT14P18;pseudo=true
ignored: chr1   Curated Genomic pseudogene  131068  134836  .   +   .ID=gene-CICP27;Dbxref=GeneID:100420257,HGNC:HGNC:48835;Name=CICP27;description=capicua transcriptional repressor pseudogene 27;gbkey=Gene;gene=CICP27;gene_biotype=pseudogene;pseudo=true
ignored: chr1   BestRefSeq  lnc_RNA 134773  140566  .   -   .   ID=rna-NR_039983.2;Parent=gene-LOC729737;Dbxref=GeneID:729737,Genbank:NR_039983.2;Name=NR_039983.2;gbkey=ncRNA;gene=LOC729737;product=uncharacterized LOC729737;transcript_id=NR_039983.2
ignored: chr1   Curated Genomic pseudogene  157784  157887  .   -   .ID=gene-RNU6-1100P;Dbxref=GeneID:106480049,HGNC:HGNC:48063;Name=RNU6-1100P;description=RNA%2C U6 small nuclear 1100%2C pseudogene;gbkey=Gene;gene=RNU6-1100P;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  164689  164792  .   +   .ID=gene-GTF2IP10;Dbxref=GeneID:107403151,HGNC:HGNC:51722;Name=GTF2IP10;description=general transcription factor IIi pseudogene 10;gbkey=Gene;gene=GTF2IP10;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  228262  228787  .   -   .ID=gene-RPL23AP21;Dbxref=GeneID:728481,HGNC:HGNC:35827;Name=RPL23AP21;description=ribosomal protein L23a pseudogene 21;gbkey=Gene;gene=RPL23AP21;gene_biotype=pseudogene;gene_synonym=RPL23A_1_1;pseudo=true
ignored: chr1   BestRefSeq  lnc_RNA 323892  328581  .   +   .   ID=rna-NR_028322.1;Parent=gene-LOC100132287;Dbxref=GeneID:100132287,Genbank:NR_028322.1;Name=NR_028322.1;gbkey=ncRNA;gene=LOC100132287;product=uncharacterized LOC100132287;transcript_id=NR_028322.1
ignored: chr1   Curated Genomic pseudogene  328518  332282  .   -   .ID=gene-CICP7;Dbxref=GeneID:100288667,HGNC:HGNC:37756;Name=CICP7;description=capicua transcriptional repressor pseudogene 7;gbkey=Gene;gene=CICP7;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  334127  334504  .   +   .ID=gene-SEPTIN14P13;Dbxref=GeneID:106481733,HGNC:HGNC:51700;Name=SEPTIN14P13;description=septin 14 pseudogene 13;gbkey=Gene;gene=SEPTIN14P13;gene_biotype=pseudogene;gene_synonym=SEPT14P13;pseudo=true
ignored: chr1   BestRefSeq  mRNA    367659  368597  .   +   .   ID=rna-NM_001005221.2;Parent=gene-OR4F29;Dbxref=GeneID:729759,Genbank:NM_001005221.2,HGNC:HGNC:31275;Name=NM_001005221.2;gbkey=mRNA;gene=OR4F29;product=olfactory receptor family 4 subfamily F member 29;tag=RefSeq Select;transcript_id=NM_001005221.2
ignored: chr1   Curated Genomic pseudogene  379067  379573  .   -   .ID=gene-WBP1LP7;Dbxref=GeneID:106479030,HGNC:HGNC:43955;Name=WBP1LP7;description=WBP1L pseudogene 7;gbkey=Gene;gene=WBP1LP7;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  470896  471368  .   +   .ID=gene-RPL23AP24;Dbxref=GeneID:728517,HGNC:HGNC:36176;Name=RPL23AP24;description=ribosomal protein L23a pseudogene 24;gbkey=Gene;gene=RPL23AP24;gene_biotype=pseudogene;gene_synonym=RPL23A_1_2;pseudo=true
ignored: chr1   BestRefSeq  lnc_RNA 562760  564389  .   -   .   ID=rna-NR_125957.1;Parent=gene-LOC101928626;Dbxref=GeneID:101928626,Genbank:NR_125957.1;Name=NR_125957.1;gbkey=ncRNA;gene=LOC101928626;product=uncharacterized LOC101928626;transcript_id=NR_125957.1
ignored: chr1   Curated Genomic pseudogene  564442  564820  .   +   .ID=gene-MTND1P23;Dbxref=GeneID:100887749,HGNC:HGNC:42092;Name=MTND1P23;description=MT-ND1 pseudogene 23;gbkey=Gene;gene=MTND1P23;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  565020  566066  .   +   .ID=gene-MTND2P28;Dbxref=GeneID:100652939,HGNC:HGNC:42129;Name=MTND2P28;description=MT-ND2 pseudogene 28;gbkey=Gene;gene=MTND2P28;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  566454  567996  .   +   .ID=gene-MTCO1P12;Dbxref=GeneID:107075141,HGNC:HGNC:52014;Name=MTCO1P12;description=MT-CO1 pseudogene 12;gbkey=Gene;gene=MTCO1P12;gene_biotype=pseudogene;pseudo=true
ignored: chr1   RefSeqFE    silencer    567489  567783  .   .   .ID=id-GeneID:121967041;Dbxref=GeneID:121967041;Note=tiled region #5852%3B K562 Repressive DNase matched - State 24:Quies;experiment=EXISTENCE:reporter gene assay evidence [ECO:0000049][PMID:27701403];function=represses a minimal TATA promoter and a strong SV40 promoter by Sharpr-MPRA in K562 cells {active_cell/tissue: K562};gbkey=regulatory;regulatory_class=silencer
ignored: chr1   RefSeqFE    biological_region   567489  567783  .   +.  ID=id-GeneID:121967041-2;Dbxref=GeneID:121967041;Name=biological region;gbkey=Region;standard_name=Sharpr-MPRA regulatory region 5852
ignored: chr1   BestRefSeq  primary_transcript  567995  568065  .   -.  ID=rna-NR_162149.1;Parent=gene-MIR12136;Dbxref=GeneID:113219467,Genbank:NR_162149.1,HGNC:HGNC:54043,miRBase:MI0039740;Name=NR_162149.1;gbkey=precursor_RNA;gene=MIR12136;product=microRNA 12136;transcript_id=NR_162149.1
ignored: chr1   BestRefSeq  miRNA   568048  568065  .   -   .   ID=rna-MIR12136;Parent=rna-NR_162149.1;Dbxref=GeneID:113219467,miRBase:MIMAT0049032,HGNC:HGNC:54043,miRBase:MI0039740;gbkey=ncRNA;gene=MIR12136;product=hsa-miR-12136
ignored: chr1   Curated Genomic pseudogene  568137  568818  .   +   .ID=gene-MTCO2P12;Dbxref=GeneID:107075310,HGNC:HGNC:52028;Name=MTCO2P12;description=MT-CO2 pseudogene 12;gbkey=Gene;gene=MTCO2P12;gene_biotype=pseudogene;gene_synonym=COII,COX2,COXII,MT-CO2,MTCO2;pseudo=true
ignored: chr1   Curated Genomic pseudogene  568915  569121  .   +   .ID=gene-MTATP8P1;Dbxref=GeneID:106480795,HGNC:HGNC:44571;Name=MTATP8P1;description=MT-ATP8 pseudogene 1;gbkey=Gene;gene=MTATP8P1;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  569076  569756  .   +   .ID=gene-MTATP6P1;Dbxref=GeneID:106480796,HGNC:HGNC:44575;Name=MTATP6P1;description=MT-ATP6 pseudogene 1;gbkey=Gene;gene=MTATP6P1;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  569756  570302  .   +   .ID=gene-MTCO3P12;Dbxref=GeneID:107075270,HGNC:HGNC:52042;Name=MTCO3P12;description=MT-CO3 pseudogene 12;gbkey=Gene;gene=MTCO3P12;gene_biotype=pseudogene;pseudo=true
ignored: chr1   Curated Genomic pseudogene  610123  610626  .   +   .ID=gene-WBP1LP6;Dbxref=GeneID:106481797,HGNC:HGNC:43956;Name=WBP1LP6;description=WBP1L pseudogene 6;gbkey=Gene;gene=WBP1LP6;gene_biotype=pseudogene;pseudo=true
ignored: chr1   BestRefSeq  mRNA    621096  622034  .   -   .   ID=rna-NM_001005277.1;Parent=gene-OR4F16;Dbxref=GeneID:81399,Genbank:NM_001005277.1,HGNC:HGNC:15079;Name=NM_001005277.1;gbkey=mRNA;gene=OR4F16;product=olfactory receptor family 4 subfamily F member 16;tag=RefSeq Select;transcript_id=NM_001005277.1
ignored: chr1   Curated Genomic pseudogene  653001  655582  .   -   .ID=gene-SEPTIN14P14;Dbxref=GeneID:107105354,HGNC:HGNC:51701;Name=SEPTIN14P14;description=septin 14 pseudogene 14;gbkey=Gene;gene=SEPTIN14P14;gene_biotype=pseudogene;gene_synonym=SEPT14P14;pseudo=true
ignored: chr1   Curated Genomic pseudogene  657426  661202  .   +   .ID=gene-CICP3;Dbxref=GeneID:100132630,HGNC:HGNC:37742;Name=CICP3;description=capicua transcriptional repressor pseudogene 3;gbkey=Gene;gene=CICP3;gene_biotype=pseudogene;pseudo=true
ignored: chr1   BestRefSeq  lnc_RNA 661139  714014  .   -   .   ID=rna-NR_168328.1;Parent=gene-LOC100288069;Dbxref=GeneID:100288069,Genbank:NR_168328.1;Name=NR_168328.1;gbkey=ncRNA;gene=LOC100288069;product=uncharacterized LOC100288069;transcript_id=NR_168328.1
ignored: chr1   Curated Genomic pseudogene  693613  693716  .   -   .ID=gene-RNU6-1199P;Dbxref=GeneID:106480091,HGNC:HGNC:48162;Name=RNU6-1199P;description=RNA%2C U6 small nuclear 1199%2C pseudogene;gbkey=Gene;gene=RNU6-1199P;gene_biotype=pseudogene;pseudo=true
ignored: chr1   BestRefSeq  lnc_RNA 752751  755214  .   +   .   ID=rna-NR_103536.1;Parent=gene-FAM87B;Dbxref=GeneID:400728,Genbank:NR_103536.1,HGNC:HGNC:32236;Name=NR_103536.1;gbkey=ncRNA;gene=FAM87B;product=family with sequence similarity 87 member B;transcript_id=NR_103536.1
ignored: chr1   BestRefSeq  lnc_RNA 761586  762902  .   -   .   ID=rna-NR_024321.1;Parent=gene-LINC00115;Dbxref=GeneID:79854,Genbank:NR_024321.1,HGNC:HGNC:26211;Name=NR_024321.1;gbkey=ncRNA;gene=LINC00115;product=long intergenic non-protein coding RNA 115;transcript_id=NR_024321.1
ignored: chr1   BestRefSeq  lnc_RNA 762971  794826  .   +   .   ID=rna-NR_047519.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047519.1,HGNC:HGNC:49377;Name=NR_047519.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 1;transcript_id=NR_047519.1
ignored: chr1   BestRefSeq  lnc_RNA 762971  794826  .   +   .   ID=rna-NR_047521.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047521.1,HGNC:HGNC:49377;Name=NR_047521.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 3;transcript_id=NR_047521.1
ignored: chr1   BestRefSeq  lnc_RNA 762971  794826  .   +   .   ID=rna-NR_047523.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047523.1,HGNC:HGNC:49377;Name=NR_047523.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 5;transcript_id=NR_047523.1
ignored: chr1   BestRefSeq  lnc_RNA 762971  794826  .   +   .   ID=rna-NR_047524.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047524.1,HGNC:HGNC:49377;Name=NR_047524.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 7;transcript_id=NR_047524.1
ignored: chr1   BestRefSeq  lnc_RNA 762971  778984  .   +   .   ID=rna-NR_047526.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047526.1,HGNC:HGNC:49377;Name=NR_047526.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 9;transcript_id=NR_047526.1
ignored: chr1   BestRefSeq  lnc_RNA 763178  794826  .   +   .   ID=rna-NR_047525.1;Parent=gene-LINC01128;Dbxref=GeneID:643837,Genbank:NR_047525.1,HGNC:HGNC:49377;Name=NR_047525.1;gbkey=ncRNA;gene=LINC01128;product=long intergenic non-protein coding RNA 1128%2C transcript variant 8;transcript_id=NR_047525.1
ignored: chr1   BestRefSeq  lnc_RNA 803451  812182  .   -   .   ID=rna-NR_027055.1;Parent=gene-FAM41C;Dbxref=GeneID:284593,Genbank:NR_027055.1,HGNC:HGNC:27635;Name=NR_027055.1;gbkey=ncRNA;gene=FAM41C;product=family with sequence similarity 41 member C;transcript_id=NR_027055.1
ignored: chr1   Curated Genomic pseudogene  808671  809894  .   +   .ID=gene-TUBB8P11;Dbxref=GeneID:388579,HGNC:HGNC:32337;Name=TUBB8P11;description=tubulin beta 8 class VIII pseudogene 11;gbkey=Gene;gene=TUBB8P11;gene_biotype=pseudogene;pseudo=true
ignored: chr1   BestRefSeq  lnc_RNA 852198  855072  .   -   .   ID=rna-NR_122045.1;Parent=gene-LINC02593;Dbxref=GeneID:100130417,Genbank:NR_122045.1,HGNC:HGNC:53933;Name=NR_122045.1;gbkey=ncRNA;gene=LINC02593;product=long intergenic non-protein coding RNA 2593%2C transcript variant 2;transcript_id=NR_122045.1
ignored: chr1   BestRefSeq  lnc_RNA 852198  855072  .   -   .   ID=rna-NR_026874.2;Parent=gene-LINC02593;Dbxref=GeneID:100130417,Genbank:NR_026874.2,HGNC:HGNC:53933;Name=NR_026874.2;gbkey=ncRNA;gene=LINC02593;product=long intergenic non-protein coding RNA 2593%2C transcript variant 1;transcript_id=NR_026874.2
ignored: chr1   BestRefSeq  lnc_RNA 855623  860984  .   -   .   ID=rna-NR_168405.1;Parent=gene-LOC107985728;Dbxref=GeneID:107985728,Genbank:NR_168405.1;Name=NR_168405.1;gbkey=ncRNA;gene=LOC107985728;product=collagen alpha-1(III) chain-like;transcript_id=NR_168405.1
ignored: chr1   BestRefSeq  mRNA    859303  879954  .   +   .   ID=rna-NM_001385640.1;Parent=gene-SAMD11;Dbxref=GeneID:148398,Genbank:NM_001385640.1,HGNC:HGNC:28706,MIM:616765;Name=NM_001385640.1;gbkey=mRNA;gene=SAMD11;product=sterile alpha motif domain containing 11%2C transcript variant 2;transcript_id=NM_001385640.1
ignored: chr1   BestRefSeq  mRNA    859303  879954  .   +   .   ID=rna-NM_001385641.1;Parent=gene-SAMD11;Dbxref=GeneID:148398,Genbank:NM_001385641.1,HGNC:HGNC:28706,MIM:616765;Name=NM_001385641.1;gbkey=mRNA;gene=SAMD11;product=sterile alpha motif domain containing 11%2C transcript variant 1;tag=RefSeq Select;transcript_id=NM_001385641.1
ignored: chr1   BestRefSeq  mRNA    861111  879954  .   +   .   ID=rna-NM_152486.4;Parent=gene-SAMD11;Dbxref=GeneID:148398,Genbank:NM_152486.4,HGNC:HGNC:28706,MIM:616765;Name=NM_152486.4;gbkey=mRNA;gene=SAMD11;product=sterile alpha motif domain containing 11%2C transcript variant 3;transcript_id=NM_152486.4
ignored: chr1   BestRefSeq  mRNA    879583  894636  .   -   .   ID=rna-NM_015658.4;Parent=gene-NOC2L;Dbxref=GeneID:26155,Genbank:NM_015658.4,HGNC:HGNC:24517,MIM:610770;Name=NM_015658.4;gbkey=mRNA;gene=NOC2L;product=NOC2 like nucleolar associated transcriptional repressor;tag=RefSeq Select;transcript_id=NM_015658.4
ignored: chr1   BestRefSeq  mRNA    895964  901099  .   +   .   ID=rna-NM_198317.3;Parent=gene-KLHL17;Dbxref=GeneID:339451,Genbank:NM_198317.3,HGNC:HGNC:24023,MIM:619262;Name=NM_198317.3;gbkey=mRNA;gene=KLHL17;product=kelch like family member 17;tag=RefSeq Select;transcript_id=NM_198317.3
ignored: chr1   BestRefSeq  mRNA    901862  911245  .   +   .   ID=rna-NM_001367552.1;Parent=gene-PLEKHN1;Dbxref=GeneID:84069,Genbank:NM_001367552.1,HGNC:HGNC:25284;Name=NM_001367552.1;gbkey=mRNA;gene=PLEKHN1;product=pleckstrin homology domain containing N1%2C transcript variant 3;transcript_id=NM_001367552.1
ignored: chr1   BestRefSeq  mRNA    901862  911245  .   +   .   ID=rna-NM_001160184.2;Parent=gene-PLEKHN1;Dbxref=GeneID:84069,Genbank:NM_001160184.2,HGNC:HGNC:25284;Name=NM_001160184.2;gbkey=mRNA;gene=PLEKHN1;product=pleckstrin homology domain containing N1%2C transcript variant 2;transcript_id=NM_001160184.2
ignored: chr1   BestRefSeq  mRNA    901862  911245  .   +   .   ID=rna-NM_032129.3;Parent=gene-PLEKHN1;Dbxref=GeneID:84069,Genbank:NM_032129.3,HGNC:HGNC:25284;Name=NM_032129.3;gbkey=mRNA;gene=PLEKHN1;product=pleckstrin homology domain containing N1%2C transcript variant 1;tag=RefSeq Select;transcript_id=NM_032129.3
ignored: chr1   BestRefSeq  mRNA    910578  917473  .   -   .   ID=rna-NM_001291366.2;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001291366.2,HGNC:HGNC:28208,MIM:615921;Name=NM_001291366.2;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 1;transcript_id=NM_001291366.2
ignored: chr1   BestRefSeq  mRNA    910578  917473  .   -   .   ID=rna-NM_001291367.2;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001291367.2,HGNC:HGNC:28208,MIM:615921;Name=NM_001291367.2;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 2;transcript_id=NM_001291367.2
ignored: chr1   BestRefSeq  mRNA    910578  917473  .   -   .   ID=rna-NM_001394713.1;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001394713.1,HGNC:HGNC:28208,MIM:615921;Name=NM_001394713.1;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 5;tag=RefSeq Select;transcript_id=NM_001394713.1
ignored: chr1   BestRefSeq  mRNA    910578  917473  .   -   .   ID=rna-NM_001369897.1;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001369897.1,HGNC:HGNC:28208,MIM:615921;Name=NM_001369897.1;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 3;transcript_id=NM_001369897.1
ignored: chr1   BestRefSeq  mRNA    910578  916553  .   -   .   ID=rna-NM_001369898.1;Parent=gene-PERM1;Dbxref=GeneID:84808,Genbank:NM_001369898.1,HGNC:HGNC:28208,MIM:615921;Name=NM_001369898.1;gbkey=mRNA;gene=PERM1;product=PPARGC1 and ESRR induced regulator%2C muscle 1%2C transcript variant 4;transcript_id=NM_001369898.1
ignored: chr1   BestRefSeq  mRNA    934344  935477  .   -   .   ID=rna-NM_021170.4;Parent=gene-HES4;Dbxref=GeneID:57801,Genbank:NM_021170.4,HGNC:HGNC:24149,MIM:608060;Name=NM_021170.4;gbkey=mRNA;gene=HES4;product=hes family bHLH transcription factor 4%2C transcript variant 2;tag=RefSeq Select;transcript_id=NM_021170.4
ignored: chr1   BestRefSeq  mRNA    934344  935477  .   -   .   ID=rna-NM_001142467.2;Parent=gene-HES4;Dbxref=GeneID:57801,Genbank:NM_001142467.2,HGNC:HGNC:24149,MIM:608060;Name=NM_001142467.2;gbkey=mRNA;gene=HES4;product=hes family bHLH transcription factor 4%2C transcript variant 1;transcript_id=NM_001142467.2
ignored: chr1   Curated Genomic pseudogene  943440  943670  .   -   .ID=gene-RPL39P12;Dbxref=GeneID:100270829,HGNC:HGNC:36166;Name=RPL39P12;description=ribosomal protein L39 pseudogene 12;gbkey=Gene;gene=RPL39P12;gene_biotype=pseudogene;gene_synonym=RPL39_1_3;pseudo=true
ignored: chr1   BestRefSeq  mRNA    948877  949920  .   +   .   ID=rna-NM_005101.4;Parent=gene-ISG15;Dbxref=GeneID:9636,Genbank:NM_005101.4,HGNC:HGNC:4053,MIM:147571;Name=NM_005101.4;gbkey=mRNA;gene=ISG15;product=ISG15 ubiquitin like modifier;tag=RefSeq Select;transcript_id=NM_005101.4
ignored: chr1   BestRefSeq  mRNA    955500  991496  .   +   .   ID=rna-NM_001305275.2;Parent=gene-AGRN;Dbxref=GeneID:375790,Genbank:NM_001305275.2,HGNC:HGNC:329,MIM:103320;Name=NM_001305275.2;gbkey=mRNA;gene=AGRN;product=agrin%2C transcript variant 1;transcript_id=NM_001305275.2
ignored: chr1   BestRefSeq  mRNA    955500  991496  .   +   .   ID=rna-NM_198576.4;Parent=gene-AGRN;Dbxref=GeneID:375790,Genbank:NM_198576.4,HGNC:HGNC:329,MIM:103320;Name=NM_198576.4;gbkey=mRNA;gene=AGRN;product=agrin%2C transcript variant 2;tag=RefSeq Select;transcript_id=NM_198576.4
ignored: chr1   BestRefSeq  mRNA    969373  991496  .   +   .   ID=rna-NM_001364727.2;Parent=gene-AGRN;Dbxref=GeneID:375790,Genbank:NM_001364727.2,HGNC:HGNC:329,MIM:103320;Name=NM_001364727.2;gbkey=mRNA;gene=AGRN;product=agrin%2C transcript variant 3;transcript_id=NM_001364727.2
ignored: chr1   BestRefSeq  lnc_RNA 995114  1001833 .   +   .   ID=rna-NR_148960.1;Parent=gene-LOC100288175;Dbxref=GeneID:100288175,Genbank:NR_148960.1;Name=NR_148960.1;gbkey=ncRNA;gene=LOC100288175;product=uncharacterized LOC100288175;transcript_id=NR_148960.1
ignored: chr1   BestRefSeq  lnc_RNA 1001015 1004717 .   -   .   ID=rna-NR_168432.1;Parent=gene-LOC105378948;Dbxref=GeneID:105378948,Genbank:NR_168432.1;Name=NR_168432.1;gbkey=ncRNA;gene=LOC105378948;product=uncharacterized LOC105378948%2C transcript variant 1;transcript_id=NR_168432.1
ignored: chr1   BestRefSeq  lnc_RNA 1001015 1004717 .   -   .   ID=rna-NR_168433.1;Parent=gene-LOC105378948;Dbxref=GeneID:105378948,Genbank:NR_168433.1;Name=NR_168433.1;gbkey=ncRNA;gene=LOC105378948;product=uncharacterized LOC105378948%2C transcript variant 2;transcript_id=NR_168433.1
ignored: chr1   RefSeqFE    enhancer    1004592 1004706 .   .   .ID=id-GeneID:106783496;Dbxref=GeneID:106783496;experiment=EXISTENCE:reporter gene assay evidence [ECO:0000049][PMID:17135569];function=enhancer in Jurkat T cells;gbkey=regulatory;regulatory_class=enhancer
ignored: chr1   RefSeqFE    biological_region   1004592 1004706 .   +.  ID=id-GeneID:106783496-2;Dbxref=GeneID:106783496;Name=biological region;gbkey=Region;standard_name=conserved acetylation island sequence 30 enhancer
ignored: chr1   BestRefSeq  mRNA    1006347 1009686 .   -   .   ID=rna-NM_001205252.2;Parent=gene-RNF223;Dbxref=GeneID:401934,Genbank:NM_001205252.2,HGNC:HGNC:40020;Name=NM_001205252.2;gbkey=mRNA;gene=RNF223;product=ring finger protein 223;tag=RefSeq Select;transcript_id=NM_001205252.2
ignored: chr1   BestRefSeq  mRNA    1017203 1051469 .   -   .   ID=rna-NM_017891.5;Parent=gene-C1orf159;Dbxref=GeneID:54991,Genbank:NM_017891.5,HGNC:HGNC:26062;Name=NM_017891.5;gbkey=mRNA;gene=C1orf159;product=chromosome 1 open reading frame 159%2C transcript variant 2;tag=RefSeq Select;transcript_id=NM_017891.5
ignored: chr1   BestRefSeq  mRNA    1017203 1051469 .   -   .   ID=rna-NM_001363525.2;Parent=gene-C1orf159;Dbxref=GeneID:54991,Genbank:NM_001363525.2,HGNC:HGNC:26062;Name=NM_001363525.2;gbkey=mRNA;gene=C1orf159;product=chromosome 1 open reading frame 159%2C transcript variant 3;transcript_id=NM_001363525.2
ignored: chr1   BestRefSeq  mRNA    1017203 1051469 .   -   .   ID=rna-NM_001330306.2;Parent=gene-C1orf159;Dbxref=GeneID:54991,Genbank:NM_001330306.2,HGNC:HGNC:26062;Name=NM_001330306.2;gbkey=mRNA;gene=C1orf159;product=chromosome 1 open reading frame 159%2C transcript variant 1;transcript_id=NM_001330306.2
ignored: chr1   RefSeqFE    enhancer    1047889 1048183 .   .   .ID=id-GeneID:112577469;Dbxref=GeneID:112577469;Note=tiled region #13538%3B HepG2 Activating non-DNase unmatched - State 18:Pol2%3B and K562 Activating DNase matched - State 14:Gen5';experiment=EXISTENCE:reporter gene assay evidence [ECO:0000049][PMID:27701403];function=activates a minimal TATA promoter and a strong SV40 promoter by Sharpr-MPRA in HepG2 and K562 cells;gbkey=regulatory;regulatory_class=enhancer
ignored: chr1   RefSeqFE    biological_region   1047889 1048183 .   +.  ID=id-GeneID:112577469-2;Dbxref=GeneID:112577469;Name=biological region;gbkey=Region;standard_name=Sharpr-MPRA regulatory region 13538
Unknown strand: . .. .  .   ID=id-GeneID:115801415;Dbxref=GeneID:115801415;Note=candidate enhancer chr1.97 targeted for multiplex CRISPR interference;experiment=EXISTENCE:mutant phenotype evidence [ECO:0000015][PMID:30612741];function=positively regulates NADK gene expression in K562 cells;gbkey=regulatory;regulatory_class=transcriptional_cis_regulatory_region
pd3 commented 1 year ago

Just to confirm, I was able to reproduce the problem now, previously most likely ended up testing a different GFF file.

pd3 commented 1 year ago

I just pushed a commit which should make parsing of your GFF possible. Please try it out

marissa97 commented 1 year ago

Hey! Sorry for the late reply, i just have time to try it today. I installed bcftools through http://www.htslib.org/download/. How can I tested your commit? I have clone your repositories as written http://samtools.github.io/bcftools/howtos/install.html , however it still calls the bcftools from the first download.

pd3 commented 1 year ago

If your clone is in /path/to/bcftools-git-clone and the compilation was successful, there will be a bcftools binary in that directory. You can run it as /path/to/bcftools-git-clone/bcftools.

marissa97 commented 1 year ago

Hello, now i am getting a new error message:

Parsing /home/marissa/Downloads/genomic_chr.gff3 ...
Warning: Ignoring transcript with unknown biotype .. chr1   BestRefSeq  transcript  11874   14409   .   +   .   ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
Warning: The GFF contains features with duplicate id .. chr1    RefSeq  cDNA_match  1588825 1588948 .   -   .   ID=f3da58d96790abfb3974185d5b339f6d;Target=NM_001787.3 92 215 +;consensus_splices=32;exon_identity=0.993858;for_remapping=2;gap_count=2;identity=0.907621;idty=1;matches=2751;num_ident=2751;num_mismatch=3;pct_coverage=90.8611;pct_coverage_hiqual=90.8611;pct_identity_gap=99.3858;pct_identity_ungap=99.8911;product_coverage=0.91323;rank=1;splices=32
Note: truncated transcript rna-NM_001306072.3 with incomplete CDS (this is very common)
Error: CDS overlap in the transcript rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994, is this intended (e.g. ribosomal slippage)?

       Use the --force option to override (at your own risk).

When i use --force:

Parsing /home/marissa/Downloads/genomic_chr.gff3 ...
Warning: Ignoring transcript with unknown biotype .. chr1   BestRefSeq  transcript  11874   14409   .   +   .   ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
Warning: The GFF contains features with duplicate id .. chr1    RefSeq  cDNA_match  1588825 1588948 .   -   .   ID=f3da58d96790abfb3974185d5b339f6d;Target=NM_001787.3 92 215 +;consensus_splices=32;exon_identity=0.993858;for_remapping=2;gap_count=2;identity=0.907621;idty=1;matches=2751;num_ident=2751;num_mismatch=3;pct_coverage=90.8611;pct_coverage_hiqual=90.8611;pct_identity_gap=99.3858;pct_identity_ungap=99.8911;product_coverage=0.91323;rank=1;splices=32
Note: truncated transcript rna-NM_001306072.3 with incomplete CDS (this is very common)
Warning: GFF contains overlapping CDS rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994.
Indexed 67150 transcripts, 767984 exons, 691224 CDSs, 0 UTRs
Warning: 33328 warnings were supressed, run with `--verbose 2` to see them all
Calling...
Error: the fasta reference does not match the VCF REF allele at chr1:111957501 .. fasta=C vcf=*

I used fasta ref from ncbi as well hg19. This is the position that cause the error (annotated using ensembl gff3): chr1 111957501 rs201350653 CTCACAGACTGATGACTCACAGGGG C 3027.6 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.273;DB;DP=71;ExcessHet=3.0103;FS=2.269;MLEAC=1;MLEAF=0.5;MQ=60.0;MQRankSum=0.0;QD=30.56;ReadPosRankSum=2.7;SOR=1.104;BCSQ=inframe_deletion|OVGP1|ENST00000369732|protein_coding|-|533TPVSHQSVS>518S|111957501CTCACAGACTGATGACTCACAGGGG>C;GI=OVGP1;TI=NM_002557;FC=Deletion_S533S GT:AD:DP:GQ:PL:VF:GQX:BCSQ 0/1:42,21:63:99:3035,0,6479:0.333333:99:3

pd3 commented 1 year ago

Unfortunately this does not give enough information, it also confuses me that you're showing an annotated line, so where the error comes from? Can you please provide a small test case, the reference, GFF and the VCF, command line, the exact bcftools version (bcftools --version), best in a separate issue?

marissa97 commented 1 year ago

Hallo, I tried to upload it here, but the test size is too big. Can i share it via email or something else ? Yeah, I understand the confusion. I tried to take the example on how it works with gff3 from ensemble.

pd3 commented 1 year ago

Best if you can provide a link for us to download? As a last resort you could use a temporary storage, for example somewhere like https://wetransfer.com/

marissa97 commented 1 year ago

Here is the link to the test.vcf, fasta refseq and gff3 we use: https://we.tl/t-plVJprFyWE

It works now with --force, but without there is still the same error: Error: CDS overlap in the transcript rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994, is this intended (e.g. ribosomal slippage)? Use the --force option to override (at your own risk).

gff3 files are gained from: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.25/ (download -> refseq only -> gff -> change the seqid to chr*)

bcftools used are the one from: /path/to/bcftools-git-clone/bcftools (version: bcftools 1.17-43-g875c067b)

pd3 commented 1 year ago

I've got the test data, thank you. Can you show also the exact command you are running and the error you are geting? For me this runs okay:

bcftools csq -g genomic_chr.gff3 -f hg19GenRef.fa test.vcf -o /dev/null -l 
marissa97 commented 1 year ago

The code is: bcftools/bcftools csq test.vcf -g genomic_chr.gff3 -f hg19GenRef.fa -o new_test.vcf (without -l)

I tested with -l, it produces the same error message I tried the exact code as yours, it produces the same error message as well

The error message: Parsing Documents/test_file/genomic_chr.gff3 ... Warning: Ignoring transcript with unknown biotype .. chr1 BestRefSeq transcript 11874 14409 . + . ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2 Warning: The GFF contains features with duplicate id .. chr1 RefSeq cDNA_match 1588825 1588948 . - . ID=f3da58d96790abfb3974185d5b339f6d;Target=NM_001787.3 92 215 +;consensus_splices=32;exon_identity=0.993858;for_remapping=2;gap_count=2;identity=0.907621;idty=1;matches=2751;num_ident=2751;num_mismatch=3;pct_coverage=90.8611;pct_coverage_hiqual=90.8611;pct_identity_gap=99.3858;pct_identity_ungap=99.8911;product_coverage=0.91323;rank=1;splices=32 Note: truncated transcript rna-NM_001306072.3 with incomplete CDS (this is very common) Error: CDS overlap in the transcript rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994, is this intended (e.g. ribosomal slippage)? Use the --force option to override (at your own risk).

pd3 commented 1 year ago

Okay, I misunderstood: you were getting an error only when the --force option was not given but it worked otherwise.

That was the expected behavior because at the time of writing the code I did not have enough GFFs to test overlapping coding exons within the same transcript. Therefore the program would exit with an error and give a hint for the user to check manually if this was due to a ribosomal slippage. (This is usually described in the GFF annotations.)

There is a new commit now https://github.com/samtools/bcftools/commit/a8249495a6b8360bb8a0e447da0cc0a372a98a46 which makes overlapping CDS into a warning instead of an error.

I consider this problem fully resolved now, please open a new issue if anything else crops up.

marissa97 commented 1 year ago

No, I get an error with and without --force as well

pd3 commented 1 year ago

Then please show what the output looks like. I only see the ouput without the --force option and I was not able to reproduce the error with your data myself.

Also above you say "It works now with --force, but without there is still the same error". I am very confused.

marissa97 commented 1 year ago

Is the new commit pushed? When i tried to pull, it said, that it is all up to date.

With --force appears warnings and the output is empty.

Warnings:

Parsing Documents/test_file/genomic_chr.gff3 ...
Warning: Ignoring transcript with unknown biotype .. chr1   BestRefSeq  transcript  11874   14409   .   +   .   ID=rna-NR_046018.2;Parent=gene-DDX11L1;Dbxref=GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102;Name=NR_046018.2;gbkey=misc_RNA;gene=DDX11L1;product=DEAD/H-box helicase 11 like 1 (pseudogene);pseudo=true;transcript_id=NR_046018.2
Warning: The GFF contains features with duplicate id .. chr1    RefSeq  cDNA_match  1588825 1588948 .   -   .   ID=f3da58d96790abfb3974185d5b339f6d;Target=NM_001787.3 92 215 +;consensus_splices=32;exon_identity=0.993858;for_remapping=2;gap_count=2;identity=0.907621;idty=1;matches=2751;num_ident=2751;num_mismatch=3;pct_coverage=90.8611;pct_coverage_hiqual=90.8611;pct_identity_gap=99.3858;pct_identity_ungap=99.8911;product_coverage=0.91323;rank=1;splices=32
Note: truncated transcript rna-NM_001306072.3 with incomplete CDS (this is very common)
Warning: GFF contains overlapping CDS rna-NM_001172437.2: 94292646-94293825 and 94293825-94294994.
Indexed 67150 transcripts, 767984 exons, 691224 CDSs, 0 UTRs
Warning: 33328 warnings were supressed, run with --verbose 2 to see them all
Calling...
Unphased heterozygous genotype at 1:111957501, sample test. See the --phase option.
pd3 commented 1 year ago

The program is complaining about unphased genotypes and asks to specify what to do with them

Unphased heterozygous genotype at 1:111957501, sample test. See the --phase option.

Hence the -l used in the command in my example above https://github.com/samtools/bcftools/issues/1898#issuecomment-1563962316.

Note I reformatted your comment so that the final line stands out, the github markdown ignores newlines otherwise.

marissa97 commented 1 year ago

It works now! Thank you for your cooperation and support!