error in gff file (rnaseq branch test-datasets/reference/genes.gff)

Juke34 commented 4 years ago

We found a problem in the GFF file you provide as test data:

Branch: rnaseq
File: test-datasets/reference/genes.gff

I   ensembl transcript  335 649 .   +   .   ID=YAL069W;Parent=YAL069W;geneID=YAL069W;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129
I   ensembl exon    335 649 .   +   .   Parent=YAL069W;exon_id=YAL069W.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129
I   ensembl CDS 335 649 .   +   0   Parent=YAL069W;exon_number=1;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;protein_id=YAL069W;protein_version=1;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129

The ID and Parent attributes of each transcript feature have the same value, so every transcript names itself as its own parent. This is not allowed by the GFF3 specification.

We use AGAT, which deals with this problem by automatically updating the parent ID to be unique. Using this file to test or build pipelines might be problematic, so it should be updated.
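For anyone who wants to check their own copy of the file, here is a minimal sketch of how the self-parent problem could be detected. It is not how AGAT does it; the function names `parse_attributes` and `self_parented_features` are made up for illustration, and the sample records are shortened versions of the lines above.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value pairs joined by ';') into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def self_parented_features(lines):
    """Return (line_number, feature_type, ID) for features listed as their own Parent."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            # Assumption: tolerate copies where tabs were flattened to spaces
            fields = line.split(None, 8)
        if len(fields) < 9:
            continue  # not a feature line
        attrs = parse_attributes(fields[8])
        feature_id = attrs.get("ID")
        parents = attrs.get("Parent", "").split(",")  # Parent may hold several IDs
        if feature_id is not None and feature_id in parents:
            hits.append((number, fields[2], feature_id))
    return hits


# Shortened sample records (most attributes dropped for readability)
sample = [
    "\t".join(["I", "ensembl", "transcript", "335", "649", ".", "+", ".",
               "ID=YAL069W;Parent=YAL069W;geneID=YAL069W"]),
    "\t".join(["I", "ensembl", "exon", "335", "649", ".", "+", ".",
               "Parent=YAL069W;exon_id=YAL069W.1"]),
]
print(self_parented_features(sample))  # [(1, 'transcript', 'YAL069W')]
```

To run it on the real file, pass an open file handle instead of `sample`, for example `self_parented_features(open("genes.gff"))`.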

Juke34 commented 4 years ago

I can provide a fixed version of this file if you wish.

Juke34 commented 4 years ago

We also found an odd transcript at the end of the file whose CDS is only 3 nucleotides long (224563 to 224565), although the transcript and its exon span 224563 to 224862:

I   ensembl transcript  224563  224862  .   -   .   ID=YAR070C;Parent=YAR070C;geneID=YAR070C;gene_biotype=protein_coding;gene_name=YAR070C;gene_source=ensembl;gene_version=1;p_id=P48;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS435
I   ensembl exon    224563  224862  .   -   .   Parent=YAR070C;exon_id=YAR070C.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=YAR070C;gene_source=ensembl;gene_version=1;p_id=P48;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS435
I   ensembl CDS 224563  224565  .   -   0   Parent=YAR070C;exon_number=1;gene_biotype=protein_coding;gene_name=YAR070C;gene_source=ensembl;gene_version=1;p_id=P48;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS435

Is this normal? It looks wrong to me. Maybe the record should be removed.
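To see whether other records have the same issue, a rough sketch like the following could sum the CDS length per parent and flag very short ones. The function names and the `min_length` threshold of 6 nucleotides are arbitrary choices for illustration, not part of any specification; GFF3 coordinates are 1-based and inclusive, so a segment's length is `end - start + 1`.

```python
from collections import defaultdict


def cds_length_by_parent(lines):
    """Sum CDS segment lengths per Parent ID."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            # Assumption: tolerate copies where tabs were flattened to spaces
            fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        start, end = int(fields[3]), int(fields[4])
        for pair in fields[8].strip().split(";"):
            if pair.startswith("Parent="):
                # A CDS may list several parents separated by commas
                for parent in pair[len("Parent="):].split(","):
                    totals[parent] += end - start + 1
    return dict(totals)


def short_cds(lines, min_length=6):
    """Return {parent: total CDS length} for parents below min_length (hypothetical cutoff)."""
    return {p: n for p, n in cds_length_by_parent(lines).items() if n < min_length}


# Shortened sample records (most attributes dropped for readability)
sample = [
    "\t".join(["I", "ensembl", "CDS", "335", "649", ".", "+", "0", "Parent=YAL069W"]),
    "\t".join(["I", "ensembl", "CDS", "224563", "224565", ".", "-", "0", "Parent=YAR070C"]),
]
print(short_cds(sample))  # {'YAR070C': 3}
```

A hit from this check only means the record deserves a closer look; whether it should be removed is still a judgment call.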

nf-core / test-datasets

error in gff file (rnaseq branch test-datasets/reference/genes.gff) #117