nf-core / test-datasets

Test data to be used for automated testing with the nf-core pipelines
https://nf-co.re
MIT License

error in gff file (rnaseq branch test-datasets/reference/genes.gff) #117

Open Juke34 opened 4 years ago

Juke34 commented 4 years ago

We found a problem in the GFF file you provide as test data.

I   ensembl transcript  335 649 .   +   .   ID=YAL069W;Parent=YAL069W;geneID=YAL069W;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129
I   ensembl exon    335 649 .   +   .   Parent=YAL069W;exon_id=YAL069W.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129
I   ensembl CDS 335 649 .   +   0   Parent=YAL069W;exon_number=1;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;protein_id=YAL069W;protein_version=1;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129

The ID and Parent attributes of the transcript features have the same value. This is not allowed by the GFF3 specification.

We use AGAT, which handles this problem by automatically updating the parent ID so that it is unique. Using this file to test or build pipelines might be problematic, so it should be updated.
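For anyone who wants to check their own copy of the file, here is a minimal sketch that lists features whose ID also appears in their own Parent attribute. It assumes a tab-delimited GFF3 file with nine columns; the function names `parse_attributes` and `self_parented_features` are made up for this example and are not part of AGAT or any other tool.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value pairs joined by ';') into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def self_parented_features(lines):
    """Yield (line_number, feature_type, id) for features listing their own ID as Parent.

    Assumption: columns are tab-separated; comment, blank, and malformed lines are skipped.
    """
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue
        attrs = parse_attributes(columns[8])
        feature_id = attrs.get("ID")
        # Parent may hold several comma-separated IDs.
        parents = attrs.get("Parent", "").split(",")
        if feature_id and feature_id in parents:
            yield number, columns[2], feature_id


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_self_parent.py genes.gff
    with open(sys.argv[1]) as handle:
        for number, feature_type, feature_id in self_parented_features(handle):
            print(f"line {number}: {feature_type} {feature_id} is its own Parent")
```

On the three lines quoted above, only the transcript line would be reported, since the exon and CDS lines carry no ID attribute.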

Juke34 commented 4 years ago

I can provide you with a fixed version of this file if you wish.

Juke34 commented 4 years ago

We also found an odd transcript at the end of the file whose CDS is only 3 nucleotides long.

I   ensembl transcript  224563  224862  .   -   .   ID=YAR070C;Parent=YAR070C;geneID=YAR070C;gene_biotype=protein_coding;gene_name=YAR070C;gene_source=ensembl;gene_version=1;p_id=P48;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS435
I   ensembl exon    224563  224862  .   -   .   Parent=YAR070C;exon_id=YAR070C.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=YAR070C;gene_source=ensembl;gene_version=1;p_id=P48;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS435
I   ensembl CDS 224563  224565  .   -   0   Parent=YAR070C;exon_number=1;gene_biotype=protein_coding;gene_name=YAR070C;gene_source=ensembl;gene_version=1;p_id=P48;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS435

Is this expected? It looks wrong. Maybe it should be removed.
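To see whether other transcripts have the same issue, a rough sketch like the following sums the CDS length per Parent and reports the very short ones. It assumes a tab-delimited file and 1-based inclusive coordinates (so length is end - start + 1); the function names and the `min_length` threshold of 6 are arbitrary choices for illustration, not an established cutoff.

```python
from collections import defaultdict


def cds_length_by_parent(lines):
    """Return {parent_id: total CDS length} summed over all CDS lines."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "CDS":
            continue
        start, end = int(columns[3]), int(columns[4])
        for pair in columns[8].split(";"):
            if pair.startswith("Parent="):
                # A CDS segment may list several comma-separated parents.
                for parent in pair[len("Parent="):].split(","):
                    totals[parent] += end - start + 1
    return dict(totals)


def short_cds(lines, min_length=6):
    """Return parents whose total CDS length is below min_length (hypothetical threshold)."""
    lengths = cds_length_by_parent(lines)
    return {parent: n for parent, n in lengths.items() if n < min_length}
```

With the coordinates quoted above, YAR070C comes out at 224565 - 224563 + 1 = 3 nucleotides, while YAL069W comes out at 649 - 335 + 1 = 315.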