shenkers / isoscm

Transcript assembly tool using multiple change-point inference to improve 3'UTR annotation

Diff error using ensembl GTF #18

Closed mplass closed 8 years ago

mplass commented 8 years ago

Hi, I'm trying to compare the results from IsoSCM's compare step to the Ensembl annotation (Ensembl78). The program works perfectly fine when I use one of the GTF files produced by IsoSCM, so it seems to be a formatting issue with the Ensembl GTF file. However, I can't find the GTF format specifications anywhere.

Executed command:

java -Xmx2048m -jar IsoSCM-2.0.11.jar diff -x compare_parameters.xml -G Homo_sapiens.GRCh38.78.chr.gtf

Error:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8
    at tools.ParseGTF$TranscriptIterator.next(ParseGTF.java:280)
    at tools.ParseGTF$TranscriptIterator.next(ParseGTF.java:1)
    at tools.GTFTools$AnnotationParser.<init>(GTFTools.java:74)
    at processing.DiffReference.diff(DiffReference.java:48)
    at executable.IsoSCM.main(IsoSCM.java:622)

Thanks!

shenkers commented 8 years ago

Yes, this seems to be a formatting issue. The columns from the Ensembl GTFs look like this (numbered from zero):

0 chrGL000213.1
1 protein_coding
2 exon
3 138767
4 139339
5 .
6 -
7 .
8 gene_id "ENSG00000237375"; transcript_id "ENST00000327822"; exon_number "1"; gene_name "BX072566.1"; gene_biotype "protein_coding"; transcript_name "BX072566.1-201";

In terms of format, it's the last column that varies between the different groups that produce GTF files. IsoSCM expects attributes to be separated by semicolons, the attribute id to be separated from the attribute value by a space, and the attribute values to be quoted. IsoSCM will only analyze features that have both a gene_id and a transcript_id attribute.
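To make the attribute format described above concrete, here is a minimal sketch of a parser for the last column. This is not IsoSCM's actual code: the class name `GtfAttributes` and both method names are hypothetical, and the sketch assumes attribute values never contain a semicolon.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of IsoSCM.
public class GtfAttributes {

    // Parses an attribute column of the form: key "value"; key "value";
    // Assumes values do not themselves contain ';'.
    public static Map<String, String> parse(String column) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String entry : column.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty fragments between separators
            }
            int space = trimmed.indexOf(' ');
            if (space < 0) {
                continue; // attribute id without a value; ignored in this sketch
            }
            String key = trimmed.substring(0, space);
            String value = trimmed.substring(space + 1).trim();
            // Strip the surrounding double quotes if present.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            attributes.put(key, value);
        }
        return attributes;
    }

    // True when both ids needed for a feature to be analyzed are present.
    public static boolean hasRequiredIds(Map<String, String> attributes) {
        return attributes.containsKey("gene_id") && attributes.containsKey("transcript_id");
    }
}
```

Calling `GtfAttributes.parse` on column 8 of the example line should give a map with `gene_id` mapped to `ENSG00000237375` and `transcript_id` mapped to `ENST00000327822`, so `hasRequiredIds` would return true for it.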

The error message says that the parser reached a line in the GTF file that doesn't have the last column. I usually download the GTFs from this site: http://useast.ensembl.org/info/data/ftp/index.html

Is this where you downloaded the Ensembl78 GTF from?
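For illustration, the `ArrayIndexOutOfBoundsException: 8` is what you would see if a line were split into columns and column 8 read without first checking the column count. The sketch below assumes tab-separated columns; the class `GtfColumns` and its method are hypothetical and not taken from ParseGTF.java.

```java
// Hypothetical helper for illustration; not part of IsoSCM.
public class GtfColumns {

    // A GTF feature line has nine tab-separated columns (indices 0 to 8).
    static final int EXPECTED_COLUMNS = 9;

    // Returns the attribute column (index 8), or null when the line has
    // too few columns. Reading fields[8] without this check would throw
    // ArrayIndexOutOfBoundsException for a short line.
    public static String attributeColumn(String line) {
        String[] fields = line.split("\t");
        if (fields.length < EXPECTED_COLUMNS) {
            return null;
        }
        return fields[8];
    }
}
```

A line with no tabs at all splits into a single field, so `attributeColumn` returns null for it instead of failing.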

mplass commented 8 years ago

I just downloaded it from the Ensembl FTP site: http://www.ensembl.org/info/data/ftp/index.html

shenkers commented 8 years ago

Ah, I think it might be thrown off by the meta-data at the top. Do you still get the error if you delete these lines from the top of the file?

#!genome-build GRCh38.p5
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.20
#!genebuild-last-updated 2015-10

mplass commented 8 years ago

You are right! Now it works fine.

Thanks.

shenkers commented 8 years ago

Great, I'll add an update so that these comment lines are ignored instead of causing an error.
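Until that update is available, the same effect can be had by filtering the file before passing it to IsoSCM. A minimal sketch of such a filter follows, assuming the metadata lines begin with `#` as GTF comment lines conventionally do. The class `GtfCommentFilter` and its methods are hypothetical, and this is not the actual change made to IsoSCM.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the actual IsoSCM fix.
public class GtfCommentFilter {

    // A data line is any non-blank line that does not start with '#'.
    // Assumption: header and metadata lines start with '#'.
    public static boolean isDataLine(String line) {
        String trimmed = line.trim();
        return !trimmed.isEmpty() && !trimmed.startsWith("#");
    }

    // Keeps only the data lines, preserving their original order.
    public static List<String> dataLines(List<String> lines) {
        return lines.stream()
                .filter(GtfCommentFilter::isDataLine)
                .collect(Collectors.toList());
    }
}
```

The lines of a GTF file could be read with `java.nio.file.Files.readAllLines` and passed through `dataLines`, so that only feature lines reach the column parser.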