openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Missing scaffold genes #233

Open gokceneraslan opened 4 years ago

gokceneraslan commented 4 years ago

I use pyensembl for gene ID/name mapping, but lately I noticed that some Ensembl gene IDs (e.g. ENSG00000285395) are missing from pyensembl.EnsemblRelease(97).genes().

I think that's because pyensembl uses the {species}.{reference}.{release}.gtf.gz GTF URL template instead of {species}.{reference}.{release}.chr_patch_hapl_scaff.gtf.gz (see https://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/ for comparison).

I understand that this file includes genes that are not mapped to chromosomes, so it might be problematic in the context of pyensembl, but still, it'd be more complete to include all the genes.

iskandr commented 4 years ago

Thanks for pointing this out. Do you think it's OK to just include the scaffold names as if they were chromosomes or do they need further special treatment?

Also, do you think that their inclusion in the results of genes() should be the default or optional?

gokceneraslan commented 4 years ago

I think it's fine to include them as if they are chromosome names. I'd expect genes() to return all genes by default.

iskandr commented 4 years ago

So it looks like the chr_patch_hapl_scaff files started getting added in release 82 (https://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/) but aren't present for release 81 or earlier. I guess I could make the URL building logic use .gtf.gz for older releases and .chr_patch_hapl_scaff.gtf.gz for newer ones.
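
A minimal sketch of that release-based switch might look like the following. The function name `gtf_filename` and the constant are hypothetical and not taken from pyensembl's code; the cutoff of 82 is an assumption based on the FTP listings mentioned above.

```python
# Hypothetical helper, not part of pyensembl's actual API.
# Assumption: scaffold/patch GTF files exist from release 82 onward.
FIRST_RELEASE_WITH_SCAFFOLD_GTF = 82


def gtf_filename(species, reference, release):
    """Return the GTF filename to request for a given Ensembl release."""
    if release >= FIRST_RELEASE_WITH_SCAFFOLD_GTF:
        # Newer releases: prefer the file that also covers patches,
        # haplotypes and scaffolds.
        suffix = "chr_patch_hapl_scaff.gtf.gz"
    else:
        # Older releases only ship the plain GTF.
        suffix = "gtf.gz"
    return f"{species}.{reference}.{release}.{suffix}"


print(gtf_filename("Homo_sapiens", "GRCh38", 97))
print(gtf_filename("Homo_sapiens", "GRCh38", 81))
```

Keeping the cutoff in a single named constant means the release boundary is stated in exactly one place.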

gokceneraslan commented 4 years ago

Or you can check whether it exists and fall back to the original GTF if not, without making the code release-specific.
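
Roughly, the existence check could be sketched like this. The names `url_exists` and `pick_gtf_url` are made up for illustration and are not pyensembl functions, and the sketch assumes the server answers HTTP HEAD requests. The check is passed in as a parameter so the selection logic can be tried without network access.

```python
import urllib.error
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request to `url` succeeds (assumes HTTP(S))."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 404 as well as connection problems.
        return False


def pick_gtf_url(base_url, species, reference, release, exists=url_exists):
    """Prefer the scaffold-inclusive GTF, fall back to the plain one."""
    prefix = f"{base_url}/{species}.{reference}.{release}"
    preferred = f"{prefix}.chr_patch_hapl_scaff.gtf.gz"
    if exists(preferred):
        return preferred
    return f"{prefix}.gtf.gz"


# Offline demonstration: a stub set stands in for the real server listing.
available = {
    "https://example.org/gtf/Homo_sapiens.GRCh38.97.chr_patch_hapl_scaff.gtf.gz",
}
print(pick_gtf_url("https://example.org/gtf", "Homo_sapiens", "GRCh38", 97,
                   exists=available.__contains__))
print(pick_gtf_url("https://example.org/gtf", "Homo_sapiens", "GRCh38", 81,
                   exists=available.__contains__))
```

Note that with this approach a temporary network failure looks the same as a missing file, so the plain GTF would be picked silently in both cases.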

fabianegli commented 4 years ago

I would say it is desirable to have specific logic in the code to cope with release versions of databases, since they can change more than once, which can result in a failure != failure scenario.

Another reason to hardcode database-version-specific path patterns is that they are not expected to change, so the hardcoded patterns also serve as documentation.

The last point is performance: a try-except block might be slightly faster, but loading a file is much slower than a case switch and/or path generation, so I expect the performance difference between the selection methods to be negligible.