Open gokceneraslan opened 4 years ago
Thanks for pointing this out. Do you think it's OK to just include the scaffold names as if they were chromosomes or do they need further special treatment?
Also, do you think that their inclusion in the results of genes() should be the default or optional?
I think it's fine to include them as if they are chromosome names. I'd expect genes() to return all genes by default.
So it looks like the chr_patch_hapl_scaff files started getting added in release 82 (https://ftp.ensembl.org/pub/release-82/gtf/homo_sapiens/) but aren't present for release 81 or earlier. I guess I could make the URL building logic know to use .gtf.gz for older releases and .chr_patch_hapl_scaff.gtf.gz for newer ones.
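For illustration, a release-specific version of the URL building could look roughly like the sketch below. This is not pyensembl's actual code: gtf_filename, gtf_url, and FIRST_RELEASE_WITH_PATCH_GTF are hypothetical names, and the capitalized species prefix in the filename is an assumption based on the files in the Ensembl directory listings.

```python
# Hypothetical helpers for illustration; not pyensembl's real URL-building code.
ENSEMBL_FTP = "https://ftp.ensembl.org/pub"

# Assumption taken from this thread: the chr_patch_hapl_scaff files
# first appear in release 82 and are absent in release 81 and earlier.
FIRST_RELEASE_WITH_PATCH_GTF = 82


def gtf_filename(species, reference, release):
    """Return the GTF filename to request for a given Ensembl release."""
    if release >= FIRST_RELEASE_WITH_PATCH_GTF:
        suffix = "chr_patch_hapl_scaff.gtf.gz"
    else:
        suffix = "gtf.gz"
    # Assumed naming: first letter of the species is capitalized in the filename.
    return f"{species.capitalize()}.{reference}.{release}.{suffix}"


def gtf_url(species, reference, release):
    """Return the full download URL for the chosen GTF file."""
    filename = gtf_filename(species, reference, release)
    return f"{ENSEMBL_FTP}/release-{release}/gtf/{species.lower()}/{filename}"
```

With this, gtf_url("homo_sapiens", "GRCh38", 97) would point at the chr_patch_hapl_scaff file, while release 81 would still resolve to the plain .gtf.gz.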
Or you can check whether the file exists and fall back to the original GTF if it doesn't, without making the code release-specific.
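A minimal sketch of that existence-check approach, assuming the server answers HTTP HEAD requests for these files. url_exists and pick_gtf_url are hypothetical names, not part of pyensembl; only urllib from the standard library is used.

```python
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request for `url` succeeds."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        # Note: any network problem is treated the same as "file not there".
        return False


def pick_gtf_url(plain_gtf_url, exists=url_exists):
    """Prefer the chr_patch_hapl_scaff GTF; fall back to the plain GTF."""
    suffix = ".gtf.gz"
    if not plain_gtf_url.endswith(suffix):
        raise ValueError(f"expected a URL ending in {suffix}: {plain_gtf_url}")
    extended = plain_gtf_url[: -len(suffix)] + ".chr_patch_hapl_scaff.gtf.gz"
    return extended if exists(extended) else plain_gtf_url
```

The exists argument is injectable so the selection logic can be exercised without network access. One caveat of this design is visible in url_exists: a transient network error silently selects the plain GTF.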
I would say it is desirable to have release-specific logic in the code, since the layout of the database releases can change more than once, which can result in a "failure != failure" scenario where not every failed lookup means the same thing.
Another reason to hardcode release-specific path patterns is that the paths of published releases are not expected to change, so the hardcoded patterns also serve as documentation.
The last point would be performance: a try-except block might be slightly faster, but loading a file is much slower than a case switch and/or path generation, so I expect the performance difference between the selection methods to be negligible.
I use pyensembl for gene id/name mapping, but lately I noticed that some Ensembl gene ids (e.g. ENSG00000285395) are missing in pyensembl.EnsemblRelease(97).genes(). I think that's because pyensembl is using the {species}.{reference}.{release}.gtf.gz GTF URL template instead of {species}.{reference}.{release}.chr_patch_hapl_scaff.gtf.gz (see https://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/ for comparison). I understand that this file includes genes that are not mapped to chromosomes, so it might be problematic in the context of pyensembl, but still, it'd be more complete to include all the genes.
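For anyone who wants to check their own id lists, here is a small sketch of how the missing ids can be found. missing_gene_ids is a hypothetical helper; it only assumes that the object passed in has a genes() method returning objects with a gene_id attribute, as pyensembl's EnsemblRelease does.

```python
def missing_gene_ids(genome, gene_ids):
    """Return the ids from `gene_ids` that `genome.genes()` does not contain."""
    known = {gene.gene_id for gene in genome.genes()}
    return [gene_id for gene_id in gene_ids if gene_id not in known]


# Usage with pyensembl (requires the release data to be downloaded first):
#
#     from pyensembl import EnsemblRelease
#     missing_gene_ids(EnsemblRelease(97), ["ENSG00000285395"])
```

Input order is preserved, so the result can be compared directly against the original list.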