openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0
365 stars 68 forks source link

A couple of errors encountered with version 2.3.0 (but not with 2.2.9) in `collect_all_installed_ensembl_releases` and `Genome.index` functions #299

Closed rraadd88 closed 6 months ago

rraadd88 commented 6 months ago

Hi,

1. When I ran collect_all_installed_ensembl_releases function i.e.

from pyensembl.shell import collect_all_installed_ensembl_releases
collect_all_installed_ensembl_releases()

I encountered following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 2
      1 from pyensembl.shell import collect_all_installed_ensembl_releases
----> 2 collect_all_installed_ensembl_releases()

File path/to//python3.9/site-packages/pyensembl/shell.py:157, in collect_all_installed_ensembl_releases()
    155 genomes = []
    156 for species, release in Species.all_species_release_pairs():
--> 157     genome = EnsemblRelease(release, species=species)
    158     if genome.required_local_files_exist():
    159         genomes.append(genome)

File path/to//python3.9/site-packages/pyensembl/ensembl_release.py:66, in EnsemblRelease.__init__(self, release, species, server)
     63 def __init__(
     64     self, release=MAX_ENSEMBL_RELEASE, species=human, server=ENSEMBL_FTP_SERVER
     65 ):
---> 66     self.release, self.species, self.server = self.normalize_init_values(
     67         release=release, species=species, server=server
     68     )
     70     self.gtf_url = make_gtf_url(
     71         ensembl_release=self.release, species=self.species, server=self.server
     72     )
     74     self.transcript_fasta_urls = [
     75         make_fasta_url(
     76             ensembl_release=self.release,
   (...)
     86         ),
     87     ]

File path/to//python3.9/site-packages/pyensembl/ensembl_release.py:38, in EnsemblRelease.normalize_init_values(cls, release, species, server)
     32 @classmethod
     33 def normalize_init_values(cls, release, species, server):
     34     """
     35     Normalizes the arguments which uniquely specify an EnsemblRelease
     36     genome.
     37     """
---> 38     release = check_release_number(release)
     39     species = check_species_object(species)
     40     return (release, species, server)

File path/to//python3.9/site-packages/pyensembl/ensembl_release_versions.py:28, in check_release_number(release)
     25     raise ValueError("Invalid Ensembl release: %s" % release)
     27 if release < MIN_ENSEMBL_RELEASE:
---> 28     raise ValueError(
     29         "Invalid Ensembl releases %d, must be greater than %d"
     30         % (release, MIN_ENSEMBL_RELEASE)
     31     )
     32 return release

ValueError: Invalid Ensembl releases 47, must be greater than 54

2. When I ran Genome.index function with custom files i.e.

from pyensembl import Genome
annots = Genome(
    reference_name='ref',
    annotation_name='ann',
    gtf_path_or_url=gtf_path,
    transcript_fasta_paths_or_urls=transcript_path,
    protein_fasta_paths_or_urls=protein_path,
)
# parse GTF and construct database of genomic features
annots.index()

I encountered following error:

INFO:pyensembl.database:Creating database: path/to//ann.db
INFO:pyensembl.database:Reading GTF from path/to//ann.gtf
INFO:root:Extracted GTF attributes: ['gene_id', 'transcript_id', 'exon_number', 'gene_name', 'gene_source', 'gene_biotype', 'transcript_source', 'transcript_biotype', 'exon_id', 'protein_id']
thread '<unnamed>' panicked at py-polars/src/map/series.rs:219:19:
python function failed ValueError: Invalid strand: 4
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File path/to//python3.9/site-packages/polars/expr/expr.py:4296, in Expr.map_elements.<locals>.wrap_f(x)
   4294 with warnings.catch_warnings():
   4295     warnings.simplefilter("ignore", PolarsInefficientMapWarning)
-> 4296     return x.map_elements(
   4297         function, return_dtype=return_dtype, skip_nulls=skip_nulls
   4298     )

File path/to//python3.9/site-packages/polars/series/series.py:5262, in Series.map_elements(self, function, return_dtype, skip_nulls)
   5258     pl_return_dtype = py_type_to_dtype(return_dtype)
   5260 warn_on_inefficient_map(function, columns=[self.name], map_target="series")
   5261 return self._from_pyseries(
-> 5262     self._s.apply_lambda(function, pl_return_dtype, skip_nulls)
   5263 )

PanicException: python function failed ValueError: Invalid strand: 4
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[7], line 10
      2 annots = Genome(
      3     reference_name='ref',
      4     annotation_name='ann',
   (...)
      7     protein_fasta_paths_or_urls=protein_path,
      8 )
      9 # parse GTF and construct database of genomic features
---> 10 annots.index()

File path/to//python3.9/site-packages/pyensembl/genome.py:280, in Genome.index(self, overwrite)
    274 """
    275 Assuming that all necessary data for this Genome has been downloaded,
    276 generate the GTF database and save efficient representation of
    277 FASTA sequence files.
    278 """
    279 if self.requires_gtf:
--> 280     self.db.connect_or_create(overwrite=overwrite)
    281 if self.requires_transcript_fasta:
    282     self.transcript_sequences.index(overwrite=overwrite)

File path/to//python3.9/site-packages/pyensembl/database.py:284, in Database.connect_or_create(self, overwrite)
    282     return connection
    283 else:
--> 284     return self.create(overwrite=overwrite)

File path/to//python3.9/site-packages/pyensembl/database.py:206, in Database.create(self, overwrite)
    203 logger.info("Creating database: %s", self.local_db_path)
    204 datacache.ensure_dir(self.cache_directory_path)
--> 206 df = self._load_gtf_as_dataframe(
    207     usecols=self.restrict_gtf_columns, features=self.restrict_gtf_features
    208 )
    209 all_index_groups = self._all_possible_indices(df.columns)
    211 if self.restrict_gtf_features:

File path/to//python3.9/site-packages/pyensembl/database.py:611, in Database._load_gtf_as_dataframe(self, usecols, features)
    607 """
    608 Parse this genome source's GTF file and load it as a Pandas DataFrame
    609 """
    610 logger.info("Reading GTF from %s", self.gtf_path)
--> 611 df = read_gtf(
    612     self.gtf_path,
    613     column_converters={
    614         "seqname": normalize_chromosome,
    615         "strand": normalize_strand,
    616     },
    617     infer_biotype_column=True,
    618     usecols=usecols,
    619     features=features,
    620 )
    622 column_names = set(df.keys())
    623 expect_gene_feature = features is None or "gene" in features

File path/to//python3.9/site-packages/gtfparse/read_gtf.py:261, in read_gtf(filepath_or_buffer, expand_attribute_column, infer_biotype_column, column_converters, usecols, features, result_type)
    258 else:
    259     result_df = parse_gtf(result_df, features=features)
--> 261 result_df = result_df.with_columns(
    262     [
    263         polars.col(column_name).map_elements(lambda x: column_type(x) if len(x) > 0 else None)
    264         for column_name, column_type in column_converters.items()
    265     ]
    266 )
    268 # Hackishly infer whether the values in the 'source' column of this GTF
    269 # are actually representing a biotype by checking for the most common
    270 # gene_biotype and transcript_biotype value 'protein_coding'
    271 if infer_biotype_column:

File path/to//python3.9/site-packages/polars/dataframe/frame.py:8235, in DataFrame.with_columns(self, *exprs, **named_exprs)
   8088 def with_columns(
   8089     self,
   8090     *exprs: IntoExpr | Iterable[IntoExpr],
   8091     **named_exprs: IntoExpr,
   8092 ) -> DataFrame:
   8093     """
   8094     Add columns to this DataFrame.
   8095 
   (...)
   8233 
   8234     """
-> 8235     return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)

File path/to//python3.9/site-packages/polars/lazyframe/frame.py:1749, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager)
   1746 if background:
   1747     return InProcessQuery(ldf.collect_concurrently())
-> 1749 return wrap_df(ldf.collect())

PanicException: python function failed ValueError: Invalid strand: 4

The gtf file's lines look like this:

4   FlyBase gene    879 5039    .   +   .   "gene_id ""FBgn0267363""; gene_name ""JYalpha""; gene_source ""FlyBase""; gene_biotype ""protein_coding"";"
4   FlyBase transcript  879 5039    .   +   .   "gene_id ""FBgn0267363""; transcript_id ""FBtr0346692""; gene_name ""JYalpha""; gene_source ""FlyBase""; gene_biotype ""protein_coding""; transcript_source ""FlyBase""; transcript_biotype ""protein_coding"";"
4   FlyBase exon    879 1079    .   +   .   "gene_id ""FBgn0267363""; transcript_id ""FBtr0346692""; exon_number ""1""; gene_name ""JYalpha""; gene_source ""FlyBase""; gene_biotype ""protein_coding""; transcript_source ""FlyBase""; transcript_biotype ""protein_coding""; exon_id ""FBtr0346692-E1"";"
4   FlyBase CDS 930 1079    .   +   0   "gene_id ""FBgn0267363""; transcript_id ""FBtr0346692""; exon_number ""1""; gene_name ""JYalpha""; gene_source ""FlyBase""; gene_biotype ""protein_coding""; transcript_source ""FlyBase""; transcript_biotype ""protein_coding""; protein_id ""FBpp0312297"";"
4   FlyBase start_codon 930 932 .   +   0   "gene_id ""FBgn0267363""; transcript_id ""FBtr0346692""; exon_number ""1""; gene_name ""JYalpha""; gene_source ""FlyBase""; gene_biotype ""protein_coding""; transcript_source ""FlyBase""; transcript_biotype ""protein_coding"";"

I hope this information could be of help in resolving the relevant issues.

iskandr commented 6 months ago

Apologies, I didn't realize I had broken things. Working on adding some more unit tests and fixing whatever went wrong.

iskandr commented 6 months ago

Your custom GTF error should be fixed by https://github.com/openvax/pyensembl/commit/42bec8e05c2119a01ae02c2ab55bec1fbeca36ff

iskandr commented 6 months ago

OK, both errors should now be fixed. Sorry for breaking things and I'll try to get continuous integration working again (switching to GitHub actions from Travis for all OpenVax repos is taking a while)