openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Polars < 0.17.0 required #291

Closed dweemx closed 9 months ago

dweemx commented 9 months ago

Hello,

I'm trying to install Ensembl release 107 for mouse:

$ pyensembl install --release 107 --species mus_musculus
2023-09-19 16:48:07,049 - pyensembl.shell - INFO - Running 'install' for EnsemblRelease(release=107, species='mus_musculus')
2023-09-19 16:48:07,050 - pyensembl.database - INFO - Creating database: ~/.cache/pyensembl/GRCm39/ensembl107/Mus_musculus.GRCm39.107.gtf.db
2023-09-19 16:48:07,050 - pyensembl.database - INFO - Reading GTF from ~/.cache/pyensembl/GRCm39/ensembl107/Mus_musculus.GRCm39.107.gtf.gz
Traceback (most recent call last):
  File "~/envs/pyensembl/bin/pyensembl", line 10, in <module>
    sys.exit(run())
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/shell.py", line 272, in run
    genome.index(overwrite=args.overwrite)
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/genome.py", line 280, in index
    self.db.connect_or_create(overwrite=overwrite)
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/database.py", line 284, in connect_or_create
    return self.create(overwrite=overwrite)
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/database.py", line 206, in create
    df = self._load_gtf_as_dataframe(
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/database.py", line 611, in _load_gtf_as_dataframe
    df = read_gtf(
  File "~/envs/pyensembl/lib/python3.10/site-packages/gtfparse/read_gtf.py", line 254, in read_gtf
    result_df = parse_gtf_and_expand_attributes(
  File "~/envs/pyensembl/lib/python3.10/site-packages/gtfparse/read_gtf.py", line 189, in parse_gtf_and_expand_attributes
    df = parse_gtf(
  File "~/envs/pyensembl/lib/python3.10/site-packages/gtfparse/read_gtf.py", line 155, in parse_gtf
    df_lazy = parse_with_polars_lazy(
  File "~/envs/pyensembl/lib/python3.10/site-packages/gtfparse/read_gtf.py", line 87, in parse_with_polars_lazy
    polars.toggle_string_cache(True)
AttributeError: module 'polars' has no attribute 'toggle_string_cache'. Did you mean: 'enable_string_cache'?

It seems that polars 0.17.0 introduced a breaking change affecting this attribute: https://github.com/pola-rs/polars/releases/tag/py-0.17.0
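For anyone hitting the same AttributeError, one possible workaround on the calling side is a small compatibility shim that uses whichever string-cache function the installed module exposes. This is only a sketch, not the fix adopted in gtfparse: the helper name enable_string_cache_compat is hypothetical, the no-argument call to enable_string_cache is an assumption about newer polars releases, and the stand-in objects exist only so the example runs without polars installed.

```python
from types import SimpleNamespace


def enable_string_cache_compat(pl_module):
    """Turn on the string cache via whichever function pl_module exposes.

    Returns the name of the function that was used.
    """
    enable = getattr(pl_module, "enable_string_cache", None)
    if enable is not None:
        try:
            enable()  # assumption: newer signature takes no argument
        except TypeError:
            enable(True)  # assumption: older signature requires a bool
        return "enable_string_cache"

    toggle = getattr(pl_module, "toggle_string_cache", None)
    if toggle is not None:
        toggle(True)  # the call that gtfparse makes in the traceback above
        return "toggle_string_cache"

    raise AttributeError("no known string-cache function found")


# Stand-ins for two different polars versions (not the real library).
calls = []
old_style = SimpleNamespace(
    toggle_string_cache=lambda flag: calls.append(("toggle", flag))
)
new_style = SimpleNamespace(
    enable_string_cache=lambda: calls.append(("enable",))
)

used_old = enable_string_cache_compat(old_style)
used_new = enable_string_cache_compat(new_style)
```

With the real library, the call would be enable_string_cache_compat(polars) after import polars.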

Downgrading polars to 0.16.8 fixes this, but rerunning the command then leads to another error:

$ pyensembl install --release 107 --species mus_musculus
2023-09-19 16:52:20,886 - pyensembl.shell - INFO - Running 'install' for EnsemblRelease(release=107, species='mus_musculus')
2023-09-19 16:52:20,887 - pyensembl.database - INFO - Creating database: ~/.cache/pyensembl/GRCm39/ensembl107/Mus_musculus.GRCm39.107.gtf.db
2023-09-19 16:52:20,887 - pyensembl.database - INFO - Reading GTF from ~/.cache/pyensembl/GRCm39/ensembl107/Mus_musculus.GRCm39.107.gtf.gz
Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance.
thread '<unnamed>' panicked at 'python function failed ValueError: Invalid strand: 2', src/apply/series.rs:219:19
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
Traceback (most recent call last):
  File "~/envs/pyensembl/lib/python3.10/site-packages/polars/internals/expr/expr.py", line 3277, in wrap_f
    return x.apply(
  File "~/envs/pyensembl/lib/python3.10/site-packages/polars/utils.py", line 433, in wrapper
    return function(*args, **kwargs)
  File "~/envs/pyensembl/lib/python3.10/site-packages/polars/utils.py", line 498, in wrapper
    return function(*args, **kwargs)
  File "~/envs/pyensembl/lib/python3.10/site-packages/polars/internals/series/series.py", line 3736, in apply
    return wrap_s(self._s.apply_lambda(function, pl_return_dtype, skip_nulls))
pyo3_runtime.PanicException: python function failed ValueError: Invalid strand: 2
Traceback (most recent call last):
  File "~/envs/pyensembl/bin/pyensembl", line 10, in <module>
    sys.exit(run())
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/shell.py", line 272, in run
    genome.index(overwrite=args.overwrite)
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/genome.py", line 280, in index
    self.db.connect_or_create(overwrite=overwrite)
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/database.py", line 284, in connect_or_create
    return self.create(overwrite=overwrite)
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/database.py", line 206, in create
    df = self._load_gtf_as_dataframe(
  File "~/envs/pyensembl/lib/python3.10/site-packages/pyensembl/database.py", line 611, in _load_gtf_as_dataframe
    df = read_gtf(
  File "~/envs/pyensembl/lib/python3.10/site-packages/gtfparse/read_gtf.py", line 261, in read_gtf
    result_df = result_df.with_columns(
  File "~/envs/pyensembl/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 6122, in with_columns
    self.lazy()
  File "~/envs/pyensembl/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 1160, in collect
    return pli.wrap_df(ldf.collect())
pyo3_runtime.PanicException: Unwrapped panic from Python code

I have narrowed down this issue to the following line in the code:

https://github.com/openvax/pyensembl/blob/d178ac926b7329b9f9d81574ecf2b17e554516c5/pyensembl/database.py#L615

Commenting out this line resolves the issue, but I'm not sure what exactly in normalize_strand is causing it.
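One way to check whether the unexpected value comes from the GTF file itself or from the parsing step is to tally the strand column directly. The sketch below assumes the standard nine-column, tab-separated GTF layout, where the seventh field is the strand. The helper name count_strand_values and the two sample records are made up for illustration.

```python
from collections import Counter


def count_strand_values(gtf_lines):
    """Count each distinct value found in the strand column of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Ignore malformed records with fewer than nine columns.
        if len(fields) < 9:
            continue
        counts[fields[6]] += 1  # seventh column holds the strand
    return counts


# Illustrative records only, not taken from the Ensembl file.
sample_lines = [
    "#!genome-build GRCm39\n",
    '1\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n',
    '1\thavana\tgene\t300\t400\t.\t-\t.\tgene_id "G2";\n',
]
strand_counts = count_strand_values(sample_lines)
```

On the real file, the same function could be fed the lines from gzip.open(path, "rt"), where path points at the cached Mus_musculus.GRCm39.107.gtf.gz. If only + and - show up, the 2 is presumably introduced somewhere after the file is read.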

dweemx commented 9 months ago

I solved my issues by installing pyensembl from GitHub instead of using the pre-built images from conda.