openvax / gtfparse

Parsing tools for GTF (gene transfer format) files
Apache License 2.0

Conversion to Pandas in read_gtf has missing dependency #32

Open acrinklaw opened 1 year ago

acrinklaw commented 1 year ago

I installed gtfparse on an EC2 instance today and found that pyarrow is required when calling read_gtf with result_type="pandas", but it is not declared as a dependency and so is not pulled in during installation. Installing pyarrow manually fixes the issue.

In [5]: df = read_gtf("GCF_000001405.39_GRCh38.p13_genomic.gtf", result_type="pandas")
INFO:root:Extracted GTF attributes: ['gene_id', 'Dbxref', 'ID', 'Name', 'gbkey', 'gene', 'gene_biotype', 'transcript_id', 'Parent', 'model_evidence', 'original_biotype', 'product', 'description', 'partial', 'Note', 'exception', 'inference', 'end_range', 'start_range', 'gene_synonym', 'protein_id', 'tag', 'pseudo', 'The', 'transl_except', 'anticodon', 'standard_name', 'non-AUG', 'codons', '12S', '16S', 'transl_table', 'ATPase', 'isoform', 'similar', 'exon_number', 'number']
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 df = read_gtf("GCF_000001405.39_GRCh38.p13_genomic.gtf", result_type="pandas")

File /mnt/data/miniconda3/lib/python3.9/site-packages/gtfparse/read_gtf.py:292, in read_gtf(filepath_or_buffer, expand_attribute_column, infer_biotype_column, column_converters, usecols, features, result_type)
    289     result_df = result_df.select(valid_columns)
    291 if result_type == "pandas":
--> 292     result = result_df.to_pandas()
    293 elif result_type == "polars":
    294     result = result_df

File /mnt/data/miniconda3/lib/python3.9/site-packages/polars/internals/dataframe/frame.py:1962, in DataFrame.to_pandas(self, date_as_object, *args, **kwargs)
   1925 def to_pandas(
   1926     self, *args: Any, date_as_object: bool = False, **kwargs: Any
   1927 ) -> pd.DataFrame:
   1928     """
   1929     Cast to a pandas DataFrame.
   1930
   (...)
   1960
   1961     """
-> 1962     record_batches = self._df.to_pandas()
   1963     tbl = pa.Table.from_batches(record_batches)
   1964     return tbl.to_pandas(*args, date_as_object=date_as_object, **kwargs)

ModuleNotFoundError: No module named 'pyarrow'
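
Until the dependency is declared, a guard along these lines could surface the problem before the GTF file is parsed rather than at the final conversion step. This is only a minimal sketch and not part of gtfparse: the helper names `missing_modules` and `require_modules` are hypothetical, and the default list of pandas and pyarrow is an assumption based on the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names=("pandas", "pyarrow")):
    """Raise ImportError up front if any of the given modules is missing.

    The default tuple is an assumption: the traceback shows that the
    polars-to-pandas conversion goes through pyarrow.
    """
    missing = missing_modules(names)
    if missing:
        raise ImportError(
            "Missing modules needed for pandas conversion: "
            + ", ".join(missing)
            + ". Install them (for example with pip) and retry."
        )


# Intended use (commented out so the sketch runs without gtfparse installed):
# require_modules()
# df = read_gtf("GCF_000001405.39_GRCh38.p13_genomic.gtf", result_type="pandas")
```

Calling `require_modules()` before `read_gtf` would replace the late ModuleNotFoundError with an upfront message naming the package to install.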

Fantastic package, by the way; I use it in quite a few different workflows. Please let me know if you have any outstanding ideas or issues you would like help with. I'd be happy to contribute!