openvax / pyensembl

Python interface to access reference genome features (such as genes, transcripts, and exons) from Ensembl
Apache License 2.0

Try using Polars to read GTF files #272

Closed by iskandr 1 year ago

iskandr commented 1 year ago

A major performance bottleneck in generating reference data is parsing large GTF files (including all of our wacky heuristics for handling errors across Ensembl versions).

It may be faster to do the core TSV parsing in Polars and then transform the data from there.