qiime2 / q2-taxa

BSD 3-Clause "New" or "Revised" License

Collapse cannot handle large sparse tables due to pd.DataFrame #135

Closed wasade closed 3 years ago

wasade commented 3 years ago

Improvement Description: collapse should use biom.Table.collapse, which operates on a sparse representation of the data.

Current Behavior: The collapse method requires a FeatureTable transform to pd.DataFrame, which coerces the data into a dense representation. This is prohibitive for large datasets, artificially requiring in excess of 100 GB of memory.

Proposed Behavior: Change collapse to use biom.Table.collapse. Most of the surrounding changes should be minor, as that collapse method accepts an arbitrary function. I recommend using norm=False within the collapse.