larray-project / larray

N-dimensional labelled arrays in Python
https://larray.readthedocs.io/
GNU General Public License v3.0

use pyarrow as default read_csv backend if available and no weird option is given? #1046

Open gdementen opened 1 year ago

gdementen commented 1 year ago

Pandas 1.4+ can read CSV files via the Arrow library (pyarrow). This is faster but does not support all the features of the default reader.

df = pd.read_csv("some.csv", engine="pyarrow")

Or we could bypass Pandas by using pyarrow directly. I think that is a good idea for feather and HDF files, for which the goal is maximum speed (because I don't expect people to often read files in those formats that were not written by LArray). For CSV, however, the most important point is to be able to read anything we throw at it, and only then to do it quickly if possible. Pyarrow has many options for reading CSV files, but I would rather stick with the Pandas API. The question is thus how stable the pyarrow backend is when only basic options are given.
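A rough sketch of what "use pyarrow if available and no weird option is given" could look like, staying on the Pandas API. This is not LArray code: the names `BASIC_OPTIONS`, `pick_csv_engine` and `read_csv_fast_if_possible` are hypothetical, and the set of "basic" options is only an assumption that would need checking against the Pandas documentation and real files.

```python
import importlib.util

# Hypothetical whitelist: read_csv options assumed (not verified here) to
# behave the same with the pyarrow engine as with the default one.
BASIC_OPTIONS = frozenset({"sep", "header", "names", "index_col", "usecols", "encoding"})


def pyarrow_available():
    """Return True if the pyarrow package can be imported."""
    return importlib.util.find_spec("pyarrow") is not None


def pick_csv_engine(options, have_pyarrow=None):
    """Return "pyarrow" when it is installed and only basic options are used.

    Returns None otherwise, meaning "let Pandas use its default engine".
    """
    if have_pyarrow is None:
        have_pyarrow = pyarrow_available()
    if have_pyarrow and set(options) <= BASIC_OPTIONS:
        return "pyarrow"
    return None


def read_csv_fast_if_possible(filepath, **options):
    """Read a CSV file with the pyarrow engine when possible, else the default."""
    # Imported lazily so that the engine-selection helper above stays usable
    # (and testable) without Pandas installed.
    import pandas as pd

    engine = pick_csv_engine(options)
    if engine is not None:
        try:
            return pd.read_csv(filepath, engine=engine, **options)
        except Exception:
            # Deliberately broad: being able to read the file matters more
            # than speed, so any failure falls back to the default reader.
            pass
    return pd.read_csv(filepath, **options)
```

With this shape, a call such as `read_csv_fast_if_possible("some.csv", sep=";")` would try pyarrow first, while one passing an option outside the whitelist (for example `skipfooter=1`) would go straight to the default reader. The broad fallback trades a possible double read on failure for the guarantee that the pyarrow path never makes a readable file unreadable.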