
ENH/PERF: provide pyarrow engine option for to_csv #53618

Open · jorisvandenbossche opened 1 year ago

jorisvandenbossche commented 1 year ago

We added the engine="pyarrow" option to read_csv(), but we could also do the equivalent for writing CSVs with to_csv.

For writing CSVs, too, the pyarrow.csv writer can give a significant speed-up (especially because our own writer is pure Python). A quick showcase with a fully numeric DataFrame:

In [1]: import numpy as np, pandas as pd, pyarrow as pa

In [2]: df = pd.DataFrame(np.random.randn(1_000_000, 10), columns=list("abcdefghij"))

In [3]: %time df.to_csv("test_pandas.csv", index=False)
CPU times: user 10.7 s, sys: 418 ms, total: 11.1 s
Wall time: 12.2 s

In [4]: from pyarrow.csv import write_csv

In [5]: %timeit write_csv(pa.table(df), "test_arrow.csv")
1.88 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As with reading CSVs, we will need to map our keywords to pyarrow's keywords and set matching defaults (a sketch of such a mapping follows the output below). For the example above (without any custom settings) there are some small differences: the float precision appears to be the same by default, but the quoting of strings is different (that might be something to report to Arrow to make it more configurable; currently pyarrow quotes either all strings or none):

$ head -3 test_pandas.csv
a,b,c,d,e,f,g,h,i,j
0.9524744196076045,0.21913063081328743,-1.3427643339799094,-2.66862972948282,0.09749875898199477,1.2870849976054641,-0.4253992607503571,0.7960922061946342,-0.5458462139415978,1.797736594226238
-0.9861117158157412,-0.14856474665751174,0.7884605447776409,0.5774211281637796,-0.024799957231053778,0.2859682446685537,1.0508204680473783,-1.1513513705558094,0.3334435129938111,-0.28739104967528223
$ head -3 test_arrow.csv
"a","b","c","d","e","f","g","h","i","j"
0.9524744196076045,0.21913063081328743,-1.3427643339799094,-2.66862972948282,0.09749875898199477,1.2870849976054641,-0.4253992607503571,0.7960922061946342,-0.5458462139415978,1.797736594226238
-0.9861117158157412,-0.14856474665751174,0.7884605447776409,0.5774211281637796,-0.024799957231053778,0.2859682446685537,1.0508204680473783,-1.1513513705558094,0.3334435129938111,-0.28739104967528223
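
For reference, a minimal sketch of what such a keyword mapping could look like. The helper name is hypothetical and only two keywords are handled, though WriteOptions and write_csv are pyarrow's actual CSV-writing API:

import pyarrow as pa
from pyarrow import csv as pa_csv

def to_csv_pyarrow(df, path, sep=",", header=True):
    # Hypothetical mapping of a couple of to_csv keywords onto
    # pyarrow.csv.WriteOptions; a real implementation would have to
    # translate (or reject) many more keywords (quoting, float_format,
    # index handling, compression, ...).
    options = pa_csv.WriteOptions(
        include_header=header,  # to_csv's header=True/False
        delimiter=sep,          # to_csv's sep (supported in recent pyarrow)
    )
    pa_csv.write_csv(pa.table(df), path, write_options=options)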
jbrockmendel commented 1 year ago

Hypothetically, if pyarrow were made required, is there a viable option where we detect whether the pyarrow writer is equivalent and dispatch to it automatically? I.e., get the perf benefit without a new keyword.

lithomas1 commented 1 year ago

> Hypothetically, if pyarrow were made required, is there a viable option where we detect whether the pyarrow writer is equivalent and dispatch to it automatically? I.e., get the perf benefit without a new keyword.

In my experience, the pyarrow readers we have so far, other than parquet, always support fewer features/keywords than what we have in pandas, so we can't do this right off the bat.

We can always add an engine keyword temporarily, and then deprecate the Cython/Python writers we currently use afterwards (assuming pyarrow becomes a required dependency of pandas).

SysuJayce commented 1 year ago

Need this feature too.

In my experiments, writing 10 GB of data takes 4 minutes with pandas' to_csv and 10 seconds with pyarrow's write_csv.

jbrockmendel commented 1 year ago

> In my experience, the pyarrow readers we have so far, other than parquet, always support fewer features/keywords than what we have in pandas, so we can't do this right off the bat.

Right. The suggestion is that there is a subset of keywords (and maybe dtypes?) such that, if the user passes only those keywords, we can use the pyarrow writer. Can we automatically detect when we're in such a case and use the more performant writer without needing a new keyword?
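
For concreteness, the check could look roughly like this (purely illustrative; the helper name and the supported-keyword set are assumptions, not pandas internals):

def _can_use_pyarrow_writer(df, passed_kwargs):
    # Illustrative only: dispatch to pyarrow when every keyword the
    # user passed is one we know how to translate and every dtype can
    # be converted to Arrow; otherwise fall back to the default writer.
    supported = {"sep", "header", "index"}  # assumed subset
    if not set(passed_kwargs) <= supported:
        return False
    try:
        import pyarrow as pa
    except ImportError:
        return False
    try:
        pa.Schema.from_pandas(df)  # raises if a dtype can't be converted
    except (pa.ArrowInvalid, pa.ArrowNotImplementedError, pa.ArrowTypeError):
        return False
    return True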

lithomas1 commented 1 year ago

Just had a go at this.

It wasn't too hard to add in, but making tests pass is going to be really challenging.

The main issues I was running into are:

1) index=True doesn't work. I wasn't expecting this to work; IIRC, pyarrow doesn't have an "index" concept like pandas does, but it's kind of annoying since the default for to_csv is to write out the index.

2) We allow to_csv to return a string if the path is None. This is used a bunch in the tests, but we might want to consider deprecating it, since you can't round-trip via CSV this way anyway (you need to wrap the string in a StringIO for read_csv to work).

I think I'm probably going to refactor the tests to group all the index stuff together, and remove/reduce usage of to_csv returning a string in the tests first.

jorisvandenbossche commented 1 year ago
> 1. index=True doesn't work. I wasn't expecting this to work; IIRC, pyarrow doesn't have an "index" concept like pandas does, but it's kind of annoying since the default for to_csv is to write out the index.

Setting preserve_index to True when converting the pandas DataFrame to pyarrow (as in pa.Table.from_pandas(df, preserve_index=True)) doesn't fix that?
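
(As a side note, and if I recall pyarrow's naming convention correctly, preserve_index=True stores an unnamed index under the name __index_level_0__ rather than the empty column name pandas writes, so a rename would still be needed:)

In [1]: import pandas as pd, pyarrow as pa

In [2]: pa.Table.from_pandas(pd.DataFrame({"a": [1, 2, 3]}), preserve_index=True).column_names
Out[2]: ['a', '__index_level_0__']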

> 2. We allow to_csv to return a string if the path is None. This is used a bunch in the tests,

That might be relatively easy to replicate by writing to a BytesIO object, and then converting that to a string before returning?

In [112]: import io

In [113]: table = pa.table({'a': [1, 2, 3]})

In [114]: from pyarrow.csv import write_csv

In [116]: buf = io.BytesIO()

In [117]: write_csv(table, buf)

In [118]: buf.getvalue()
Out[118]: b'"a"\n1\n2\n3\n'
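
Decoding the buffer then gives exactly the string that to_csv would need to return (continuing the session above):

In [119]: buf.getvalue().decode("utf-8")
Out[119]: '"a"\n1\n2\n3\n'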

(we could still consider deprecating it, but short term that might be the easiest way to implement it, given that it's used a lot in the tests)

lithomas1 commented 1 year ago

Sorry for the really late update on this, but I have a PR up now (#54171).

I've kind of addressed the first point by converting the index to a column named "" (this probably doesn't work for a MultiIndex), and I still have to address the second point, writing to a string (shouldn't be too hard, though).

Nearly all the tests still fail, but at least 5 tests pass, which gives me some confidence in the index hack.
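
For illustration, the index hack amounts to something like the following (a hypothetical helper, not the actual code from #54171):

def _index_to_column(df):
    # Mimic pandas' CSV convention of an empty header for the index
    # column; as noted above, this likely breaks for a MultiIndex.
    out = df.copy()
    out.insert(0, "", df.index)
    return out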