pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.57k stars 17.56k forks source link

BUG: None becomes empty string when writing multiple columns to CSV, but double quotes "" when writing single columns #59116

Open mvashishtha opened 4 days ago

mvashishtha commented 4 days ago

Pandas version checks

Reproducible Example

import pandas as pd

df = pd.DataFrame([['a', 0], [None, 1]], columns=['col0', 'col1'])

df.to_csv('all_columns.csv', index=False, columns=['col0', 'col1'])
print(f'CSV from writing all columns:\n{open("all_columns.csv").read()}')

df.to_csv('first_column.csv', index=False, columns=['col0'])
print(f'CSV from writing all columns:\n{open("first_column.csv").read()}')

Issue Description

The None in the dataframe becomes an empty string when writing both columns, but "" when writing just the first column.

Expected Behavior

Should be consistent and use the representation for missing data in na_rep in both cases.

Installed Versions

versions ``` INSTALLED VERSIONS ------------------ commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2 python : 3.9.19.final.0 python-bits : 64 OS : Darwin OS-release : 23.5.0 Version : Darwin Kernel Version 23.5.0: Wed May 1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 2.2.1 numpy : 1.26.4 pytz : 2024.1 dateutil : 2.8.2 setuptools : 68.2.2 pip : 23.3.1 Cython : None pytest : 7.4.4 hypothesis : None sphinx : 7.3.7 blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.3 IPython : 8.12.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : 2024.3.1 gcsfs : None matplotlib : 3.8.4 numba : None numexpr : None odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : 16.0.0 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.13.0 sqlalchemy : None tables : None tabulate : 0.9.0 xarray : None xlrd : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None ```
asishm commented 4 days ago

The current pandas behavior matches python stdlib csv. The reason quoting becomes necessary is to distinguish a 1-element row with the single empty field from an empty row.

In [64]: with StringIO() as f:
    ...:     writer = csv.writer(f)
    ...:     writer.writerow(['col0'])
    ...:     writer.writerow(['a'])
    ...:     writer.writerow([None])
    ...:     f.seek(0)
    ...:     print(f.read())
    ...:
col0
a
""

Setting writer = csv.writer(f, quoting=csv.QUOTE_NONE) raises with Error: single empty field record must be quoted

xref: https://github.com/pandas-dev/pandas/issues/18676 xref cpython: https://github.com/python/cpython/issues/76436