API: creating DataFrame with no columns: object vs string dtype columns?

pandas-dev / pandas

API: creating DataFrame with no columns: object vs string dtype columns? #60338

Open jorisvandenbossche opened 1 week ago

jorisvandenbossche commented 1 week ago

A typical case we encounter in the tests is starting from an empty DataFrame, and then adding some columns.

Simplified example of this pattern:

import pandas as pd

df = pd.DataFrame()
df["a"] = values  # `values` stands for any column data
...

The DataFrame starts with an empty Index for its columns, and the default dtype for an empty Index is object. Inserting string labels for the actual columns into that Index then preserves the object dtype.

As long as we used object dtype for string column names, this was perfectly fine. But now that we will infer str dtype for actual string column names, it gets a bit annoying that the pattern above results in object columns rather than str columns.

This is not the best pattern, so maybe it's OK that it does not give the ideal result. But at the same time, since we use it quite regularly even in our own tests, I suppose it is not that uncommon.
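
A minimal sketch of the pattern described above, for checking the result on a given installation. The data values and the `direct` comparison frame are illustrative assumptions, not part of the original report, and the printed dtypes depend on the pandas version and on whether string inference is enabled, so no output is shown.

```python
import pandas as pd

# Start from an empty DataFrame and add columns one at a time.
df = pd.DataFrame()
df["a"] = [1, 2, 3]          # illustrative data
df["b"] = ["x", "y", "z"]    # illustrative data

# Inspect the dtype of the column labels built up by insertion.
print(df.columns)
print(df.columns.dtype)

# For comparison: labels passed up front go through normal inference.
direct = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(direct.columns.dtype)
```

If the two printed dtypes differ, that is the inconsistency this issue is about.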

WillAyd commented 1 week ago

I wonder if it would be less disruptive to have the empty Index default to a string data type and coerce to object as needed (at least when used in columns).

jorisvandenbossche commented 1 week ago

I was actually wrong about the default empty index being object dtype. While that is the case for directly creating an empty index, for DataFrame/Series we already deviate from that and create an empty range index:

>>> pd.DataFrame().index
RangeIndex(start=0, stop=0, step=1)
>>> pd.DataFrame().columns
RangeIndex(start=0, stop=0, step=1)
>>> pd.Index([])
Index([], dtype='object')

Now, the result is the same because inserting a string label in the integer-like range index also upcasts to object dtype.

But yeah, I think it could make sense for the columns to be string by default. This would be a backwards-incompatible change for the case where you start with an empty DataFrame and then insert columns with integer labels (those would then be cast to object dtype, instead of preserving the integer dtype).
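
A small sketch of the integer-label case mentioned above, assuming illustrative data. It only prints the resulting dtype so the behavior can be compared before and after any change to the default. No output is shown because it depends on the pandas version.

```python
import pandas as pd

# Start empty, then insert columns using integer labels.
df = pd.DataFrame()
df[0] = [10, 20]   # illustrative data
df[1] = [30, 40]   # illustrative data

# With an integer-like default for the empty columns, the labels are
# expected to stay integer; a string default would change this result.
print(df.columns)
print(df.columns.dtype)
```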

WillAyd commented 1 week ago

Good point, although it's hard to make any guarantees about what the data type of an empty DataFrame is with our current data model.

This might be another good motivating factor for PDEP-13 (https://github.com/pandas-dev/pandas/pull/58455) to implement the Null type and use that as the default. That's of course a ways off; in the meantime I think we just have to make a best effort at this, which I think would mean assuming string column labels.

simonjayhawkins commented 2 days ago

Since an Index object is immutable and an empty Index has no labels, does it actually matter what the dtype is when adding rows/columns? Why do we find a common dtype instead of just ignoring the dtype of the zero-length Index when creating the new Index?
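
One way to picture this suggestion is the sketch below. `insert_label` is a hypothetical helper written for illustration only; it is not a pandas API and does not reflect how pandas implements insertion internally. It assumes that "ignoring the dtype" means building the new Index from the inserted label alone whenever the existing Index is empty.

```python
import pandas as pd


def insert_label(index: pd.Index, label) -> pd.Index:
    """Hypothetical helper: append `label`, ignoring the dtype of an empty index."""
    if len(index) == 0:
        # No existing labels, so infer the dtype from the new label alone.
        return pd.Index([label], name=index.name)
    # Otherwise defer to the usual insert, which finds a common dtype.
    return index.insert(len(index), label)


# An empty RangeIndex no longer influences the dtype of a string label.
print(insert_label(pd.RangeIndex(0), "a").dtype)
# An empty object-dtype Index no longer forces integer labels to object.
print(insert_label(pd.Index([]), 0).dtype)
```

Under this approach the dtype of the first label decides the result, so both the string-label and the integer-label patterns would get normally inferred dtypes.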