Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
I went down a rabbit hole trying to unravel the many methods we have for rendering arrays. The status quo is not great (but not urgent). The methods I'm looking at are:
(EA|Index|Series|DataFrame).__repr__
Index.format (xref #55413)
(Series|DataFrame).to_string, to_html, to_latex (maybe other Styler-like things I'm not familiar with?)
to_csv
to_json
(haven't really looked at to_xml, to_stata, to_hdf, to_excel, etc.)
The pain points are roughly:
1) We special-case our internal EAs in ways that complicate the code and make it difficult to reason about. Some of these are just for perf, others actually break tests if we remove the special casing.
2) Keywords specific to dt64/td64 dtypes are used with our numpy dtypes but not for pyarrow dtypes or 3rd party dtypes. In particular I'm thinking of date_format in to_csv and in DatetimeIndexOpsMixin.format (xref #33319)
3) The boxed keyword in EA._formatter is documented as being True when rendering an EA inside an Index/Series/DataFrame, but it is enabled via fallback_formatter in format_array, which may or may not use it depending on spaghetti logic. Also, for dt64/td64/period we don't box the values in Series/DataFrame but do in Index/EA.
AFAICT this is largely motivated by the idea that eval(repr(index)) is valid, which I don't particularly care about.
4) Many of the code paths cast to object in ways that look unnecessary.
5) _Timedelta64Formatter and _Datetime64Formatter have a nat_rep keyword in __init__ that is never passed. The caller format_array does pass na_rep (which defaults to "NaN"; I expect actually using it would break a zillion tests).
6) to_json just doesn't work with general EA dtypes (xref #35420, #31917, #32037)
7) date_format in to_json behaves differently from everywhere else (xref #16492, #22317, #39135, #47930)
8) General hodgepodge of mismatched keywords in different to_foo methods (most of which is unavoidable)
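To make 7) and 8) concrete, here is a minimal sketch of the keyword mismatch, assuming a recent pandas. The frame and the column name "ts" are made up for the example. to_csv treats date_format as a strftime string, while to_json expects one of the fixed choices "epoch" or "iso":

```python
import pandas as pd

# Hypothetical one-row frame, just for illustration.
df = pd.DataFrame({"ts": pd.to_datetime(["2023-01-02 03:04:05"])})

# to_csv: date_format is a strftime pattern applied to datetime columns.
csv_text = df.to_csv(index=False, date_format="%Y/%m/%d")

# to_json: date_format selects a mode ("epoch" or "iso"), not a pattern.
json_iso = df.to_json(date_format="iso")

print(csv_text)
print(json_iso)
```

Same keyword name, two unrelated meanings, which is what F) below is about.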
Some ideas on improving the situation:
A) Deprecate Index.format (xref #55413) and implement Index._format_flat and Index._format_multi for internal use. We don't use most of the existing keywords internally, so the new methods would be appreciably simpler than the current ones.
B) add relevant keywords (float_format, decimal, date_format) to EA._formatter to make the fallback_formatter spaghetti in 3) unnecessary
C) add a vectorized formatting method to EA with the same behavior as our internal EAs' _format_native_values. Then avoid special-casing our internal EAs in the relevant places.
i) Note we used to have EA._formatting_values and I guess that was removed in favor of just _formatter. I think we could implement the default _vectorized_whatever in terms of _formatter so the new method would be optional for 3rd party authors.
ii) The "relevant places" would include blocks.to_native_types (used in to_csv), to_json, Index._format_native_values, and format_array
D) Change the boxed behavior to box when inside an Index but not in Series/DataFrame.
E) For 5), just accept/document that na_rep is ignored for these dtypes and you'll always get "NaT".
F) standardize the "date_format" behavior across to_json and other places where it shows up
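A rough sketch of what C.i) could look like: a default vectorized formatter built on top of the existing per-element EA._formatter. The function name format_via_formatter and its na_rep handling are assumptions for illustration only, not an existing pandas API; the only pandas calls used are EA._formatter, EA.isna and iteration.

```python
import numpy as np
import pandas as pd


def format_via_formatter(arr, *, boxed=False, na_rep="NaN"):
    """Hypothetical default: format every element with arr._formatter.

    Third-party EAs would get this for free and could override it
    with something faster.
    """
    formatter = arr._formatter(boxed=boxed)  # existing per-element hook
    mask = np.asarray(arr.isna())
    out = np.empty(len(arr), dtype=object)
    for i, val in enumerate(arr):
        # Missing values get na_rep; everything else goes through the hook.
        out[i] = na_rep if mask[i] else formatter(val)
    return out


arr = pd.array([1, None, 3], dtype="Int64")
formatted = format_via_formatter(arr, na_rep="NaN")
print(formatted)
```

This is the slow path by construction. The point is only that the vectorized method can have a working default, so internal EAs could override it for perf without being special-cased by the callers listed in C.ii).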