Open luxedo opened 4 months ago
Thanks for the request. It seems to me adding Series.format would not be any more performant nor significantly easier to use than the apply approach (in fact, would we just be calling apply in the implementation?). If that is the case, I don't think this is worth the cost to add to pandas (code, docs, tests, bugfixes).
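For readers unfamiliar with the apply approach mentioned above, here is a minimal sketch. The sample values and the variable names ratios and as_percent are made up for illustration; the only pandas call used is Series.apply.

```python
import pandas as pd

# Made-up sample data: fractions to be displayed as percentages.
ratios = pd.Series([0.1234, 0.5, 0.98765])

# A format spec's bound .format method is an ordinary callable,
# so Series.apply can call it once per element.
as_percent = ratios.apply("{:.1%}".format)

print(as_percent.tolist())  # ['12.3%', '50.0%', '98.8%']
```

This still runs one Python-level call per element, which is the cost being weighed in the comment above.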
Yes, I agree that it's just a wrapper around apply, but isn't that the case for many other str methods? My rationale comes from giving more meaning when trying to change data visualization. Also, many Python str methods have an equivalent pandas method, so why not add this one?
> Also, many Python str methods have an equivalent pandas method, so why not add this one?
A fair point! We would need to consider adding it for the various string arrays as well. If there is a performant .format we could introduce for PyArrow string arrays, I would be strongly in favor here. Otherwise, if that is still using apply then I'm -0.
cc @WillAyd
> Yes, I agree that it's just a wrapper around apply, but isn't that the case for many other str methods?
While true, this is also one of the performance issues with pandas.
import pandas as pd

size = 100000
ser = pd.Series(size * ["this is a string "])
%timeit ser.str.rstrip().str.lstrip().str[-3:]
# 25 ms ± 129 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit ser.apply(lambda x: x.rstrip().lstrip()[-3:])
# 11.7 ms ± 173 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The version that looks vectorized is actually slower. So I am quite hesitant to expand on this pattern if there is no gain.
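The %timeit lines above only run inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The function names chained_accessor and single_apply are made up here, and timings will vary with machine, pandas version, and string dtype, so no numbers are claimed.

```python
import timeit

import pandas as pd

size = 100_000
ser = pd.Series(size * ["this is a string "])


def chained_accessor(s):
    # Three separate passes over the data, one per .str call.
    return s.str.rstrip().str.lstrip().str[-3:]


def single_apply(s):
    # One pass: all three string operations happen inside one Python call.
    return s.apply(lambda x: x.rstrip().lstrip()[-3:])


# Both approaches must agree before their timings are worth comparing.
assert chained_accessor(ser).tolist() == single_apply(ser).tolist()

for func in (chained_accessor, single_apply):
    # Best of 3 repeats, 5 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(ser), number=5, repeat=3)) / 5
    print(f"{func.__name__}: {best * 1e3:.1f} ms per call")
```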
I am also pretty neutral on this change, maybe even a -0.5.
While I see your point about there being a Python str.format, most of the accessor methods we offer are computed in a vectorized way that improves performance. str.format is specific to the Python runtime, so there isn't much that can be improved over the apply approach.
I see this is not getting much traction lol but my argument goes more towards readability and standardization than performance.
I have another caveat. What if the format spec has more than one replacement field, like:
"{0} {1}"
How would the series treat this value? How should we display error information? Could it be implemented for DataFrame and still make sense?
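To make the caveat concrete, here is a small sketch. The first part is plain Python behavior; the second part shows only one possible DataFrame interpretation (unpacking each row positionally), which is an assumption for illustration and not an existing or proposed pandas API. The column names and values are made up.

```python
import pandas as pd

spec = "{0} {1}"

# Plain Python: a single positional argument cannot satisfy two
# replacement fields, so str.format raises IndexError.
try:
    spec.format("a")
    raised = False
except IndexError as exc:
    raised = True
    print(f"IndexError: {exc}")

# One possible reading for a DataFrame (an assumption): unpack each
# row's values positionally into the format spec.
df = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})
full = df.apply(lambda row: spec.format(*row), axis=1)

print(full.tolist())  # ['Ada Lovelace', 'Alan Turing']
```

For a Series, each element supplies only one value, so any element-wise implementation would have to decide whether to raise, as plain Python does, or define some other behavior.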
I still think that this would simplify code and make it more readable, especially in a Jupyter notebook context. But I also see that this is a little deeper than what I initially thought.
I don't think the idea should be rejected purely due to performance concerns, since pandas built-in string methods aren't really more performant than regular python string operations (I'd be happy to get corrected on this if I'm wrong, but there are open issues about pandas string performance)
@asishm that may have been true historically, but when using pyarrow for strings (which 3.0 will default to, if installed) performance will be much better
@WillAyd Thanks! I thought I had run some benchmarks on main with pyarrow strings before commenting, but I can't reproduce anymore.
Feature Type
[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
Many times I wanted to convert a float column into percentages, but it's very verbose. Adding a str.format method would make it easy to convert any numeric column to percentages and also allow for many other use cases.
Feature Description
Alternative Solutions
Additional Context
There are many alternatives for creating formatted strings, but this feature should add the possibility to store the formatted values in a pd.Series instead.
I can submit a PR if this idea makes sense.
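Until something like this exists in pandas, a user-side sketch is possible with pandas' accessor registration hook. Everything named here is hypothetical: the accessor name fmt, the class FormatAccessor, and its format method are made up for illustration, and the body is just the apply approach discussed in the thread, so it has the same performance.

```python
import pandas as pd


# "fmt" is a made-up accessor name, not part of pandas.
@pd.api.extensions.register_series_accessor("fmt")
class FormatAccessor:
    def __init__(self, series: pd.Series) -> None:
        self._series = series

    def format(self, spec: str) -> pd.Series:
        # Thin wrapper: applies spec.format to each element in Python.
        return self._series.apply(spec.format)


growth = pd.Series([0.05, 0.125], name="growth")
formatted = growth.fmt.format("{:.1%}")

print(formatted.tolist())  # ['5.0%', '12.5%']
```

The result is a regular Series that keeps the original index and name, so the formatted values can be stored in a column like any other data.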