pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Add `Series.str.format` method #59356

Open luxedo opened 4 months ago

luxedo commented 4 months ago

Feature Type

Problem Description

Many times I have wanted to convert a float column to percentages, but doing so is very verbose. Adding a str.format method would make it easy to convert any numeric column to percentage strings and would also allow for many other use cases.

Feature Description

    # Series.str class
    def format(self, format_spec: str):
        return self.apply(lambda x: format_spec.format(x))

series.str.format("{:%}")

Alternative Solutions

series.apply(lambda x: "{:%}".format(x))
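
For illustration, here is a minimal sketch of the two spellings side by side, written as a standalone helper rather than an accessor method. The name `format_series` is hypothetical and is not part of pandas; it only wraps `Series.map`, so it is no faster than the `apply` version.

```python
import pandas as pd


def format_series(ser: pd.Series, format_spec: str) -> pd.Series:
    """Hypothetical helper (not a pandas API): call format_spec.format on each element."""
    return ser.map(format_spec.format)


ser = pd.Series([0.25, 0.5, 1.0])

# The existing, more verbose spelling from the issue.
verbose = ser.apply(lambda x: "{:%}".format(x))

# The shorter spelling the proposal is after, here as a plain function.
short = format_series(ser, "{:%}")

print(short.tolist())  # ['25.000000%', '50.000000%', '100.000000%']
```

Note that `"{:%}"` uses six decimal places by default; a spec such as `"{:.1%}"` gives shorter output.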

Additional Context

There are many alternatives for creating formatted strings, but this feature would make it possible to store the formatted values in a pd.Series instead.

I can submit a PR if this idea makes sense.

rhshadrach commented 4 months ago

Thanks for the request. It seems to me adding Series.format would not be any more performant nor significantly easier to use than the apply approach (in fact, would we just be calling apply in the implementation?). If that is the case, I don't think this is worth the cost to add to pandas (code, docs, tests, bugfixes).

luxedo commented 3 months ago

Yes, I agree that it's just a wrapper around apply, but isn't that the case for many other str methods? My rationale is that a dedicated method makes the intent clearer when changing how data is displayed. Also, many Python str methods have an equivalent pandas method, so why not add this one?

rhshadrach commented 3 months ago

> Also, many Python str methods have an equivalent pandas method, so why not add this one?

A fair point! We would need to consider adding it for the various string arrays as well. If there is a performant .format we could introduce for PyArrow string arrays, I would be strongly in favor here. Otherwise, if that is still using apply then I'm -0.

cc @WillAyd

rhshadrach commented 3 months ago

> Yes, I agree that it's just a wrapper around apply, but isn't that the case for many other str methods?

While true, this is also one of the performance issues with pandas.

import pandas as pd

size = 100000
ser = pd.Series(size * ["this is a string "])
%timeit ser.str.rstrip().str.lstrip().str[-3:]
# 25 ms ± 129 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit ser.apply(lambda x: x.rstrip().lstrip()[-3:])
# 11.7 ms ± 173 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The version that looks vectorized is actually slower. So I am quite hesitant to expand on this pattern if there is no gain.
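
For anyone who wants to rerun this comparison outside IPython, a rough sketch using the standard-library `timeit` module follows. The function names `chained` and `applied` are made up for this example, and the timings depend on the pandas version and the string dtype in use, so no expected numbers are given.

```python
import timeit

import pandas as pd

size = 100_000
ser = pd.Series(size * ["this is a string "])


def chained(s: pd.Series) -> pd.Series:
    # Three separate .str accessor calls, each producing an intermediate Series.
    return s.str.rstrip().str.lstrip().str[-3:]


def applied(s: pd.Series) -> pd.Series:
    # One pass over the data with plain Python str methods.
    return s.apply(lambda x: x.rstrip().lstrip()[-3:])


# Both spellings should produce the same values.
same_values = chained(ser).tolist() == applied(ser).tolist()

runs = 5
t_chained = timeit.timeit(lambda: chained(ser), number=runs) / runs
t_applied = timeit.timeit(lambda: applied(ser), number=runs) / runs

print(f"chained .str calls: {t_chained * 1e3:.1f} ms per run")
print(f"single apply:       {t_applied * 1e3:.1f} ms per run")
```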

WillAyd commented 3 months ago

I am also pretty neutral on this change, maybe even a -0.5.

While I see your point about there being a Python str.format, most of the accessor methods we offer are computed in a vectorized way that improves performance. str.format is specific to the Python runtime, so there isn't much that can be improved over the apply approach.

luxedo commented 3 months ago

I see this is not getting much traction lol but my argument goes more towards readability and standardization than performance.

I have another caveat. What if the format spec has more than one replacement field, like:

"{0} {1}"

How would the series treat this value? How should we display error information? Could it be implemented for DataFrame and still make sense?
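
To make the question concrete, here is a small sketch of what plain Python does today. With a single value per element, a two-field spec raises `IndexError`. A DataFrame version would have to decide how columns map to fields; the column names and the positional mapping below are assumptions made for this example only, not a proposed design.

```python
import pandas as pd

ser = pd.Series([0.1, 0.2])

# Each element supplies only one positional argument, so "{1}" has nothing to fill it.
try:
    ser.apply(lambda x: "{0} {1}".format(x))
    raised = False
except IndexError:
    raised = True

print(raised)  # True

# One possible DataFrame reading: columns map to fields by position.
df = pd.DataFrame({"name": ["a", "b"], "share": [0.25, 0.5]})
labels = pd.Series(
    ["{0}: {1:.0%}".format(name, share) for name, share in zip(df["name"], df["share"])],
    index=df.index,
)

print(labels.tolist())  # ['a: 25%', 'b: 50%']
```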

I still think that this would simplify code and make it more readable, especially in a Jupyter notebook context. But I also see that this goes a little deeper than what I initially thought.

asishm commented 3 months ago

I don't think the idea should be rejected purely due to performance concerns, since pandas built-in string methods aren't really more performant than regular python string operations (I'd be happy to get corrected on this if I'm wrong, but there are open issues about pandas string performance)

WillAyd commented 3 months ago

@asishm that may have been true historically, but when using pyarrow for strings (which 3.0 will default to, if installed) performance will be much better

asishm commented 3 months ago

@WillAyd Thanks! I thought I had run some benchmarks on main with pyarrow strings before commenting, but I can't reproduce anymore.