
Pandas string dtype needs from NumPy - prototyping & plan of attack #47884

Open rgommers opened 1 year ago

rgommers commented 1 year ago

The purpose of this issue is to discuss a plan of attack for improving string dtypes in NumPy to better suit Pandas.

Context

  1. @seberg has spent a ton of effort improving the infrastructure NumPy offers for implementing dtypes, both in NumPy itself and as third-party dtypes. See NEP 40-43. It's far enough along now that it makes sense to start using it, even if some things may still be missing. String dtypes were explicitly thought about in that design.
  2. Pandas is a main potential consumer of new/improved string dtypes. There are currently two ways to do strings in Pandas: via object (no longer recommended) and via StringDtype (which, it looks like, can have multiple implementations); see the sketch after this list and https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-types.
  3. There's bandwidth available (thanks to a cross-project NASA grant for the next 2.5 years, @jreback was involved in adding this topic to the grant from the Pandas side) to work on a prototype for improved string dtypes that can improve on what NumPy offers today, focused on Pandas needs. This can live in a separate repo / a Pandas fork for quite a while.
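For reference, a minimal illustration of the two existing options from point 2, using plain present-day pandas (nothing here is new API):

```python
import pandas as pd

# Option 1: object dtype (the legacy default) -- each element is a Python
# str, or a missing-value sentinel
s_obj = pd.Series(["a", "b", None])                  # dtype: object

# Option 2: the dedicated string dtype, which normalizes missing values to pd.NA
s_str = pd.Series(["a", "b", None], dtype="string")  # dtype: string
```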

There are a ton of relevant threads and issues for both NumPy and Pandas; I'm not going to try to link them all here.

Proposed way of approaching this

There are folks from Pandas (I think at least @jreback, @jbrockmendel and @jorisvandenbossche), NumPy (@seberg, @mattip), and the NASA grant (@peytondmurray, who will do some of the heavy lifting here on the prototype; Cc @dharhas as PI) with an interest in this. It's probably also relevant for other dataframe libraries; what Arrow provides is relevant; the dataframe interchange protocol probably too. In short: many potentially interested people and projects. So I'd suggest we add comments, new ideas, and concerns on this issue - and then also have a call next week with whoever is interested, for a higher-bandwidth conversation on how to get started.

A few thoughts on what to do

  1. A true variable-length string dtype for NumPy is probably most interesting (more so than, for example, reimplementing the fixed-length dtypes in the new dtype framework). Such a variable-length dtype is also mentioned on the NumPy Roadmap. So best to only focus on that first.
  2. Start working in a separate repo for this, and link it from here. I'll also note that @seberg has a bunch of example dtypes (including a string one) in https://github.com/seberg/experimental_user_dtypes.
  3. Collect Pandas wishes, needs and pain points in this issue. Cross-link to other issues as appropriate (I apologize for not digging through the Pandas issue tracker to make a start - I figured that Pandas devs may know already what is most relevant here, and I don't).
jorisvandenbossche commented 1 year ago

A potentially tricky aspect might be missing values. Currently, all the variants of string data types in pandas support them: with object dtype you can put anything in there, and pandas treats all of None/np.nan/pd.NA as missing; with StringDtype we still use an object dtype array under the hood, so it works the same, we just limit it to a single possible NA value. If there were a new variable-length string dtype in numpy, I assume it would only allow actual strings inside the arrays?
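To make the current behavior concrete (plain present-day pandas):

```python
import numpy as np
import pandas as pd

# With object dtype, several different sentinels all count as missing:
s = pd.Series(["a", None, np.nan, pd.NA], dtype=object)
print(s.isna().tolist())   # [False, True, True, True]

# StringDtype normalizes them all to the single pd.NA value:
print(pd.Series(["a", None, np.nan, pd.NA], dtype="string").tolist())
# ['a', <NA>, <NA>, <NA>]
```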

There will of course be workarounds possible if the numpy array itself doesn't support missing values, like pairing it with a boolean mask, just as we do for the numeric nullable arrays. (Right now, StringDtype / StringArray doesn't use a mask, because the object dtype array can already hold the missing value.) In that sense it would just be consistent with how we deal with this for numeric data types; it was only because of using object dtype for strings that things were more flexible up to now.
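For comparison, a minimal sketch of the mask-based scheme the nullable numeric arrays already use; the same values-plus-mask idea could back a string array without native NA support:

```python
import numpy as np
import pandas as pd

values = np.array([1, 0, 3], dtype="int64")    # 0 is just a placeholder
mask = np.array([False, True, False])          # True marks a missing element
arr = pd.arrays.IntegerArray(values, mask)
print(arr)   # [1, <NA>, 3], dtype: Int64
```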

seberg commented 1 year ago

We can encode NA into the dtype, that is no problem. But I agree, there are a couple of open questions around NA.

NumPy could just support the NA part, in which case NA would need to be available to NumPy, though (move NA into NumPy?). Or there could be two DTypes, or a dtype flag indicating NA support, and NumPy could reject the NA version.

Since we probably want to do NA as a special value (i.e. a NULL/pd.NA object) and not a bitmask, I do think NumPy should probably just support it, even if support could be a bit spotty w.r.t. actually working with that NA value.

peytondmurray commented 1 year ago

Hey all, here's a little bit of background information on this effort as well as some of the major points that were discussed with folks during the 2022-07-28 Data API call.

Background

Originally, pandas used the numpy object dtype, which is quite slow. More recently, StringDtype, which can use pyarrow under the hood, has become available. Current pandas users are able to switch between object and this new StringDtype, though object remains the default for strings. However, lately there's been a lot of work on numpy's dtype infrastructure, including a type hierarchy as well as a new way for users to define custom dtypes, opening the door for work to begin on a new, modern string dtype in numpy which pandas could use instead.

One of the main motivations of the dtypes work has been improved string dtypes, for which pandas is expected to be the main consumer. While the new dtype infrastructure is not yet complete, it's usable at the moment. Importantly, there's now some bandwidth available because of a cross-project NASA grant with support for a dedicated resource for a few days a week into the foreseeable future.

The new dtypes will be built in a repo external to numpy; one or more prototypes will be built in this separate repo, and then used in a fork of pandas. After iterating a few times, checking as we go that the new dtypes solve some of the existing pain points, we will move toward a replacement string dtype for numpy.

Goals & Discussion

pandas appears to be the project which stands to be affected most by the introduction of a new string dtype. Currently pandas uses two string backends:

  1. A chunked array backend for StringDtype, which uses pyarrow under the hood
  2. The numpy object dtype, which is the default

As it stands, pandas doesn't guarantee any specific in-memory string representation, so this is something that can be changed or swapped out later. Down the road, it wouldn't be unreasonable to expose the string buffers to the user.

On a related note, numpy will in the future work to make it more difficult to inadvertently use object dtypes.

Initial feedback from folks during the 2022-07-28 Data API call

The main implementation choice will use two arrays, one to hold the characters, and another to hold offsets pointing to the start of each element. This approach is already used by the dataframe interchange protocol as well as by Apache Arrow.
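As a toy illustration of that two-array layout (plain NumPy, not the actual implementation): string i lives at data[offsets[i]:offsets[i+1]].

```python
import numpy as np

strings = ["pandas", "", "numpy"]

# One flat buffer of UTF-8 bytes, plus an offsets array marking element starts:
data = np.frombuffer("".join(strings).encode("utf-8"), dtype=np.uint8)
offsets = np.cumsum([0] + [len(s.encode("utf-8")) for s in strings])

for i, s in enumerate(strings):
    assert data[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8") == s
```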

Alternative implementations

The arrow mailing list is discussing a new proposed string memory layout; see those discussions for more information.

This format is already popular and is used in a number of databases: MonetDB, Hyper, DuckDB, and Velox. It seems to be moving forward, but it isn't something that could be pushed out over an interchange protocol, because it involves pointer swizzling.

seberg commented 1 year ago

> The main implementation choice will use two arrays, one to hold the characters, and another to hold offsets pointing to the start of each element

Presumably, this is per-array storage. NumPy has no clear concept of that, although maybe it could be added (the problem is mainly about views, I suspect). The alternative may be per-"dtype" storage: arr.dtype.data would basically hold that storage chunk. In that case it would be desirable to clone the dtype for arr2, unless arr2 is a known view into arr (which should work, but probably needs careful checks in a few array-creation places).
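To make the copy-vs-view distinction concrete, a toy sketch of the per-"dtype" storage idea (all names here are hypothetical, not any real NumPy API):

```python
import copy

class StringDTypeSketch:
    """Hypothetical dtype that owns the sidecar character storage."""
    def __init__(self):
        self.data = bytearray()   # the storage chunk arr.dtype.data would hold

dtype = StringDTypeSketch()

# A view of arr would share the same dtype instance, and hence the storage ...
view_dtype = dtype

# ... while an independent copy would get a cloned dtype with its own storage.
copy_dtype = copy.deepcopy(dtype)
assert copy_dtype.data is not dtype.data
```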

> but isn't something that could be pushed out over an interchange protocol because it involves pointer swizzling.

In principle we could have StringDType(storage_scheme="...") to define multiple storage schemes. You could still always use the default one for ufuncs, etc. Although exchange is nicest if it is zero-copy, which this would effectively not be.

ngoldbaum commented 1 year ago

I have some updates on this effort.

tl;dr: we're thinking of initially implementing the string dtype as a strongly-typed object array, storing pointers to string buffers instead of storing the string data as variable-width elements in the ndarray storage buffer. I'd like feedback on this idea from stakeholders before we go further.

Repository for user dtypes in the Numpy github org

Since @peytondmurray's last update, @seberg created a new repository for the dtype code to live in:

https://github.com/numpy/numpy-user-dtypes

It's likely that eventually some of these dtypes will be upstreamed to Numpy, but keeping them separate for now allows easier iteration and experimentation.

Initial work on string dtypes

For the past month or so I've been working on asciidtype, a fixed-width dtype representing ASCII strings. The main goal was to write a simple string dtype in the new numpy dtype API to elucidate any issues with string dtypes while avoiding any complexities arising from dealing with variable-width strings or unicode.

I'm now feeling more confident about moving on to the real variable width string dtype Pandas needs, but I think we need to implement this using a different approach than what Peyton described in August.

Problems with storing variable-width data in Numpy arrays

Our initial plan for this involved storing the variable-length strings in the numpy array itself, with an auxiliary array holding indices into the array buffer for the locations of the array elements. This additional array would either be stored on the dtype or make use of an as-yet undeveloped facility in Numpy for per-array storage.

Both approaches would require modifications to Numpy. We would either need to add a facility for per-array storage or, if we store the offsets on the dtype, likely need to modify how dtypes are handled when array copies or views are created, to ensure a new offset array is created as appropriate.

Even if we solve those issues, I realized recently that any kind of variable-length dtype is going to run into other, more fundamental issues with Numpy's assumption that array data are fixed-width. One example in the new dtype API is the current signature of getitem and setitem for dtypes:

```c
PyObject *user_dtype_getitem(PyArray_Descr *dtype, char *dataptr)
```

That is, it takes only a reference to the array's dtype instance and a pointer into the array buffer. For variable-width strings, we would have no way of knowing the length of the string dataptr points at. To get this to work in a natural way, we would need to modify the dtype API so that (as one possible approach) both getitem and setitem also accept the index of the array element being selected. Additionally, dtypes currently have no hooks into the numpy array iteration infrastructure, so it's likely we'd also need to modify the dtype API to add those hooks, so that dataptr points at the correct location and operations like array[::3] do the correct thing; strided access would no longer correspond to fixed-width offsets in memory when selecting array elements.
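A toy Python model of the problem; the index-taking variant is purely hypothetical, mirroring the possible API change sketched above:

```python
# Two strings packed back-to-back with no separators:
buffer = b"pandasnumpy"

# A fixed-width getitem only needs dataptr (here, a byte offset) and itemsize.
# For variable-width strings, dataptr alone can't tell us where element i ends:
def getitem(dataptr: int) -> str:
    raise NotImplementedError("string's end is unknowable from dataptr alone")

# If getitem also received the element index, it could consult an offsets table:
offsets = [0, 6, 11]
def getitem_with_index(dataptr: int, index: int) -> str:
    return buffer[offsets[index]:offsets[index + 1]].decode("utf-8")

assert getitem_with_index(0, 1) == "numpy"
```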

New plan: store pointers to strings

We are now leaning in the direction of storing pointers to string buffers in the array storage. This avoids the issues with variable-width data storage in ndarray, since internally we'd be storing one pointer-width integer for each string. I believe it should also be possible to implement this approach using the experimental dtype API in Numpy as it currently exists.
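A toy model of that storage scheme, with Python standing in for the C implementation and a dict simulating separately allocated string buffers:

```python
import numpy as np

heap = {}   # simulates separately malloc'ed string buffers

def alloc(s: str) -> int:
    addr = len(heap) + 1          # fake, unique "address"
    heap[addr] = s.encode("utf-8")
    return addr

# The ndarray itself holds one pointer-sized integer per element:
arr = np.array([alloc(s) for s in ["pandas", "", "numpy"]], dtype=np.uintp)
print(heap[int(arr[2])].decode("utf-8"))   # "numpy"
```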

The main downside is that ufuncs, casts, and other operations that require looping over all the data will need to go through a pointer for each array element and, without some care around the storage strategy, will not use CPU caches efficiently.

That said, performance will likely be improved compared with the object dtype: the dtype will know that the pointers are to string buffers, so there will be no need to go through the Python C API and no need to acquire the GIL in ufuncs or casting loops to unwrap PyObject instances and access the string data.

My take is that we won't know whether the performance of this approach is acceptable until we try, and we can always go back and apply optimizations afterwards if we need to. A simpler implementation will also allow us to have a prototype we can use to explore integration with pandas.

ngoldbaum commented 1 year ago

Since my last update @peytondmurray and I have made a bunch of progress on the string dtype.

It now mostly works. I'm sure there's still lots of stuff that needs to be added, but basic operations work fine. To get a feeling for where we stand in terms of explicitly supported things, take a look at the unit tests. @peytondmurray has been focusing on expanding functionality and adding support for ufuncs where that makes sense.

I have a development branch of pandas in my github fork that supports creating pandas data types from StringDType numpy arrays. I also added a NumpyStringArray pandas extension array. This makes it possible to pass string[numpy] as a dtype to pandas functions, and from a user-facing perspective it should behave identically to string[python].
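Assuming that development branch is installed, usage would look like this (string[numpy] exists only in the fork, not in released pandas):

```python
import pandas as pd

# Requested like any other pandas string dtype:
s = pd.Series(["a", "b", None], dtype="string[numpy]")
# From the user's perspective this should behave like dtype="string"
# (i.e. string[python]), with missing values shown as <NA>.
```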

I also added support for missing data, with a hook so that the missing data value used by the instances of the dtype backing NumpyStringArray is pandas.NA. I still need to integrate this with the pandas tests, but that's next on my list. I'm sure I'll find lots of other things that break in the process.

The development branch of pandas I'm working on can't be upstreamed until Numpy 1.25 is released at the earliest. At that point it will become possible to run tests on pandas' CI that use stringdtype, since stringdtype relies on features of the experimental numpy dtype C API that aren't available in Numpy 1.24. It may also be the case that pandas doesn't want to support stringdtype until it is upstreamed in Numpy, in which case upstreaming all my changes will have to wait for that.

I'm trying to regularly rebase my changes on the pandas main branch and am working to upstream any changes that do make sense to upstream right now. For example, I put in a PR to unify all the block storage classes that rely on NumPy arrays into a single NumpyBlock class, which makes it easier to generically support all numpy dtypes, including new dtypes that aren't yet in numpy itself. I also put in some PRs to refactor the string asv benchmarks to make them more fair.

Speaking of benchmarks, right now the performance of most operations is roughly comparable between object string arrays and the NumpyStringArray extension array I added. If you use a StringDType numpy array directly, that's substantially faster. I'm hopeful there are more low-hanging optimizations we can find. On a longer timeline it should be possible to improve the performance of most string method operations by adding support in numpy for string ufuncs. However, that's a much larger piece of work, as it will require figuring out where the ufuncs should live in Numpy and what the user-facing API should look like. I don't think it's necessary to get that done for the MVP, but it's something I'd like to figure out, since that will bring performance closer to PyArrow string array operations.

jbrockmendel commented 1 year ago

> added a hook so that the missing data value used by the instances of the dtype backing NumpyStringArray is pandas.NA

Your call, but it isn't obvious to me this is the way to go. (I'm generally ornery about pd.NA). My preference would be for pandas to treat this like any other numpy array/dtype. Ideally we wouldn't even need to stuff it into an ExtensionArray. Maybe that's what you're referring to in your last paragraph about using StringDtype directly?

ngoldbaum commented 1 year ago

> Ideally we wouldn't even need to stuff it into an ExtensionArray. Maybe that's what you're referring to in your last paragraph about using StringDtype directly?

Yes, I am using an ExtensionArray. I want both the ExtensionArray and direct use of the dtype to function correctly. I initially wanted to avoid the ExtensionArray, but doing it this way lets me use the ExtensionArray tests "for free" to find places where we need to fix things on the numpy side. It also allows users to switch from dtype="string" to dtype="string[numpy]" and ideally see no behavior changes, but get improved performance and memory consumption for free.

ngoldbaum commented 11 months ago

I proposed a NEP to upstream the dtype prototype to NumPy, targeting NumPy 2.0.

If the NEP is accepted, I will start to work on moving the code into NumPy itself. Once the DType implementation is merged in numpy's development branch, I will rework my Pandas fork to use the built-in StringDType instead of the implementation outside of NumPy.

This is the version I will propose to upstream to Pandas, hopefully in only a few pull requests. This might happen as early as this fall or winter if the NEP process and upstreaming go smoothly, but it could also slip to 2024.

I hope it won't be a controversial feature given the improvements over object arrays, but I also understand there's some desire to move away from numpy and towards pyarrow, so I'm not assuming support will be merged in Pandas.

If anyone is interested in playing with or giving feedback on the NumPy StringDType prototype or my fork of pandas using it, please feel free to reach out.

ngoldbaum commented 6 months ago

Hi all, it's now looking like NEP 55 will be accepted and stringdtype will ship in NumPy 2.0. I already have patches to support stringdtype in Pandas, so hopefully we'll be able to simultaneously have versions of pandas and numpy that support UTF-8 strings very soon and have no need for string object arrays unless a user explicitly passes one in.

If for some reason that timing slips and we don't ship it, I still expect stringdtype to be available in numpy dev within the next few weeks.
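For a taste, a short sketch using the API NEP 55 proposes (names as they appear in NumPy 2.0; np.strings is the string-ufunc namespace and na_object configures an optional missing-data sentinel):

```python
import numpy as np  # requires NumPy >= 2.0

# A variable-width UTF-8 string dtype, native to numpy:
arr = np.array(["numpy", "pandas"], dtype=np.dtypes.StringDType())
print(np.strings.upper(arr))            # ['NUMPY' 'PANDAS']

# An NA-aware variant, configured here with a NaN-like sentinel:
dt = np.dtypes.StringDType(na_object=float("nan"))
arr_na = np.array(["a", float("nan")], dtype=dt)
print(np.isnan(arr_na))                 # [False  True]
```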

As soon as stringdtype is available in numpy dev, my plan is to update my pandas patches to account for stringdtype being available in numpy and propose a pandas PR. It's not a trivial amount of code, but it's also not a huge amount (currently +484 -150 lines, spread across 20 or so files). A lot of that code will be simplified once I don't need to depend on an external stringdtype package outside numpy, which caused a number of circular imports in my prototype.

I went over all of this a bit with @MarcoGorelli and @jreback in an internal call for the NASA ROSES grant late last year, and I'm happy to chat about this on a video call and present my slides about it if anyone is interested before I propose the PR to get early feedback.