pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG (string): construction of Series / Index fails from dict keys when "str" dtype is specified explicitly #60343

Open jorisvandenbossche opened 1 week ago

jorisvandenbossche commented 1 week ago

When not specifying a dtype (inferring the type), construction of Index or Series from dict keys goes fine:

>>> pd.options.future.infer_string = True
>>> d = {"a": 1, "b": 2}
>>> pd.Index(d.keys())
Index(['a', 'b'], dtype='str')

But if you explicitly specify the dtype, then it fails:

>>> pd.Index(d.keys(), dtype="str")
...

File ~/scipy/repos/pandas/pandas/core/arrays/string_arrow.py:206, in ArrowStringArray._from_sequence(cls, scalars, dtype, copy)
    203     return cls(pc.cast(scalars, pa.large_string()))
    205 # convert non-na-likes to str
--> 206 result = lib.ensure_string_array(scalars, copy=copy)
    207 return cls(pa.array(result, type=pa.large_string(), from_pandas=True))

File lib.pyx:727, in pandas._libs.lib.ensure_string_array()

File lib.pyx:822, in pandas._libs.lib.ensure_string_array()

ValueError: Buffer has wrong number of dimensions (expected 1, got 0)

The reason is that at that point we pass the data directly to the dtype's array _from_sequence instead of first pre-processing the data into a numpy array, and _from_sequence, which calls ensure_string_array directly, doesn't seem to be able to handle dict keys. (We do call np.asarray(..) inside ensure_string_array, so I'm not entirely sure what is going wrong.)
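One possible explanation for the "expected 1, got 0" message, sketched here with plain NumPy rather than pandas internals: a dict_keys view is iterable but not a sequence, so np.asarray wraps the whole view in a 0-dimensional object array instead of iterating over it. Whether this is exactly what happens inside ensure_string_array is an assumption; the snippet only shows the NumPy behaviour.

```python
import numpy as np

d = {"a": 1, "b": 2}

# dict_keys is iterable but not a sequence (no __getitem__), so NumPy
# treats the view as a single opaque object: a 0-dimensional object array.
as_view = np.asarray(d.keys())
print(as_view.ndim, as_view.dtype)  # 0 object

# Materializing the keys into a list first gives a 1-dimensional array.
as_list = np.asarray(list(d.keys()))
print(as_list.ndim, as_list.shape)  # 1 (2,)
```

A 0-dimensional array passed where a 1-dimensional buffer is expected would be consistent with the ValueError in the traceback above.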

tasfia8 commented 1 week ago

Hi Joris! If I fix this, I could send you a PR. Would you be able to merge it, or give suggestions on it so that it can be merged? I have a school assignment with a deadline that requires working on an open source good first issue where the owner merges my PR at the end. Could you assign this to me and help me? I am a 4th year Computer Engineering major.

tasfia8 commented 1 week ago

Also, would you be able to tell me what files I should look at for this so I can start? Do I fork the main branch?

KevsterAmp commented 1 week ago

Hi @tasfia8, kindly check the contributing docs for guidance on GitHub issue assignment, the proper format of PRs, etc.: https://pandas.pydata.org/docs/development/contributing.html

I recommend working on an issue labeled "good first issue", since those issues mostly involve simple fixes that are suitable for first-time contributors.

tasfia8 commented 6 days ago

I have already started working on this. Would you be able to assign it to me? I think I can do it, and I have read the contributing files. Thank you.

KevsterAmp commented 6 days ago

@tasfia8 - instructions for issue assignment can be found in the contributing docs

tasfia8 commented 6 days ago

take

tasfia8 commented 6 days ago

@jorisvandenbossche I think I have figured it out, and just wanted to show both you and @KevsterAmp before I make a PR. I will open a PR soon and let you know. I get this as output now; is this what you are expecting? I have added test cases as well, and all existing test cases still pass. Output:

[Screenshot 2024-11-19 at 3:17:08 AM]

The issue was that dict_keys was passed directly to the StringDtype's _from_sequence method, which could not handle non-array-like inputs like dict_keys. The fix involved updating the handling of dict_keys during the construction of an Index or Series.
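To illustrate the general idea (not the actual code in the PR), here is a minimal sketch of converting dict views into lists before handing them to a constructor. The helper name materialize_dict_view is hypothetical and is not part of pandas.

```python
from collections.abc import MappingView


def materialize_dict_view(data):
    """Hypothetical helper: turn dict views into lists, pass other input through."""
    # dict.keys(), dict.values() and dict.items() are all MappingView instances.
    if isinstance(data, MappingView):
        return list(data)
    return data


d = {"a": 1, "b": 2}
keys = materialize_dict_view(d.keys())         # dict_keys becomes a list
values = materialize_dict_view(d.values())     # dict_values becomes a list
unchanged = materialize_dict_view(["x", "y"])  # lists are returned as-is
```

Until a fix lands, calling list(d.keys()) by hand before passing the data together with dtype="str" should presumably work around the error in the same way, since a list is a regular sequence.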

jorisvandenbossche commented 5 days ago

@tasfia8 apologies for the slow response. The output you show is indeed the expected behaviour. I think the easiest will be to make a PR, so we can see the code and more easily give feedback. Feel free to mark the PR as "draft" if you are unsure whether it is ready; we can then already take a look.

tasfia8 commented 4 days ago

Done @jorisvandenbossche. Please see https://github.com/pandas-dev/pandas/pull/60383.