scan pandas improvement

acquamarin commented 11 months ago

[ ] A pandas DataFrame can be a dictionary. (DuckDB binds it using numpy::bind)
[x] Support numpy array of objects.
[x] Implement import/variable cache mechanism in pybind to optimize performance.
[ ] Support category type.

prrao87 commented 8 months ago

@acquamarin and @ray6080, I'm wondering if this feature can be prioritized? I'm looking at showcasing the pandas DataFrame scan functionality via a Kùzu byte, but in my view this functionality is not very useful right now because it can only handle integer and float arrays (the example code below fails when you input a list of strings).

Also, from a usability and DevEx perspective, I think we should auto-cast Python lists to the corresponding numpy type, without the user having to manually specify numpy arrays, like we currently show in the docs.

import pandas as pd

# Assumes `conn` is an already-open Kùzu connection.
person = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6, 7],
        # "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Fred", "George"],
        "age": [42, 23, 33, 57, 67, 39, 11],
        "height": [167, 172, 183, 199, 149, 154, 165],
        "is_student": [False, True, False, False, False, False, True],
    }
)

result = conn.execute(
    """
    CALL READ_PANDAS("person")
    RETURN age as age, height / 2.54 as height_in_inch
    """
).get_as_df()

"""
The above code fails when we uncomment the string column.
"""

To make this a useful feature, I recommend that we allow the user to specify all inputs to a pandas DataFrame as Python lists, rather than manually converting to numpy arrays on their end. If there's an error in casting the Python type to a numpy array (e.g., if the user mistakenly adds a mixed-type list such as [1, "a"]), the error message should be informative enough to show that there's a type issue, so it might make sense to perform this check on the Python side.
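
As a rough illustration of the Python-side check suggested above, here is a minimal sketch. The helper name `infer_column_types` is hypothetical and not part of Kùzu or pandas. It only validates a dict of Python lists before any numpy conversion, and it assumes that mixing int and float in one column is acceptable while any other mix is an error.

```python
def infer_column_types(columns):
    """Return {column name: element type}, raising TypeError on mixed-type columns.

    Hypothetical helper for illustration only; not an existing Kùzu API.
    """
    inferred = {}
    for name, values in columns.items():
        # Distinct element types, ignoring None so missing values do not count as a type.
        kinds = {type(v) for v in values if v is not None}
        # Assumption: int and float may share a column, since they promote cleanly to float.
        if kinds == {int, float}:
            kinds = {float}
        if len(kinds) > 1:
            found = ", ".join(sorted(t.__name__ for t in kinds))
            raise TypeError(
                f"Column '{name}' mixes element types ({found}); "
                "each column must hold a single type."
            )
        # An empty or all-None column has no inferable type.
        inferred[name] = kinds.pop() if kinds else None
    return inferred


person = {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "is_student": [False, True, False],
}
types = infer_column_types(person)

try:
    infer_column_types({"bad": [1, "a"]})
except TypeError as err:
    # The message names the offending column and the types found in it.
    message = str(err)
```

Running a check like this before the data is handed to the scan would let the error name the offending column, rather than surfacing as a lower-level casting failure.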

ray6080 commented 8 months ago

@prrao87 Yeah, I totally agree with this. We should prioritize support for this to make the feature more usable. I think we can discuss with @acquamarin how much needs to be done, and when we can schedule it so that it is done before the next release.

kuzudb / kuzu

scan pandas improvement #2405