
Specifying NumPy dtype in to_numpy method #17620

Open · OyiboRivers opened this issue 1 month ago

OyiboRivers commented 1 month ago

Description

The "to_numpy" method in Polars currently converts a Series or DataFrame to a NumPy array, but it doesn't allow specifying the desired NumPy dtype during conversion. Please enhance the "to_numpy" method to specify the resulting dtype.

# Example:
import numpy as np
import polars as pl

series = pl.Series(name='data', values=['A', 'B', 'C'], dtype=pl.String)
dtype = np.dtypes.StringDType(na_object=np.nan)

# present conversion
array = series.to_numpy().astype(dtype)
# array(['A', 'B', 'C'], dtype=StringDType(na_object=nan))

# desired conversion
array = series.to_numpy(dtype=dtype)
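
Until a dtype parameter exists, the desired call can be approximated with a small wrapper; to_numpy_with_dtype below is a hypothetical helper built on the existing to_numpy and astype, not part of the Polars API.

# Possible workaround (sketch):
def to_numpy_with_dtype(s: pl.Series, dtype, *, allow_copy: bool = True) -> np.ndarray:
    # Hypothetical helper: convert first, then cast to the requested dtype.
    # astype(copy=False) returns the input array unchanged when the dtype
    # already matches, so only a genuine cast pays for an extra copy.
    return s.to_numpy(allow_copy=allow_copy).astype(dtype, copy=False)

array = to_numpy_with_dtype(series, dtype)
# array(['A', 'B', 'C'], dtype=StringDType(na_object=nan))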
s-banach commented 1 month ago

#7283

OyiboRivers commented 1 month ago

Thank you for the xref, @s-banach.

I just want to point out a difference between conversion of numbers and strings.

Numeric dtypes: The conversion is efficient because NumPy can directly use the same memory layout as Polars. The array's base attribute references the original Polars data structure, meaning no extra copy was needed.

import numpy as np
import polars as pl

numbers = pl.Series(values=[1, 2, 3], dtype=pl.Float64)
arr_f64 = numbers.to_numpy()
print(arr_f64.dtype)    # float64
print(arr_f64.base)     # <builtins.PySeries object at 0x7f65dc231710>
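
As a side note (this is plain NumPy astype semantics, not a Polars API), astype(copy=False) only keeps that zero-copy view when the requested dtype already matches; a quick check:

arr_same = numbers.to_numpy().astype(np.float64, copy=False)
print(arr_same.base is None)  # False: same buffer, still backed by the Polars series
arr_f32 = numbers.to_numpy().astype(np.float32, copy=False)
print(arr_f32.base)           # None: casting to a different dtype forces a copy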

String dtypes: Converting to a specialized string dtype involves more work. The base being None confirms that an additional copy has occurred.

strings = pl.Series(values=['A', 'B', 'C'], dtype=pl.String)
dtype = np.dtypes.StringDType(na_object=np.nan)

arr_obj = strings.to_numpy()
print(arr_obj.dtype)    # object
print(arr_obj.base)     # <builtins.PySliceContainer object at 0x7f65ae7af7b0>

arr_str = strings.to_numpy().astype(dtype, copy=False)
print(arr_str.dtype)    # StringDType(na_object=nan)
print(arr_str.base)     # None
OyiboRivers commented 1 month ago

Specifying dtypes in to_numpy could also be useful when interacting with heterogeneous data types.

The interaction between structured NumPy arrays and Polars DataFrames is straightforward: you can switch between the two using from_numpy and to_numpy. There are two main disadvantages, as the example below shows: the columns of a structured array are not contiguous in memory, and converting a DataFrame back to a structured array always requires a copy.

import polars as pl
import numpy as np

numbers = [1, 2, 3]
strings = ['A', 'B', 'C']

# *** Structured Arrays ***
struct_dtype = np.dtype([('A', 'i8'), ('B', '<U1')])
col_1 = struct_dtype.names[0]
structured_array = np.array(list(zip(numbers, strings)), struct_dtype)

print('Layout of structured array:')
print(f"C-order: {structured_array.flags['C_CONTIGUOUS']}")  # True
print(f"F-order: {structured_array.flags['F_CONTIGUOUS']}")  # True

print('Layout of structured array columns:')
print(f"C-order: {structured_array[col_1].flags['C_CONTIGUOUS']}")  # False
print(f"F-order: {structured_array[col_1].flags['F_CONTIGUOUS']}")  # False

# Polars - NumPy interop
df = pl.from_numpy(structured_array)
new_structured_array = df.to_numpy(structured=True, allow_copy=True)
print(new_structured_array.base)  # None
# new_structured_array = df.to_numpy(structured=True, allow_copy=False)
# RuntimeError: copy not allowed: cannot create structured array without copying data

A way to circumvent these shortcomings could be to use structured subarrays, where the columns stay contiguous. Unfortunately, there seems to be no explicit way to transform structured subarrays into a DataFrame and the DataFrame back into structured subarrays without some manual intervention.

# *** Structured Subarrays ***
subarray_dtype = np.dtype([('A', 'i8', (3,)), ('B', '<U1', (3,))])
col_1 = subarray_dtype.names[0]
struct_subarrays = np.array([(numbers, strings)], subarray_dtype)

print('Layout of structured subarrays:')
print(f"C-order: {struct_subarrays.flags['C_CONTIGUOUS']}")  # True
print(f"F-order: {struct_subarrays.flags['F_CONTIGUOUS']}")  # True

print('Layout of subarrays:')
print(f"C-order: {struct_subarrays[col_1].flags['C_CONTIGUOUS']}")  # True
print(f"F-order: {struct_subarrays[col_1].flags['F_CONTIGUOUS']}")  # True

# Polars - NumPy interop
df = pl.DataFrame(struct_subarrays).explode(subarray_dtype.names)
new_struct_subarrays = np.empty(1, subarray_dtype)
for col in subarray_dtype.names:
    new_struct_subarrays[col] = df[col].to_numpy(allow_copy=True)
    # new_struct_subarrays[col] = df[col].to_numpy(allow_copy=False)
    # RuntimeError: copy not allowed: cannot convert to a NumPy array without copying data
print(new_struct_subarrays.base)  # None
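
For convenience, the manual loop above could be folded into a small helper; df_to_subarrays is a hypothetical sketch (not Polars API) and assumes a single record whose subarray field shapes match the DataFrame's column lengths.

def df_to_subarrays(df: pl.DataFrame, dtype: np.dtype) -> np.ndarray:
    # Hypothetical helper: pack each DataFrame column into the matching
    # subarray field of a single structured record (one copy per column).
    out = np.empty(1, dtype)
    for name in dtype.names:
        out[name] = df[name].to_numpy()
    return out

new_struct_subarrays = df_to_subarrays(df, subarray_dtype)
print(new_struct_subarrays.base)  # None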