Open OyiboRivers opened 1 month ago
Thank you for the xref, @s-banach.
I just want to point out a difference between conversion of numbers and strings.
Numeric dtypes: The conversion is efficient because NumPy can directly use the same memory layout as Polars. The base shows the original Polars data structure, meaning no extra copy was needed.
import numpy as np
import polars as pl
numbers = pl.Series(values=[1, 2, 3], dtype=pl.Float64)
arr_f64 = numbers.to_numpy()
print(arr_f64.dtype) # float64
print(arr_f64.base) # <builtins.PySeries object at 0x7f65dc231710>
String dtypes: Converting to a specialized string dtype involves more work. The base being None confirms that an additional copy has occurred.
strings = pl.Series(values=['A', 'B', 'C'], dtype=pl.String)
dtype = np.dtypes.StringDType(na_object=np.nan)
arr_obj = strings.to_numpy()
print(arr_obj.dtype) # object
print(arr_obj.base) # <builtins.PySliceContainer object at 0x7f65ae7af7b0>
arr_str = strings.to_numpy().astype(dtype, copy=False)
print(arr_str.dtype) # StringDType(na_object=nan)
print(arr_str.base) # None
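As a rough illustration of what `base` does and does not tell you, here is a minimal NumPy-only sketch (no Polars involved, and the variable names are made up for the example). It assumes that `np.shares_memory` is an acceptable, more direct check for "was a copy made" than inspecting `base` alone.

```python
import numpy as np

src = np.array([1.0, 2.0, 3.0])

# A slice is a view: it keeps a reference to the owning array in .base.
view = src[:]
print(view.base is src)                  # True
print(np.shares_memory(src, view))       # True

# astype(copy=False) hands back the input itself when no conversion is needed.
same = src.astype(np.float64, copy=False)
print(same is src)                       # True

# A real dtype change has to allocate new memory, so the result owns its data.
converted = src.astype(np.float32, copy=False)
print(converted.base)                    # None
print(np.shares_memory(src, converted))  # False
```

So `copy=False` only means "avoid a copy if possible"; when the target dtype differs, as in the string case above, a new buffer is still allocated.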
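Specifying dtypes in to_numpy could be useful for the interaction with heterogeneous data types, too.
The interaction between structured NumPy arrays and Polars DataFrames is straightforward. You can switch between the structures using from_numpy and to_numpy. There are two main disadvantages, both visible in the code below: the individual columns (fields) of a structured array are not contiguous in memory, and to_numpy(structured=True) cannot produce the structured array without copying the data.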
import polars as pl
import numpy as np
numbers = [1, 2, 3]
strings = ['A', 'B', 'C']
# *** Structured Arrays ***
struct_dtype = np.dtype([('A', 'i8'), ('B', '<U1')])
col_1 = struct_dtype.names[0]
structured_array = np.array(list(zip(numbers, strings)), struct_dtype)
print('Layout of structured array:')
print(f"C-order: {structured_array.flags['C_CONTIGUOUS']}") # True
print(f"F-order: {structured_array.flags['F_CONTIGUOUS']}") # True
print('Layout of structured array columns:')
print(f"C-order: {structured_array[col_1].flags['C_CONTIGUOUS']}") # False
print(f"F-order: {structured_array[col_1].flags['F_CONTIGUOUS']}") # False
# Polars - NumPy interop
df = pl.from_numpy(structured_array)
new_structured_array = df.to_numpy(structured=True, allow_copy=True)
print(new_structured_array.base) # None
# new_structured_array = df.to_numpy(structured=True, allow_copy=False)
# RuntimeError: copy not allowed: cannot create structured array without copying data
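To make the non-contiguous columns a bit more concrete, here is a small NumPy-only sketch that inspects the strides of one field. It reuses the same dtype as above; the names `arr`, `field`, and `packed` are just for illustration.

```python
import numpy as np

struct_dtype = np.dtype([('A', 'i8'), ('B', '<U1')])
arr = np.array([(1, 'A'), (2, 'B'), (3, 'C')], dtype=struct_dtype)

# Each record holds 8 bytes for 'A' plus 4 bytes for 'B' (one UCS-4 character).
print(struct_dtype.itemsize)          # 12

field = arr['A']
# The field is a view that steps over whole records, not over 8-byte integers.
print(field.strides)                  # (12,)
print(field.dtype.itemsize)           # 8
print(field.base is arr)              # True

# Getting a contiguous column therefore requires a copy.
packed = np.ascontiguousarray(field)
print(packed.strides)                 # (8,)
print(np.shares_memory(arr, packed))  # False
```

Because the records are interleaved, a columnar buffer cannot be reinterpreted as a structured array (or the other way round) without moving data, which is presumably why allow_copy=False fails here.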
A way to circumvent the shortcomings could be to use structured subarrays with contiguous columns. Unfortunately, there seems to be no explicit way to transform structured subarrays into a DataFrame and the DataFrame back into structured subarrays without some manual intervention.
# *** Structured Subarrays ***
subarray_dtype = np.dtype([('A', 'i8', (3,)), ('B', '<U1', (3,))])
col_1 = subarray_dtype.names[0]
struct_subarrays = np.array([(numbers, strings)], subarray_dtype)
print('Layout of structured subarrays:')
print(f"C-order: {struct_subarrays.flags['C_CONTIGUOUS']}") # True
print(f"F-order: {struct_subarrays.flags['F_CONTIGUOUS']}") # True
print('Layout of subarrays:')
print(f"C-order: {struct_subarrays[col_1].flags['C_CONTIGUOUS']}") # True
print(f"F-order: {struct_subarrays[col_1].flags['F_CONTIGUOUS']}") # True
# Polars - NumPy interop
df = pl.DataFrame(struct_subarrays).explode(subarray_dtype.names)
new_struct_subarrays = np.empty(1, subarray_dtype)
for col in subarray_dtype.names:
    new_struct_subarrays[col] = df[col].to_numpy(allow_copy=True)
    # new_struct_subarrays[col] = df[col].to_numpy(allow_copy=False)
    # RuntimeError: copy not allowed: cannot convert to a NumPy array without copying data
print(new_struct_subarrays.base) # None
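The manual step could be wrapped in a pair of small helpers. This is only a sketch: `pack_columns` and `unpack_columns` are hypothetical names, not part of Polars or NumPy, and plain NumPy arrays stand in for the DataFrame columns here. With Polars, the dictionary would presumably be built from `df[col].to_numpy()` for each column.

```python
import numpy as np

def pack_columns(columns, dtype):
    """Copy equal-length 1-D columns into a single record of subarrays."""
    out = np.empty(1, dtype)
    for name in dtype.names:
        out[name] = columns[name]  # copies into the record's own buffer
    return out

def unpack_columns(record):
    """Return one 1-D view per field; no data is copied."""
    return {name: record[name][0] for name in record.dtype.names}

subarray_dtype = np.dtype([('A', 'i8', (3,)), ('B', '<U1', (3,))])
columns = {'A': np.array([1, 2, 3]), 'B': np.array(['A', 'B', 'C'])}

record = pack_columns(columns, subarray_dtype)
views = unpack_columns(record)
print(views['A'])                            # [1 2 3]
print(views['A'].flags['C_CONTIGUOUS'])      # True
print(np.shares_memory(record, views['A']))  # True
```

Packing still copies once, but unpacking yields contiguous per-column views, which is the property the plain structured array lacks.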
Description
The "to_numpy" method in Polars currently converts a Series or DataFrame to a NumPy array, but it doesn't allow specifying the desired NumPy dtype during conversion. Please enhance the "to_numpy" method so that the resulting dtype can be specified.