narwhals-dev / narwhals

Lightweight and extensible compatibility layer between dataframe libraries!
https://narwhals-dev.github.io/narwhals/
MIT License

[Bug]: TypeError when casting count result to int64 in Narwhals during TemporalScope integration #1427

Closed philip-ndikum closed 3 days ago

philip-ndikum commented 3 days ago

Describe the bug

We are currently integrating Narwhals into TemporalScope, an explainable AI (XAI) library designed for temporal feature importance analysis. Narwhals serves as the backend for supporting multiple DataFrame implementations (e.g., Pandas, Polars, Modin) in a unified, backend-agnostic manner.

During this integration, we encountered a TypeError when performing a row count operation. Specifically, when attempting to cast the result of count() to "int64" using Narwhals' backend-agnostic API in the _get_row_count method, the following error is raised:

TypeError: issubclass() arg 1 must be a class

This issue affects TemporalScope's SingleStepTargetShifter, which is responsible for shifting target variables in time series data as part of its fit/transform workflow. The error prevents proper row counting and disrupts TemporalScope's compatibility with Narwhals. It occurs consistently across all tested backends (e.g., Pandas, Modin, Polars).
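For context, the message itself is the one Python's built-in `issubclass` raises whenever its first argument is not a class. The sketch below is independent of Narwhals and only illustrates that mechanism. It assumes, without confirmation from this thread, that the string `"int64"` ends up in such a check; `Int64` and `is_int64` are hypothetical stand-ins, not Narwhals internals.

```python
class Int64:
    """Hypothetical stand-in for a dtype class (not the Narwhals one)."""


def is_int64(dtype):
    # issubclass() requires a class as its first argument.
    return issubclass(dtype, Int64)


# Passing a class works as expected.
print(is_int64(Int64))

# Passing a string instead of a class raises the TypeError from the report.
try:
    is_int64("int64")
except TypeError as exc:
    message = str(exc)
    print(f"TypeError: {message}")
```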

Steps or code to reproduce the bug

To reproduce the issue, follow these steps:

  1. Open the public Google Colab notebook: Narwhals_TypeError_Reproduction.
  2. Run all the cells. The notebook creates a simple Pandas DataFrame, converts it to a Narwhals-compatible DataFrame, and performs a row count operation.
  3. Observe the TypeError raised during the casting step.

Code to reproduce:

import narwhals as nw
import pandas as pd

# Create a Pandas DataFrame
data = {"col1": [1, 2, 3, None], "col2": ["a", "b", "c", "d"]}
df_native = pd.DataFrame(data)

# Convert to Narwhals-compatible DataFrame
df_narwhals = nw.from_native(df_native)

# Perform row count operation
row_count_expr = nw.col("col1").count().cast("int64").alias("row_count")
row_count_result = df_narwhals.select([row_count_expr])

# Access the scalar result
row_count = row_count_result.item()
print("Row Count:", row_count)

Expected results

The row count operation should complete successfully, returning the count of non-null values in col1 (3 of the 4 rows in this example) cast to a 64-bit integer. No error should occur.

Example expected output:

Row Count: 3

Actual results

The following error is raised during the execution of the row count operation:

Error encountered: TypeError: issubclass() arg 1 must be a class

Please run narwhals.show_versions()

Provide the output of this command:
```python
import narwhals as nw

# show_versions() prints its report directly, so no print() wrapper is needed
nw.show_versions()
```

```shell
System:
    python: 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]
executable: /usr/bin/python3
   machine: Linux-6.1.85+-x86_64-with-glibc2.35

Python dependencies:
     narwhals: 1.14.1
       pandas: 2.2.2
       polars: 1.9.0
         cudf: 
        modin: 
      pyarrow: 17.0.0
        numpy: 1.26.4
```

Relevant log output

Error encountered: issubclass() arg 1 must be a class

MarcoGorelli commented 3 days ago

thanks @philip-ndikum for your report!

the syntax is cast(nw.Int64)

However, the error message should be improved - will work on that, thanks 🙏