pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

PanicException: index: 8449 out of bounds for len: 1 when using scan csv with schema and include_file_paths #18257

Closed djouallah closed 1 week ago

djouallah commented 2 months ago

Reproducible example

public notebook https://colab.research.google.com/drive/1XpcKetkpN86XXx-rEhPzdL9cEAHevtEH?usp=sharing

Log output

PanicException                            Traceback (most recent call last)
<ipython-input-8-e6e2c1b6eee6> in <cell line: 6>()
      4 list_files=[os.path.basename(x) for x in glob.glob(Source+'*.CSV')]
      5 files_to_upload_full_Path = [Source + i for i in list_files][:total_files]
----> 6 polars_clean_csv(files_to_upload_full_Path)

1 frames
/usr/local/lib/python3.10/dist-packages/polars/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, engine, background, _eager, **_kwargs)
   2025         # Only for testing purposes
   2026         callback = _kwargs.get("post_opt_callback", callback)
-> 2027         return wrap_df(ldf.collect(callback))
   2028 
   2029     @overload

PanicException: index: 8449 out of bounds for len: 1

Issue description

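Collecting a lazy scan_csv query that sets both schema and include_file_paths raises the PanicException shown above.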

Expected behavior

The query should collect without panicking.

Installed versions

1.5.0

djouallah commented 2 months ago

@ritchie46 With 1.6 I get a different panic:

   2032         # Only for testing purposes
   2033         callback = _kwargs.get("post_opt_callback", callback)
-> 2034         return wrap_df(ldf.collect(callback))
   2035 
   2036     @overload

PanicException: index out of bounds: the len is 52 but the index is 52

nameexhaustion commented 2 months ago

@djouallah Can you provide a backtrace (set RUST_BACKTRACE=1 in the environment before importing polars)?
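
For readers following along, one way to apply this from inside a Python script or notebook cell is sketched below. It assumes the variable only needs to be in the process environment before polars is imported, as the comment above suggests; the polars import is left commented out so the snippet stands on its own.

```python
import os

# Set the variable first, before polars is imported, as requested above.
# Rust's panic handler consults RUST_BACKTRACE to decide whether to print
# a backtrace along with the panic message.
os.environ["RUST_BACKTRACE"] = "1"

# Import polars only after the variable is set.
# import polars as pl

print(os.environ["RUST_BACKTRACE"])
```

From a shell, the equivalent is to export RUST_BACKTRACE=1 before starting Python.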

djouallah commented 2 months ago

@nameexhaustion I did provide a notebook in Colab that reproduces it. Could you just run it there?

nameexhaustion commented 2 months ago

I'm not able to get the backtrace from the colab environment - could you try and make a small script that reproduces locally?
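
A small local script along these lines might look like the sketch below. It is not the reporter's notebook: the file names, the columns a/b/c, the all-string schema and both helper functions are hypothetical, and there is no guarantee that such tiny files trigger the panic. It only combines the two options named in the issue title, schema and include_file_paths, on a list of CSV files.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csvs(directory=None, n_files=2):
    """Write n_files tiny CSV files (made-up data) and return their paths."""
    directory = Path(directory or tempfile.mkdtemp())
    paths = []
    for i in range(n_files):
        path = directory / f"sample_{i}.CSV"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["a", "b", "c"])              # header row
            writer.writerow([f"a{i}", f"b{i}", f"c{i}"])  # one data row
        paths.append(str(path))
    return paths


def collect_with_paths(paths):
    """Scan the files with an explicit schema plus include_file_paths."""
    # Imported here so that RUST_BACKTRACE can be set before polars loads.
    import polars as pl

    # Hypothetical schema: every column is read as a string.
    schema = {"a": pl.String, "b": pl.String, "c": pl.String}
    lf = pl.scan_csv(paths, schema=schema, include_file_paths="path_name")
    return lf.collect()
```

Calling collect_with_paths(write_sample_csvs()) with RUST_BACKTRACE=1 set would then either return a DataFrame or print the panic together with a backtrace, depending on the polars version and on whether this reduced input hits the same code path.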

djouallah commented 1 month ago

Sorry about that. Even locally, I could not get a backtrace.

ritchie46 commented 2 weeks ago

@nameexhaustion it is related to include_file_paths.

I also see that include_file_paths doesn't respect the projection pushdown:

import polars as pl

q = pl.scan_csv("a,b,c\na1,b1,c1".encode(), include_file_paths="path_name").select(["a", "b"])

print(q.collect())

shape: (1, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ path_name │
│ --- ┆ --- ┆ ---       │
│ str ┆ str ┆ str       │
╞═════╪═════╪═══════════╡
│ a1  ┆ b1  ┆ in-mem    │
└─────┴─────┴───────────┘

nameexhaustion commented 1 week ago

The 2nd panic issue will be tracked at https://github.com/pola-rs/polars/issues/19397

djouallah commented 1 week ago

Wonderful, it is fixed now!