pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`ComputeError` when reading a CSV with [square brackets] in file path #19801

Open aofarrel opened 5 hours ago

aofarrel commented 5 hours ago

Reproducible example

import polars as pl
tsv_1 = "./inputs/Merker_2022 (runindexed)/data.tsv"
tsv_2 = "./inputs/Merker_2022 [runindexed]/data.tsv"

# proves that the files are accessible and identical
print("Contents of tsv_1:")
print(open(tsv_1, "r").read())
print("Contents of tsv_2:")
print(open(tsv_2, "r").read())

print(pl.read_csv(tsv_1, separator='\t', ignore_errors=True))  # this works
print(pl.read_csv(tsv_2, separator='\t', ignore_errors=True))  # this throws the error

Log output

Contents of tsv_1:
run_index   geoloc_name date_collection
ERR108514   Russia: Samara  -
ERR108499   Russia: Samara  -
ERR234597   Russia: Samara  -
ERR133815   Russia: Samara  -
ERR067584   Russia: Samara  -
ERR133837   Russia: Samara  -
ERR067723   Russia: Samara  -
SRR1163081  Belarus 2011
SRR1162980  Belarus 2010
SRR1163178  Belarus 2009
SRR1162977  Belarus 2010
Contents of tsv_2:
run_index   geoloc_name date_collection
ERR108514   Russia: Samara  -
ERR108499   Russia: Samara  -
ERR234597   Russia: Samara  -
ERR133815   Russia: Samara  -
ERR067584   Russia: Samara  -
ERR133837   Russia: Samara  -
ERR067723   Russia: Samara  -
SRR1163081  Belarus 2011
SRR1162980  Belarus 2010
SRR1163178  Belarus 2009
SRR1162977  Belarus 2010
shape: (11, 3)
┌────────────┬────────────────┬─────────────────┐
│ run_index  ┆ geoloc_name    ┆ date_collection │
│ ---        ┆ ---            ┆ ---             │
│ str        ┆ str            ┆ str             │
╞════════════╪════════════════╪═════════════════╡
│ ERR108514  ┆ Russia: Samara ┆ -               │
│ ERR108499  ┆ Russia: Samara ┆ -               │
│ ERR234597  ┆ Russia: Samara ┆ -               │
│ ERR133815  ┆ Russia: Samara ┆ -               │
│ ERR067584  ┆ Russia: Samara ┆ -               │
│ …          ┆ …              ┆ …               │
│ ERR067723  ┆ Russia: Samara ┆ -               │
│ SRR1163081 ┆ Belarus        ┆ 2011            │
│ SRR1162980 ┆ Belarus        ┆ 2010            │
│ SRR1163178 ┆ Belarus        ┆ 2009            │
│ SRR1162977 ┆ Belarus        ┆ 2010            │
└────────────┴────────────────┴─────────────────┘
Traceback (most recent call last):
  File "/Users/aofarrel/github/ranchero/bug.py", line 12, in <module>
    print(pl.read_csv(tsv_2, separator='\t', ignore_errors=True))  # this throws the error
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/io/csv/functions.py", line 508, in read_csv
    df = _read_csv_impl(
         ^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/io/csv/functions.py", line 641, in _read_csv_impl
    return scan.collect()
           ^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2021, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: expected at least 1 source

Issue description

In my case, data.tsv (identical at both paths) is a valid three-column TSV file, but this seems to happen with basically any TSV/CSV as long as there are square brackets somewhere in the path. The file is attached (extension changed to .txt to keep GitHub happy).

data.tsv.txt

Expected behavior

The behavior of

print(pl.read_csv("./inputs/Merker_2022 (runindexed)/data.tsv", separator='\t', ignore_errors=True))

and

print(pl.read_csv("./inputs/Merker_2022 [runindexed]/data.tsv", separator='\t', ignore_errors=True))

should be identical, just as they are when opening the files with Python's built-in open(). If Polars can't accept square brackets in a path, it should raise an error saying so when brackets are present, or at least raise a file-not-found error.

Installed versions

```
--------Version info---------
Polars:              1.13.1
Index type:          UInt32
Platform:            macOS-13.6.7-x86_64-i386-64bit
Python:              3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib           3.8.2
nest_asyncio
numpy                1.25.2
openpyxl
pandas               2.2.2
pyarrow              17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

cmdlineluser commented 4 hours ago

It's due to [] being glob characters and glob=True being the default.

pl.DataFrame({"x": [1]}).write_csv("[foo].csv")

pl.read_csv("[foo].csv", glob=False)
# shape: (1, 1)
# ┌─────┐
# │ x   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# └─────┘
pl.read_csv("[foo].csv")
# ComputeError: expected at least 1 source

I'm not sure whether some sort of "Hint: did you mean glob=False?" message could be added for the case where glob characters are present but no files are matched.
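
For illustration, the character-class behaviour can be reproduced with Python's standard fnmatch and glob modules. This is only an analogy: Polars expands globs in its own engine, and its matching rules are assumed here to resemble the shell-style rules below. The directory name is taken from the report above.

```python
import fnmatch
import glob

name = "Merker_2022 [runindexed]"

# As a pattern, "[runindexed]" is a character class: it matches exactly ONE
# character out of r, u, n, i, d, e, x, so the literal name does not match itself.
matches_itself = fnmatch.fnmatchcase(name, name)
matches_single_char = fnmatch.fnmatchcase("Merker_2022 r", name)

# glob.escape() wraps the special characters so they are taken literally.
escaped = glob.escape(name)
matches_when_escaped = fnmatch.fnmatchcase(name, escaped)

print(matches_itself, matches_single_char, escaped, matches_when_escaped)
```

Whether Polars accepts a pattern escaped this way is not established in this thread; passing glob=False is the workaround demonstrated above.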

aofarrel commented 3 hours ago

Oh, that explains it. Yeah, I think that kind of hint would work, or at least changing the error to a more straightforward file-not-found.
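
A minimal sketch of how such a hint could be decided, written in plain Python rather than reflecting how Polars is actually implemented. `no_match_hint` is a hypothetical helper, and the message wording is an assumption, not something Polars emits.

```python
import glob
import os
import tempfile


def no_match_hint(path):
    """Return a hint if `path` matches nothing as a glob but exists literally.

    Hypothetical helper for illustration only; not part of Polars.
    """
    has_glob_chars = any(ch in path for ch in "*?[")
    if has_glob_chars and not glob.glob(path) and os.path.exists(path):
        return (
            f"no files matched {path!r} as a glob pattern, but a file with "
            "that exact name exists; did you mean glob=False?"
        )
    return None


# Demo: a file literally named "[foo].csv" in a fresh temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    bracket_path = os.path.join(tmp, "[foo].csv")
    plain_path = os.path.join(tmp, "foo.csv")
    for p in (bracket_path, plain_path):
        with open(p, "w") as fh:
            fh.write("x\n1\n")

    bracket_hint = no_match_hint(bracket_path)  # glob finds nothing, file exists
    plain_hint = no_match_hint(plain_path)      # no glob characters, no hint

print(bracket_hint)
print(plain_hint)
```

The check only fires when the path contains glob characters, the pattern matches nothing, and the literal path exists, so ordinary missing files would still get a plain not-found error.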