kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

ibis.FileDataset read files from web #918

Open mark-druffel opened 3 weeks ago

mark-druffel commented 3 weeks ago

Description

ibis.FileDataset fails when trying to read files from certain domains (or possibly all domains) on the web. With ibis on its own, I can read data from Hugging Face:

import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")

Based on a conversation with @deepyaman, the above code might only work because DuckDB has a Hugging Face extension. That said, for what it's worth, I found a GitHub repo with the same file, tried to read it with ibis, and it worked:

import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("https://raw.githubusercontent.com/seanwryan/DS210-Final-Project/refs/heads/main/spotify.csv")

However, when adding either file (hf:// or raw.githubusercontent.com) as a FileDataset, my pipeline fails:

tracks:
  type: ibis.FileDataset
  filepath: hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv
  file_format: csv
  connection: ${connection:spotify}
  load_args:
    sep: ","
  save_args:
    materialized: view
    overwrite: True

kedro.io.core.DatasetError: Failed while loading data from dataset FileDataset(backend=duckdb, file_format=csv, filepath=hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv, load_args={'sep': ,}, save_args={'materialized': view, 'overwrite': True}). IO Error: No files found that match the pattern "/spotify/hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv":

The issue appears to be with _get_load_path().
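To illustrate how the path in the error message could arise, here is a minimal sketch using only pathlib. It does not reproduce the dataset's actual code. The `/spotify` base directory is copied from the error message, and how the dataset arrives at it is an assumption.

```python
from pathlib import PurePosixPath

# Filepath exactly as written in the catalog entry above.
filepath = "hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv"

# Treating a URL as a POSIX path collapses the "//" after the scheme,
# which matches the "hf:/" seen in the error message.
as_path = PurePosixPath(filepath)
print(as_path)

# Hypothetical base directory: "/spotify" appears in the error message,
# but where the dataset takes it from is an assumption.
base_dir = PurePosixPath("/spotify")

# Joining the URL onto a local directory yields a pattern that cannot exist.
mangled = base_dir / filepath
print(mangled)
```

If this is roughly what happens, any filepath with a URL scheme that is not handled explicitly would end up as a local path pattern and fail in the same way.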

Context

I don't fully understand the extent of the issue, so it's hard to advocate for fixing it. Obviously, being able to read files from data lakes (S3, Azure, etc.) is essential, but I think those already work with _get_load_path(). Being able to plug in a URL pointing to a file anywhere on the web could be really convenient, but I don't have a solid use case for it at the moment.

Possible Implementation

I'm not entirely sure this is valid, but since ibis seems to work on its own in these examples, perhaps ibis.FileDataset could try passing the raw path to ibis before failing...
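A rough sketch of that idea, under stated assumptions: `resolve_load_path` and its `base_dir` parameter are hypothetical names, not part of kedro-datasets, and the real fix would need to fit into however _get_load_path() is used today.

```python
import re
from pathlib import PurePosixPath

# Matches a leading URL scheme such as "hf://", "https://" or "s3://".
_URL_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def resolve_load_path(filepath: str, base_dir: str = "") -> str:
    """Hypothetical helper: pass URLs through untouched, resolve local paths."""
    if _URL_SCHEME.match(filepath):
        # Remote location: hand the raw string to the backend as-is.
        return filepath
    # Local file: join it onto the base directory as before.
    return str(PurePosixPath(base_dir) / filepath)


remote = resolve_load_path(
    "hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv", "/spotify"
)
local = resolve_load_path("data/01_raw/tracks.csv", "/spotify")
print(remote)
print(local)
```

The scheme check keeps the decision in one place, so whether a given backend can actually read the URL is left to ibis and the backend itself.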

Possible Alternatives

The easiest workaround is to just download the files.
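For completeness, a minimal sketch of that workaround using only the standard library. `download_to_raw` is a hypothetical helper name, and it only covers URL schemes that urllib understands (for example https://, not hf://).

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_to_raw(url: str, target: str) -> Path:
    """Hypothetical helper: fetch a remote file to a local path."""
    target_path = Path(target)
    # Create the destination folder if it does not exist yet.
    target_path.parent.mkdir(parents=True, exist_ok=True)
    # Download the file; urlretrieve writes it to target_path.
    urlretrieve(url, target_path)
    return target_path
```

For example, calling it with the raw.githubusercontent.com URL above and a local target such as data/01_raw/spotify.csv (an assumed project path) would let the catalog entry point its filepath at the local copy instead.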