Description
ibis.FileDataset fails when reading files from certain web domains (or possibly all of them). With ibis on its own, I can read data from Hugging Face:
import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv")
Based on a conversation with @deepyaman, the above code might only work because DuckDB has a Hugging Face extension. That said, for what it's worth, I found a GitHub repo with the same file, tried to read it with ibis, and it worked:
import ibis
con = ibis.duckdb.connect()
tracks = con.read_csv("https://raw.githubusercontent.com/seanwryan/DS210-Final-Project/refs/heads/main/spotify.csv")
However, when adding either file (hf:/ or raw.githubusercontent) as a FileDataset, my pipeline fails:
kedro.io.core.DatasetError: Failed while loading data from dataset FileDataset(backend=duckdb, file_format=csv, filepath=hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv, load_args={'sep': ,}, save_args={'materialized': view, 'overwrite': True}). IO Error: No files found that match the pattern "/spotify/hf:/datasets/maharshipandya/spotify-tracks-dataset/dataset.csv":
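The issue appears to be with _get_load_path(): the pattern in the error shows the remote path appended to a local directory (/spotify/), which suggests the URL is being treated as a relative local path.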
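As a rough illustration of how a path like the one in the error could arise, the sketch below joins a URL onto a local directory with pathlib. This is only an assumption about the failure mode, not a trace of what _get_load_path() actually does; the "/spotify" root is taken from the error message, and the double-slash form of the hf URL is assumed.

```python
from pathlib import PurePosixPath

# Assumed project root, taken from the pattern in the error message.
project_root = PurePosixPath("/spotify")

# Assumed remote location, written with the usual "scheme://" form.
remote = "hf://datasets/maharshipandya/spotify-tracks-dataset/dataset.csv"

# pathlib sees no leading slash, so it treats the URL as a relative path,
# appends it to the root, and collapses the "//" after the scheme.
joined = project_root / remote

print(joined)
```

If something similar happens during path resolution, it would explain both the /spotify/ prefix and the single slash after hf: in the reported pattern.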
Context
I don't fully understand the extent of the issue, so it's hard to advocate for fixing it. Being able to read files from data lakes (S3, Azure, etc.) is obviously essential, but I think those already work with _get_load_path(). Being able to plug in a URL pointing to a file anywhere on the web could be really convenient, but I don't have a solid use case for it at the moment.
Possible Implementation
I'm not entirely sure this is valid, but since ibis seems to work on its own in these examples, perhaps ibis.FileDataset could try passing the raw path to ibis before failing...
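A minimal sketch of that idea, assuming the fix amounts to leaving anything with a URL scheme untouched and only resolving plain paths against the project directory. resolve_load_path and its parameters are hypothetical names for illustration, not part of kedro-datasets or ibis.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def resolve_load_path(filepath: str, project_root: str) -> str:
    """Hypothetical helper: pass URLs through, resolve local paths."""
    scheme = urlsplit(filepath).scheme
    # A scheme longer than one character (https, s3, hf, ...) is treated
    # as a remote location; a single letter is more likely a Windows drive.
    if len(scheme) > 1:
        return filepath
    # Plain relative paths are still resolved against the project root.
    return str(PurePosixPath(project_root) / filepath)


remote_path = resolve_load_path(
    "https://raw.githubusercontent.com/seanwryan/DS210-Final-Project/refs/heads/main/spotify.csv",
    "/spotify",
)
local_path = resolve_load_path("data/01_raw/tracks.csv", "/spotify")
```

With this, remote_path stays exactly as given and could be handed to con.read_csv() directly, while local_path is still resolved relative to the project. Whether the backend can then open the URL remains up to ibis and DuckDB.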
Possible Alternatives
The easiest workaround is to just download the files.