Open zwouter opened 4 months ago
Hey @zwouter, thanks a lot for opening the issue!
I don't have access to a Windows machine. Can you help us investigate? From the logs, it seems to come from mlc not yielding any example from the default split:
AssertionError: Failed to finalize writing of split "default"No examples were yielded.
For some reason, it tries to load the default split (but from the JSON-LD, it seems only the fashion_mnist split works).
import mlcroissant as mlc
url = "http://huggingface.co/api/datasets/fashion_mnist/croissant"
ds = mlc.Dataset(url)
for x in ds.records(record_set="fashion_mnist"):
    print(x)
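One way to check which record sets the JSON-LD actually declares, without going through TFDS, is to read the `recordSet` entries directly. This is a minimal sketch, assuming the document follows the usual Croissant layout (a top-level `recordSet` list whose entries carry a `name` and/or an `@id`). `record_set_names` and `fetch_jsonld` are hypothetical helpers written for this thread, not part of mlcroissant:

```python
import json
import urllib.request


def record_set_names(jsonld: dict) -> list[str]:
    """Return the names of the record sets declared in a Croissant JSON-LD dict."""
    record_sets = jsonld.get("recordSet", [])
    # Tolerate a single record set given as an object instead of a list.
    if isinstance(record_sets, dict):
        record_sets = [record_sets]
    # Prefer "name"; fall back to "@id" when no name is present.
    return [str(rs.get("name") or rs.get("@id")) for rs in record_sets]


def fetch_jsonld(url: str) -> dict:
    """Download and parse a Croissant JSON-LD document (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Tiny inline document standing in for the real JSON-LD (illustrative only).
example = {
    "name": "demo",
    "recordSet": [{"@type": "cr:RecordSet", "name": "fashion_mnist"}],
}
print(record_set_names(example))
```

Running `record_set_names(fetch_jsonld(url))` on both Windows and Linux would show whether the two platforms see the same metadata.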
Thanks!
Hi @marcenacp, thanks for the reply!
I have the latest versions of mlcroissant and tfds-nightly installed; I created a new virtual environment yesterday to test this.
That piece of code does not print anything if I run it on Windows.
Weird!
Can you please try to delete all local caches? (Caches are located in ~/.cache/croissant for Croissant and ~/tensorflow_datasets for TFDS)
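For a repeatable way to clear both caches on any OS, something like the following could be used. It is only a sketch: the two default paths are the ones quoted above and are assumed to resolve under the user's home directory on Windows as well, and `clear_caches` is a hypothetical helper name.

```python
import shutil
from pathlib import Path

# Default cache locations mentioned above, resolved against the home directory.
DEFAULT_CACHES = (
    Path.home() / ".cache" / "croissant",
    Path.home() / "tensorflow_datasets",
)


def clear_caches(paths=DEFAULT_CACHES) -> list[Path]:
    """Delete each existing cache directory and return the ones removed."""
    removed = []
    for path in paths:
        # Skip locations that do not exist so the call is safe to repeat.
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Calling `clear_caches()` with no arguments removes the default locations; the return value shows which directories were actually present, which is itself a useful hint about where the caches ended up.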
Yes, I just deleted the relevant caches; same results.
Sorry, that was a blind guess as I cannot reproduce what happens in Windows. Could you please help us understand why the following snippet doesn't print anything?
import mlcroissant as mlc
url = "http://huggingface.co/api/datasets/fashion_mnist/croissant"
ds = mlc.Dataset(url)
for x in ds.records(record_set="fashion_mnist"):
    print(x)
    break
You can install mlcroissant in dev mode:
pip uninstall mlcroissant
git clone https://github.com/mlcommons/croissant
cd croissant/python/mlcroissant
pip install -e .[dev]
Adding prints/debug points in records and in sub functions should help you find something. The potential culprits could be:
For each source I gave you, you could debug the input/output to follow the data flow.
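As a lightweight alternative to scattering prints by hand, each stage of the data flow can be wrapped in a small pass-through generator that reports what it sees. The sketch below is generic Python, not mlcroissant code; `trace` is a hypothetical helper and the two-stage pipeline is a stand-in for the real one.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def trace(stage: str, items: Iterable[T], preview: int = 1) -> Iterator[T]:
    """Yield items unchanged, printing the first few and the final count."""
    count = 0
    for item in items:
        if count < preview:
            print(f"[{stage}] item {count}: {item!r}")
        count += 1
        yield item
    # Only reached when the iterable is exhausted (not after an early break).
    print(f"[{stage}] yielded {count} item(s)")


# Stand-in pipeline: a stage that silently drops everything shows up as 0 items.
raw = trace("read", [{"label": 1}, {"label": 2}])
filtered = trace("filter", (r for r in raw if r["label"] > 5))
result = list(filtered)
```

The same wrapper could be put around the loop from the snippet above, e.g. `for x in trace("records", ds.records(record_set="fashion_mnist")):`, to confirm whether zero records come out on Windows.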
Thanks in advance for your help and contribution!
No problem, I'm happy with any help I can get :)
And thanks for the resources! Unfortunately, I don't think I have the time to completely debug this right now. I might look into it further if I find some spare time in the coming weeks.
Short description When using a simple example code snippet of the CroissantBuilder to load datasets in the Croissant format, it only seems to work on Linux. The code snippet below correctly downloads and prepares a dataset on Colab or WSL, but results in an error on Windows. All tested in a clean virtual environment.
Environment information
Operating System: Windows 11
Python version: 3.11.1
tensorflow-datasets/tfds-nightly version: tfds-nightly 4.9.6.dev202408050044
tensorflow/tf-nightly version: tensorflow 2.17.0
Does the issue still exist with the last tfds-nightly package (pip install --upgrade tfds-nightly)? Yes
Reproduction instructions
Link to logs https://pastebin.com/fRrfn8jj
Expected behavior A dataset builder is prepared such that I can use .as_data_source() later.