tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

CroissantBuilder does not work on Windows machines #5546

Open zwouter opened 4 months ago

zwouter commented 4 months ago

Short description When using a simple example code snippet of the CroissantBuilder to load datasets in the Croissant format, it only seems to work on Linux. The code snippet below correctly downloads and prepares a dataset on Colab or WSL, but results in an error on Windows. All tested in a clean virtual environment.

Environment information

Reproduction instructions

import mlcroissant as mlc
import tensorflow_datasets as tfds

url = "https://huggingface.co/api/datasets/fashion_mnist/croissant"
builder = tfds.core.dataset_builders.CroissantBuilder(jsonld=url, file_format='array_record')
builder.download_and_prepare()

Link to logs https://pastebin.com/fRrfn8jj

Expected behavior A dataset builder is prepared such that I can use .as_data_source() later.

marcenacp commented 3 months ago

Hey @zwouter, thanks a lot for opening the issue!

I don't have access to a Windows machine. Can you help us investigate? From the logs, it seems to come from mlc not yielding any example from the default split:

AssertionError: Failed to finalize writing of split "default"No examples were yielded.

For some reason, it tries to load the default split, but from the JSON-LD it seems only the fashion_mnist split works. Could you try running the following?

import mlcroissant as mlc
url = "http://huggingface.co/api/datasets/fashion_mnist/croissant"
ds = mlc.Dataset(url)
for x in ds.records(record_set="fashion_mnist"):
  print(x)

Thanks!

zwouter commented 3 months ago

Hi @marcenacp, thanks for the reply!

I have the latest versions of mlcroissant and tfds-nightly installed; I created a new virtual environment yesterday to test this.

That piece of code does not print anything when I run it on Windows.
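For anyone else trying to reproduce this, the exact platform and package versions can be collected with a short script like the one below. This is only a sketch: `environment_report` is a hypothetical helper name, and the package list simply reuses the two distributions mentioned in this comment.

```python
import platform
from importlib import metadata


def environment_report(packages=("mlcroissant", "tfds-nightly")):
    """Return a small text report of the OS, Python and package versions."""
    lines = [
        f"OS: {platform.system()} {platform.release()}",
        f"Python: {platform.python_version()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Running it once on Windows and once on WSL would make it easy to compare the two environments side by side.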

marcenacp commented 3 months ago

Weird!

Can you please try to delete all local caches? (Caches are located in ~/.cache/croissant for Croissant and ~/tensorflow_datasets for TFDS)
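If it helps, the two cache directories can be removed from Python in a platform-independent way. The snippet below is a minimal sketch that assumes the caches really live at the default locations named above; `clear_caches` is a hypothetical helper, not part of TFDS or mlcroissant.

```python
import shutil
from pathlib import Path

# Default cache locations as named in this comment. Path.home() resolves "~"
# to the user profile directory on Windows as well.
DEFAULT_CACHE_DIRS = (
    Path.home() / ".cache" / "croissant",
    Path.home() / "tensorflow_datasets",
)


def clear_caches(cache_dirs=DEFAULT_CACHE_DIRS):
    """Delete each existing cache directory and return the ones removed."""
    removed = []
    for path in cache_dirs:
        path = Path(path)
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
    return removed


# Uncomment to actually delete the default caches:
# print(clear_caches())
```

The function returns the directories it deleted, so an empty list means neither cache existed at the expected location.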

zwouter commented 3 months ago

Yes, I just deleted the relevant caches, same results.

marcenacp commented 3 months ago

Sorry, that was a blind guess, as I cannot reproduce what happens on Windows. Could you please help us understand why the following snippet doesn't print anything?

import mlcroissant as mlc
url = "http://huggingface.co/api/datasets/fashion_mnist/croissant"
ds = mlc.Dataset(url)
for x in ds.records(record_set="fashion_mnist"):
  print(x)
  break

You can install mlcroissant in dev mode:

pip uninstall mlcroissant
git clone https://github.com/mlcommons/croissant
cd croissant/python/mlcroissant
pip install -e .[dev]

Adding prints/debug points in records and in sub functions should help you find something. The potential culprits could be:

For each source I gave you, you could debug the input/output to follow the data flow.
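As a starting point for that kind of tracing, here is a rough sketch of a helper that pulls only the first few records and reports explicitly when nothing is yielded. `peek_records` and `debug_record_set` are hypothetical names; only `mlc.Dataset(url)` and `ds.records(record_set=...)` come from the snippet above.

```python
import itertools


def peek_records(records, limit=3):
    """Collect at most `limit` records from an iterable, printing each one."""
    seen = []
    for index, record in enumerate(itertools.islice(records, limit)):
        print(f"record {index}: {record!r}")
        seen.append(record)
    if not seen:
        # This is the symptom reported on Windows: the iterator is empty.
        print("No records were yielded.")
    return seen


def debug_record_set(url, record_set, limit=3):
    """Load a Croissant dataset and peek at one of its record sets."""
    # Imported lazily so peek_records can be reused without mlcroissant.
    import mlcroissant as mlc

    ds = mlc.Dataset(url)
    return peek_records(ds.records(record_set=record_set), limit=limit)


# Example call, using the URL and record set from this thread:
# debug_record_set(
#     "http://huggingface.co/api/datasets/fashion_mnist/croissant",
#     "fashion_mnist",
# )
```

The same `peek_records` wrapper can be placed around any intermediate generator inside mlcroissant to see at which step the data stops flowing.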

Thanks in advance for your help and contribution!

zwouter commented 3 months ago

No problem, I'm happy with any help I can get :)

And thanks for the resources! Unfortunately, I don't think I have the time to completely debug this right now. I might look into it further if I find some spare time in the coming weeks.