huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.71k stars 2.59k forks

irc_disentangle - Issue with splitting data #6906

Open eor51355 opened 1 month ago

eor51355 commented 1 month ago

Describe the bug

I am trying to load the dataset from Python using datasets.load_dataset("irc_disentangle"), and I get this error message:

ValueError: Instruction "train" corresponds to no data!

Steps to reproduce the bug

import datasets
ds = datasets.load_dataset('irc_disentangle')
ds

Expected behavior

The data should load into ds and be accessible by split and index, for example ds['train'][1050] and ds['train'][1055].

Environment info

I tried Python 3.12 and 3.10.

eor51355 commented 2 weeks ago

Thank you I will try this out!

On Tue, Jun 11, 2024 at 3:55 AM Vincent Lau @.***> wrote:

I added streaming=True after the name of the dataset, and it works. Hope this helps.

Also, if you install datasets==2.15.0, this bug does not happen. I don't know why, but it works.
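The streaming workaround described above could be sketched roughly as follows. This is a minimal illustration, not a confirmed fix: the helper name load_with_streaming_fallback and its loader parameter are hypothetical, introduced here only so the retry logic can be exercised without network access. By default it calls datasets.load_dataset, first normally and then with streaming=True if a ValueError is raised.

```python
from typing import Any, Callable, Optional


def load_with_streaming_fallback(
    name: str,
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Try a regular load first; on ValueError, retry with streaming=True.

    `loader` is a hypothetical injection point for testing; when omitted,
    datasets.load_dataset is used.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without the library.
        from datasets import load_dataset

        loader = load_dataset
    try:
        return loader(name)
    except ValueError as err:
        # e.g. ValueError: Instruction "train" corresponds to no data!
        print(f"Regular load failed ({err}); retrying with streaming=True")
        return loader(name, streaming=True)


# Example usage (requires the `datasets` package and network access):
#   ds = load_with_streaming_fallback("irc_disentangle")
#   first_example = next(iter(ds["train"]))
```

Note that a streamed dataset is iterated rather than indexed, so ds['train'][1050] would not work on the streamed result; examples are read in order with a loop or next(iter(...)).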


cybest0608 commented 2 weeks ago

I have found that there is still a strange bug in v2.15.0 of datasets. It seems that the *.arrow file cannot be created; it may be an index of the subsets. I am still trying to debug it. For now, one of the most efficient workarounds may be to use Google Colab to build this index in ~/huggingface/datasets, and then download the files to replace the local ones. It works!

eor51355 commented 2 weeks ago

Yeah, I did try what you suggested and it didn’t work. I was able to get a local copy from someone who had accessed the dataset in the past. Let me know when you end up fixing this bug.
