huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Older datasets throwing safety errors with 2.21.0 #7141

Closed alvations closed 2 months ago

alvations commented 2 months ago

Describe the bug

Dataset loading throws a security-related error (`KeyError: 'safe'`) for the popular wmt14 dataset.

[in]:

import datasets

# train_data = datasets.load_dataset("wmt14", "de-en", split="train")
train_data = datasets.load_dataset("wmt14", "de-en", split="train")
val_data = datasets.load_dataset("wmt14", "de-en", split="validation[:10%]")

[out]:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-445f0ecc4817> in <cell line: 4>()
      2 
      3 # train_data = datasets.load_dataset("wmt14", "de-en", split="train")
----> 4 train_data = datasets.load_dataset("wmt14", "de-en", split="train")
      5 val_data = datasets.load_dataset("wmt14", "de-en", split="validation[:10%]")

12 frames
/usr/local/lib/python3.10/dist-packages/huggingface_hub/hf_api.py in __init__(self, **kwargs)
    636         if security is not None:
    637             security = BlobSecurityInfo(
--> 638                 safe=security["safe"], av_scan=security["avScan"], pickle_import_scan=security["pickleImportScan"]
    639             )
    640         self.security = security

KeyError: 'safe'
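For context, the `KeyError` comes from bracket indexing into the `security` payload returned by the Hub API. A minimal, self-contained sketch of the failing pattern (the payload dict here is hypothetical, standing in for what the server now returns) next to a tolerant `dict.get` lookup:

```python
# Hypothetical security payload: the server omits the "safe" key.
security = {"avScan": "clean", "pickleImportScan": "clean"}

try:
    security["safe"]  # bracket indexing, as in the hf_api.py line above -> KeyError
except KeyError as err:
    assert str(err) == "'safe'"  # the exact error message seen in the traceback

assert security.get("safe") is None  # .get() would return None instead of raising
```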

Steps to reproduce the bug

See above.

Expected behavior

Dataset properly loaded.

Environment info

version: 2.21.0

adil-a commented 2 months ago

I am also getting this error with this dataset: https://huggingface.co/datasets/google/IFEval

adrianb92 commented 2 months ago

Me too; I didn't have this issue a few hours ago.

Vipitis commented 2 months ago

Same observation. I even downgraded to datasets==2.20.0 and huggingface_hub==0.23.5 and still hit the error, which leads me to believe it's an issue on the server.

any known workarounds?

alvations commented 2 months ago

Not a good idea, but commenting out the whole security block at /usr/local/lib/python3.10/dist-packages/huggingface_hub/hf_api.py is a temporary workaround:

        #security = kwargs.pop("security", None)
        #if security is not None:
        #    security = BlobSecurityInfo(
        #        safe=security["safe"], av_scan=security["avScan"], pickle_import_scan=security["pickleImportScan"]
        #    )
        #self.security = security

omar93939 commented 2 months ago

Uploading a dataset to Huggingface also results in the following error in the Dataset Preview:

The full dataset viewer is not available (click to read why). Only showing a preview of the rows.
'safe'
Error code:   UnexpectedError
Need help to make the dataset viewer work? Make sure to review [how to configure the dataset viewer](link1), and [open a discussion](link2) for direct support.

I used JSONL format for the dataset in this case. The exact same dataset worked previously.

soldni commented 2 months ago

Same issue here. Even reverting to an older version of datasets (e.g., 2.19.0) results in the same error:

>>> datasets.load_dataset('allenai/ai2_arc', 'ARC-Easy')

File "/Users/lucas/miniforge3/envs/oe-eval-internal/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3048, in <listcomp>
    RepoFile(**path_info) if path_info["type"] == "file" else RepoFolder(**path_info)
  File "/Users/lucas/miniforge3/envs/oe-eval-internal/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 534, in __init__
    safe=security["safe"], av_scan=security["avScan"], pickle_import_scan=security["pickleImportScan"]
KeyError: 'safe'

FotieMConstant commented 2 months ago

I just hit this issue a few minutes ago, crawled the internet, and found nothing. I came here to open an issue and found this thread. It is really frustrating. Has anyone found a fix?

ummagumm-a commented 2 months ago

Hi, my team and I have the same problem.

JonasGeiping commented 2 months ago

Yeah, this just suddenly appeared within the last few hours, without any client-side code changes.

Here's a patch to fix the issue temporarily:

import huggingface_hub
def patched_repofolder_init(self, **kwargs):
    self.path = kwargs.pop("path")
    self.tree_id = kwargs.pop("oid")
    last_commit = kwargs.pop("lastCommit", None) or kwargs.pop("last_commit", None)
    if last_commit is not None:
        last_commit = huggingface_hub.hf_api.LastCommitInfo(
            oid=last_commit["id"],
            title=last_commit["title"],
            date=huggingface_hub.utils.parse_datetime(last_commit["date"]),
        )
    self.last_commit = last_commit

def patched_repo_file_init(self, **kwargs):
    self.path = kwargs.pop("path")
    self.size = kwargs.pop("size")
    self.blob_id = kwargs.pop("oid")
    lfs = kwargs.pop("lfs", None)
    if lfs is not None:
        lfs = huggingface_hub.hf_api.BlobLfsInfo(size=lfs["size"], sha256=lfs["oid"], pointer_size=lfs["pointerSize"])
    self.lfs = lfs
    last_commit = kwargs.pop("lastCommit", None) or kwargs.pop("last_commit", None)
    if last_commit is not None:
        last_commit = huggingface_hub.hf_api.LastCommitInfo(
            oid=last_commit["id"],
            title=last_commit["title"],
            date=huggingface_hub.utils.parse_datetime(last_commit["date"]),
        )
    self.last_commit = last_commit
    self.security = None

    # backwards compatibility
    self.rfilename = self.path
    self.lastCommit = self.last_commit

huggingface_hub.hf_api.RepoFile.__init__ = patched_repo_file_init
huggingface_hub.hf_api.RepoFolder.__init__ = patched_repofolder_init

neoneye commented 2 months ago

Also discussed here: https://discuss.huggingface.co/t/i-keep-getting-keyerror-safe-when-loading-my-datasets/105669/1

FotieMConstant commented 2 months ago

I'm thinking this must be a server issue; no client code was changed on my end. So weird!

lebrice commented 2 months ago

As far as I can tell, this seems to be happening with all datasets that use RepoFolder (which probably covers most datasets on Hugging Face, right?).

FotieMConstant commented 2 months ago

Here is a temporary fix for the problem: https://discuss.huggingface.co/t/i-keep-getting-keyerror-safe-when-loading-my-datasets/105669/12?u=mlscientist

this doesn't seem to work!

adrianb92 commented 2 months ago

In case you are using Colab or similar, remember to restart your session after modifying the hf_api.py file.

JonasGeiping commented 2 months ago

No need to modify the file directly, just monkey-patch.

I'm now fairly sure the error appears because the backend expects the API code to look like it does on main. If RepoFile and RepoFolder roughly match how they look on main, they work again.

If not patched as above, a secondary error will appear:

    return self.info(path, expand_info=False)["type"] == "directory"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    "tree_id": path_info.tree_id,
               ^^^^^^^^^^^^^^^^^
AttributeError: 'RepoFolder' object has no attribute 'tree_id'
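That secondary error is the generic failure mode of an incomplete monkey-patch: if the replacement `__init__` skips an attribute that later code reads, Python raises `AttributeError` at the read site. A toy illustration (the class name here is hypothetical, standing in for RepoFolder):

```python
# Toy stand-in for RepoFolder with an incomplete patched __init__.
class RepoNode:
    def __init__(self, **kwargs):
        self.path = kwargs.pop("path")
        # tree_id is never set here, mimicking an incomplete patch

node = RepoNode(path="data")
assert not hasattr(node, "tree_id")  # reading node.tree_id would raise AttributeError
```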

muellerzr commented 2 months ago

We've reverted the deployment; please let us know if the issue still persists!

ajstarna commented 2 months ago

thanks @muellerzr!