microsoft / TOXIGEN

This repo contains the code for generating the ToxiGen dataset, published at ACL 2022.

Unable to Load Dataset #2

Closed aflah02 closed 2 years ago

aflah02 commented 2 years ago

Hey! Awesome paper and codebase, it's very well documented!! I've been facing some issues trying to load the dataset. I tried to load it on Colab using the following lines -

from datasets import load_dataset
TG_data = load_dataset("skg/toxigen-data", name="train", use_auth_token=True) # 250k training examples
TG_annotations = load_dataset("skg/toxigen-data", name="annotated", use_auth_token=True) # Human study

I got the following error -


---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-2-f61e6e9de847> in <module>()
      1 from datasets import load_dataset
----> 2 TG_data = load_dataset("skg/toxigen-data", name="train", use_auth_token=True) # 250k training examples
      3 TG_annotations = load_dataset("skg/toxigen-data", name="annotated", use_auth_token=True) # Human study

7 frames
/usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self)
    939 
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942 
    943     def close(self):

HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/datasets/skg/toxigen-data

I suspect it's because of some authorization issue, but I've filled out the form and I'm not quite sure what else I should do.

aflah02 commented 2 years ago

Nvm it turns out I only needed to replace it like this -

from datasets import load_dataset
TG_data = load_dataset("skg/toxigen-data", name="train", use_auth_token=Actual_Token) # 250k training examples
TG_annotations = load_dataset("skg/toxigen-data", name="annotated", use_auth_token=Actual_Token) # Human study

wzhings commented 2 years ago

@aflah02 What is the "Actual_Token" in your code?

aflah02 commented 2 years ago

@wzhings It's a Hugging Face Auth token. You can find how to get one here - https://huggingface.co/docs/hub/security-tokens

wzhings commented 2 years ago

@aflah02 Thank you for your reply. I created the auth token, but I still got the following error: HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/datasets/skg/toxigen-data. I think I need to obtain permission by filling out the form before I can access the data.
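
For readers hitting the same errors: under standard HTTP semantics, the two status codes seen in this thread usually point at different problems. A 401 typically means no valid credentials were sent, while a 403 typically means the credentials were accepted but the account is not permitted to read the resource. The sketch below only encodes that reading; explain_hub_error is a hypothetical helper written for this note, not part of datasets or huggingface_hub, and the suggested causes are inferences from this thread rather than documented Hub behavior.

```python
from http import HTTPStatus


def explain_hub_error(status_code: int) -> str:
    """Return a likely cause for an HTTP error status (hypothetical helper)."""
    if status_code == HTTPStatus.UNAUTHORIZED:  # 401
        # Matches the first report above: use_auth_token=True with no stored token.
        return "no valid token was sent; pass a token string or log in first"
    if status_code == HTTPStatus.FORBIDDEN:  # 403
        # Matches the second report: a token was sent, but access was not granted.
        return "token accepted, but this account lacks access; check the access form"
    return "unexpected status; inspect the full response"


for code in (401, 403):
    print(code, "->", explain_hub_error(code))
```

If the status is 403, generating a new token is unlikely to help on its own; the account behind the token is what needs access.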

Thartvigsen commented 2 years ago

Hi @wzhings, are you plugging in the security token when loading the data? Up above, it seems you can do it this way:

Actual_Token = "<YOUR_TOKEN_GOES_HERE>"
TG_data = load_dataset("skg/toxigen-data", name="train", use_auth_token=Actual_Token) # 250k training examples
TG_annotations = load_dataset("skg/toxigen-data", name="annotated", use_auth_token=Actual_Token) 

where Actual_Token is the token you got from the security-tokens page.

I personally didn't use this method, though; I used huggingface_cli. According to this page, I think you can try:

Run pip install huggingface_hub from the command line, then run the following within Python:

from huggingface_hub import notebook_login

notebook_login()
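
A small variation on the first method, for anyone who would rather not paste the token into a notebook cell: read it from an environment variable and pass it to load_dataset. This is a minimal sketch under stated assumptions. The variable name HF_TOKEN and the helpers read_token and load_toxigen are choices made for this example, not something the thread or the ToxiGen repo prescribes. The load_dataset calls mirror the ones above; newer releases of datasets may expect token= instead of use_auth_token=, so check that against your installed version.

```python
import os


def read_token(env_var: str = "HF_TOKEN") -> str:
    """Read an access token from the environment (hypothetical helper).

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to your Hugging Face access token first.")
    return token


def load_toxigen(token: str):
    """Load both ToxiGen configurations with an explicit token."""
    # Imported lazily so read_token can be used without datasets installed.
    from datasets import load_dataset

    train = load_dataset("skg/toxigen-data", name="train", use_auth_token=token)
    annotated = load_dataset("skg/toxigen-data", name="annotated", use_auth_token=token)
    return train, annotated


# Usage (requires network access and an account with access to the dataset):
#   TG_data, TG_annotations = load_toxigen(read_token())
```

Failing early with a clear message when the variable is unset avoids sending a request with no credentials at all.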

wzhings commented 2 years ago

Hi @Thartvigsen, thank you for the information. I used the first method (i.e., Actual_Token) and got the above error. Now I will try the second method you used. Thank you :)

aflah02 commented 2 years ago

@wzhings Do you still get the error after filling the form? That's strange because this worked for me lol

wzhings commented 2 years ago

Hi @aflah02, Yes, I still get the error after filling the form. I did not get any response after filling the form. I am not sure whether I need to wait for their permission.

aflah02 commented 2 years ago

Hey @wzhings, I'm not sure if you need to wait for permission, but this is quite strange 🤔. I had also filled out the form and then generated the token, and it worked. I guess it could be the order, maybe? Generating the token after filling out the form? Or maybe it has to do with permissions only! Anyway, if what @Thartvigsen suggested works, you could just ignore all this, since that seems to be the better way.

Thartvigsen commented 2 years ago

@wzhings you won't get a response after filling out the form, no need to wait on that to get access!

wzhings commented 2 years ago

Hello @Thartvigsen, I can finally access the dataset with both of the above methods after filling out the forms with different email accounts. Thank you.

Thartvigsen commented 2 years ago

@wzhings I am glad to hear that, thanks for letting me know!