Dataset retriever fails when trying to download gated dataset

saum7800 commented 1 year ago

Here is the output from my prompt_spec:

Instruction:
Task description: Identify a broad class given several examples from that class

Examples:
input=
Q: The similarity among DemandBase, InfusionSoft, and HotSchedules is that they are all 
output=tech companies

input=
Q: The Architecture of Open Source Applications, Algorithms to Live By: The Computer Science of Human Decisions, and The Art of the Start: The Time-Tested, Battle-Hardened Guide for Anyone Starting Anything can be classified as 
output=Computer Science books

input=
Q: Wrike, SEMrush, and Sprinklr are all 
output=tech companies

Got the following error when trying to retrieve the dataset

FileNotFoundError: Couldn't find a dataset script at /projects/tir5/users/ssgandhi/prompt2model/bigbench/bigbench/BIG-bench/bigbench/imagenet-1k/imagenet-1k.py or any data file in the same directory. Couldn't find 'imagenet-1k' on the Hugging Face Hub either: FileNotFoundError: Dataset 'imagenet-1k' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login`.

zhaochenyang20 commented 1 year ago

@viswavi

neubig commented 1 year ago

Hey @saum7800, I took a look at this and if you read the error, it says that the dataset may be private or gated. I looked at the specific dataset, and it seems that this is indeed the case: https://huggingface.co/datasets/imagenet-1k

There are two solutions to this:

1. Follow the instructions in the error message -- run the Hugging Face CLI and get permission to use the data.
2. When you get this gated dataset error, gracefully proceed to using the next dataset.

"1." is a solution for this dataset, but you might always run into a new dataset that has problems, so I think "2." will need to be implemented. There are two ways that we could do this:

1. Simply write a for loop that steps over datasets (in the colab notebook and CLI?) and selects the next one any time the first one fails.
2. If there is a way to figure out if a dataset is gated through the Hugging Face API, we could indicate this in our metadata file.

Maybe we could just go with the first option for now.
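
A rough sketch of what that loop could look like. This is only an illustration, not code from the repo: `load_first_available` and the `loader` argument are hypothetical names, and in practice `loader` would presumably be something like `datasets.load_dataset`. It assumes the failure surfaces as a `FileNotFoundError`, as in the traceback above.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def load_first_available(
    candidates: Iterable[str],
    loader: Callable[[str], Any],
) -> Tuple[Optional[str], Any]:
    """Return (name, dataset) for the first candidate that loads.

    Returns (None, None) if every candidate fails.
    """
    for name in candidates:
        try:
            return name, loader(name)
        except FileNotFoundError as err:
            # Assumption: gated, private, and missing datasets all raise
            # FileNotFoundError, matching the error reported in this issue.
            print(f"Skipping {name!r}: {err}")
    return None, None
```

Passing the loader in as an argument keeps the loop independent of how the datasets are actually fetched, so the same helper could be shared by the notebook and the CLI.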

saum7800 commented 1 year ago

Right, that makes sense!

In prompt2model_demo.py and .ipynb, the user manually selects the dataset number/name. Maybe it makes sense to inform the user that the dataset is gated and that they should select another one from the retrieved datasets (assuming Hugging Face allows us to programmatically know a dataset is gated). Does that sound right?

neubig commented 1 year ago

Yep. And in the worst case you could always catch the exception and programmatically parse the error message to see if it indicates that the dataset is gated.
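
Something along these lines, as a minimal sketch. The helper names (`looks_gated`, `describe_failure`) are made up for illustration, and the check relies on the exact wording "private or gated" from the error above, which is an assumption that could break if the message changes between library versions.

```python
# Phrase taken from the error message reported in this issue.
GATED_HINT = "private or gated"


def looks_gated(err: Exception) -> bool:
    """Heuristically decide whether an exception points to a gated dataset."""
    return GATED_HINT in str(err).lower()


def describe_failure(name: str, err: Exception) -> str:
    """Build a message asking the user to pick another retrieved dataset."""
    if looks_gated(err):
        return (
            f"Dataset {name!r} may be private or gated. "
            "Please select another dataset from the retrieved list."
        )
    return f"Dataset {name!r} could not be loaded: {err}"
```

Because this is string matching, it is best treated as a fallback; a structured signal from the Hub would be more robust if one is available.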

ritugala commented 1 year ago

This will be resolved once reranking PR is merged!

neulab / prompt2model

Dataset retriever fails when trying to download gated dataset #371