Closed gabriellestein closed 1 year ago
Thanks for reaching out. I will try to help you as best I can to get the repository running.
When I open the project in a GitHub Codespace, `poetry install` works just fine.
One way to fix things with Poetry is to remove the poetry.lock file
and let it resolve the dependencies again.
Also make sure you satisfy the Python version constraint in the pyproject.toml
file, which requires at least 3.9. Try a different Python version if the issues persist.
About the input arguments, you are totally right. We updated that. To generate new examples, you can either use an HF dataset and change it here:
https://github.com/jpwahle/emnlp22-transforming/blob/main/paraphrase/generate.py#L200-L202
Or you can implement a small loader function to load the "originals" locally.
I hope that helps.
Also, feel free to open a PR if any of the changes affect a specific Python version, etc.
Hello, I was able to get the program to run, but I ran into an issue with the dataset loading. Do you have any advice?
emnlp22-transforming$ poetry run python paraphrase/generate.py --model_name gpt3 --num_prompts 4 --num_examples 32
Downloading (…)olve/main/vocab.json: 100%|████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 12.3MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████████████████████████████████████████| 456k/456k [00:00<00:00, 5.68MB/s]
Downloading (…)lve/main/config.json: 100%|█████████████████████████████████████████████████| 665/665 [00:00<00:00, 240kB/s]
Found cached dataset csv (.cache/huggingface/datasets/jpwahle___csv/jpwahle--autoencoder-paraphrase-dataset-62c0d44fc5d69b69/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)
100%|████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3.04it/s]
Loading cached processed dataset at .cache/huggingface/datasets/jpwahle___csv/jpwahle--autoencoder-paraphrase-dataset-62c0d44fc5d69b69/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-7b7f45ca046b5e11.arrow
Loading cached processed dataset at .cache/huggingface/datasets/jpwahle___csv/jpwahle--autoencoder-paraphrase-dataset-62c0d44fc5d69b69/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-a1c931b934775176.arrow
Traceback (most recent call last):
  File "emnlp22-transforming/paraphrase/generate.py", line 233, in <module>
    main()
  File "emnlp22-transforming/paraphrase/generate.py", line 205, in main
    load_dataset(args.paraphrase_dataset)
  File ".cache/pypoetry/virtualenvs/autoregressive-paraphrasing-f_g9mVfZ-py3.10/lib/python3.10/site-packages/datasets/load.py", line 1767, in load_dataset
    builder_instance = load_dataset_builder(
  File ".cache/pypoetry/virtualenvs/autoregressive-paraphrasing-f_g9mVfZ-py3.10/lib/python3.10/site-packages/datasets/load.py", line 1498, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File ".cache/pypoetry/virtualenvs/autoregressive-paraphrasing-f_g9mVfZ-py3.10/lib/python3.10/site-packages/datasets/load.py", line 1211, in dataset_module_factory
    raise FileNotFoundError(
FileNotFoundError: Couldn't find a dataset script at emnlp22-transforming/mrpc/mrpc.py or any data file in the same directory. Couldn't find 'mrpc' on the Hugging Face Hub either: FileNotFoundError: Dataset 'mrpc' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with `huggingface-cli login`.
I am very new to Hugging Face and similar tools, so I apologize if this seems obvious, but does this mean the Hugging Face dataset isn't linked correctly in the code? I thought it was public, so I wouldn't need to log in, though I did try logging in to Hugging Face and got the same error. Thank you.
Can you try `load_dataset("glue", "mrpc")` in this line? We used a private dataset to load and just added mrpc here as an example.
Hello, we get this error using that code.
Traceback (most recent call last):
  File "/emnlp22-transforming/paraphrase/generate.py", line 233, in <module>
    main()
  File "/emnlp22-transforming/paraphrase/generate.py", line 205, in main
    load_dataset("glue", "mrpc")
  File "/.cache/pypoetry/virtualenvs/autoregressive-paraphrasing-f_g9mVfZ-py3.10/lib/python3.10/site-packages/datasets/dataset_dict.py", line 63, in __getitem__
    raise KeyError(
KeyError: "Invalid key: slice(None, 100, None). Please first select a split. For example: `my_dataset_dictionary['train'][slice(None, 100, None)]`. Available splits: ['test', 'train', 'validation']"
These seem to be files generated by Hugging Face for the dataset, so maybe something is missing from the load_dataset call that uses them?
Sorry for the Easter delay. It seems that part has changed in HF. Just remove the [:100]
there and see if it works.
I hope everything runs smoothly now; let me know if the issue still persists.
Hello, I am attempting to replicate the experiment from your EMNLP22 paper. I am running this in a VM. I cloned this repo and followed the instructions on GitHub, but encountered the following issues:
The first issue was with poetry install, which failed with the error message: "Can not execute setup.py since setuptools is not available in the build environment."
I resolved this issue by running ".../bin/pip install setuptools" and updating pyproject.toml to requires = ["setuptools", "poetry_core>=1.4"].
After updating pyproject.toml, poetry install failed with the error message: "/emnlp22-transforming/autoregressive_paraphrasing does not contain any element."
To resolve this issue, I ran "wget -O autoregressive_paraphrasing https://huggingface.co/datasets/jpwahle/autoregressive-paraphrase-dataset/resolve/main/dataset.tsv".
With these changes, poetry install returned:
"Installing dependencies from lock file No dependencies to install or update Installing the current project: autoregressive-paraphrasing (0.1.0)".
However, when I ran "poetry run python -m paraphrase.generate --model_name_or_path t5 --input_file input.txt --output_file output.txt --prompts prompts.txt", I encountered the error message:
"generate.py: error: unrecognized arguments: --model_name_or_path t5 --input_file input.txt --output_file output.txt --prompts prompts.txt".
I would appreciate any guidance on how to resolve this issue with the generate.py script. Thank you.
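For context on that last error: argparse exits with "unrecognized arguments" when a flag is not declared on the script's parser, which suggests generate.py expects the flags used earlier in this thread (`--model_name`, `--num_prompts`, `--num_examples`) rather than `--model_name_or_path` and friends. A minimal sketch of the mechanism, using a hypothetical parser rather than the actual one from paraphrase/generate.py:

```python
import argparse

# Hypothetical parser mirroring the flags used earlier in the thread;
# NOT the actual parser from paraphrase/generate.py.
parser = argparse.ArgumentParser(prog="generate.py")
parser.add_argument("--model_name")
parser.add_argument("--num_prompts", type=int)
parser.add_argument("--num_examples", type=int)

# parse_known_args collects flags the parser would otherwise reject
# with "error: unrecognized arguments" and a hard exit.
args, unknown = parser.parse_known_args(
    ["--model_name_or_path", "t5", "--num_prompts", "4"]
)
print(unknown)  # flags not declared above end up here
```

Checking `python paraphrase/generate.py --help` shows which flags the installed version of the script actually accepts.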