tau-nlp / scrolls

The official code for the EMNLP 2022 paper "SCROLLS: Standardized CompaRison Over Long Language Sequences".
https://www.scrolls-benchmark.com/
MIT License

Prediction for Qasper test data fails #8

Closed · grndnl closed this issue 1 year ago

grndnl commented 1 year ago

Hello,

I'm trying to replicate the fine-tuning results for the Qasper dataset baseline and the 256-bart model.

I see two issues when I try to generate predictions:

  1. When generating predictions on the Qasper validation data, only 984 samples are loaded, instead of the 1,726 stated in the paper and present in the dataset itself. This is the command I'm running:

     ```
     python scripts/execute.py scripts/commands/generate.py qasper_256-bart_validation --checkpoint_path /home/ubuntu/baselines/outputs/facebook-bart-base_256_1_5e-05_16384_scrolls_qasper_site-wash-14
     ```

  2. When generating predictions for the test data, the script errors out, even though there should be 1,399 examples:

     ```
     File "/home/ubuntu/baselines/src/run.py", line 689, in main
       id_to_prediction[instance["id"]] = predict_results.predictions[i]
     IndexError: index 984 is out of bounds for axis 0 with size 984
     ```

     This is the command I'm using:

     ```
     python scripts/execute.py scripts/commands/generate.py qasper_256-bart_test --checkpoint_path /home/ubuntu/baselines/outputs/facebook-bart-base_256_1_5e-05_16384_scrolls_qasper_site-wash-14
     ```

Could you please advise? Thanks!
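
For context, the shape of the failure: the predictions array holds 984 rows while the test split should have 1,399 ids, so the id-to-prediction loop runs off the end. A toy reproduction of the mismatch (plain lists standing in for the numpy predictions array, made-up ids and predictions):

```python
# Toy reproduction: 984 predictions vs 1,399 example ids.
predictions = [f"pred-{i}" for i in range(984)]
ids = [f"id-{i}" for i in range(1399)]

id_to_prediction = {}
try:
    for i, example_id in enumerate(ids):
        # Mirrors run.py line 689; with a numpy array this surfaces as
        # "index 984 is out of bounds for axis 0 with size 984".
        id_to_prediction[example_id] = predictions[i]
except IndexError:
    print(f"stopped after {len(id_to_prediction)} of {len(ids)} ids")
```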

eladsegal commented 1 year ago

Hi Daniele, can you please share the output of pip freeze?

grndnl commented 1 year ago

Thanks for the help, below is the output.

I want to clarify that prediction on the validation set completes successfully, although the number of samples does not match (are some samples dropped because they are too long to fit into the 256-bart model's context?). I've looked at the dataset downloaded by the script in ~/.cache/huggingface/datasets/ and it has the correct number of samples. Fine-tuning the 256-bart model also seems to have worked as expected.

```
absl-py==2.0.0
aiohttp==3.8.6
aiosignal==1.3.1
antlr4-python3-runtime==4.8
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.1.0
bitarray==2.8.2
certifi==2023.7.22
cffi==1.16.0
charset-normalizer==3.3.0
click==8.1.7
colorama==0.4.6
Cython==3.0.3
datasets==1.17.0
dill==0.3.7
docker-pycreds==0.4.0
fairseq==0.12.2
filelock==3.12.4
frozenlist==1.4.0
fsspec==2023.9.2
gitdb==4.0.10
GitPython==3.1.37
huggingface-hub==0.18.0
hydra-core==1.0.7
idna==3.4
importlib-resources==6.1.0
joblib==1.3.2
lxml==4.9.3
multidict==6.0.4
multiprocess==0.70.15
nltk==3.8.1
numpy==1.24.4
omegaconf==2.0.6
packaging==23.2
pandas==2.0.3
pathtools==0.1.2
plotly==5.3.1
portalocker==2.8.2
protobuf==4.24.4
psutil==5.9.5
pyarrow==13.0.0
pycparser==2.21
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
rouge-score==0.1.2
sacrebleu==2.3.1
sacremoses==0.0.53
sentencepiece==0.1.99
sentry-sdk==1.32.0
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
tabulate==0.9.0
tenacity==8.2.3
tokenizers==0.10.3
torch==1.9.0+cu111
torchaudio==0.9.0
tqdm==4.66.1
transformers @ git+http://github.com/eladsegal/public-transformers@839ed93a19dc344e72cd1afe1b604addc74040bd
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.6
wandb==0.15.12
xxhash==3.4.1
yarl==1.9.2
zipp==3.17.0
```

grndnl commented 1 year ago

I realized some answers to my two original questions:

  1. There are 984 samples in the Qasper validation set after duplicates are dropped here. The validation set is processed by the preprocess_function here, and while the resulting eval_dataset correctly contains 984 samples, I see the following warning, which I'm not able to interpret:

     ```
     2023-10-17 04:45:09 | WARNING | datasets.fingerprint | Parameter 'function'=<function preprocess_function at 0x7fa38167d790> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
     2023-10-17 04:45:09 | WARNING | datasets.arrow_dataset | Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/tau___scrolls/qasper/1.0.0/672021d5d8e1edff998a6ea7a5bff35fdfd0ae243e7cf6a8c88a57a04afb46ac/cache-1c80317fa3b1799d.arrow
     ```
  2. For the test data, the deduplication function does not drop any data (I still see 1,399 samples right after deduplication). However, the [pre-processing function for the test data](https://github.com/tau-nlp/scrolls/blob/1fb1042e66fd005b76fc5ad4557d31ed2bab61c7/baselines/src/run.py#L597) seems to do something wrong, because the size of the test data afterwards is 984 (suspiciously the same as the validation data). I also see a similar pair of warnings:

     ```
     2023-10-17 04:45:36 | WARNING | datasets.fingerprint | Parameter 'function'=<function preprocess_function at 0x7f4917b06ca0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
     2023-10-17 04:45:36 | WARNING | datasets.arrow_dataset | Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/tau___scrolls/qasper/1.0.0/672021d5d8e1edff998a6ea7a5bff35fdfd0ae243e7cf6a8c88a57a04afb46ac/cache-1c80317fa3b1799d.arrow
     ```


As you can see, both warnings report loading the same cached dataset.
By setting `load_from_cache_file=False` [here](https://github.com/tau-nlp/scrolls/blob/1fb1042e66fd005b76fc5ad4557d31ed2bab61c7/baselines/src/run.py#L603C27-L603C27), I now see 1,399 predict samples.
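
The first warning is the tell: `datasets` fingerprints each `map` transform by serializing it, and when serialization fails it substitutes a random hash, so the cache key no longer reliably identifies the transform that produced a cached file. A pure-Python sketch of that failure mode (not the real `datasets` internals, which use dill rather than pickle):

```python
import hashlib
import pickle
import random

def fingerprint(fn):
    """Sketch: hash a transform by pickling it; fall back to a random
    hash when pickling fails, mirroring the "a random hash was used
    instead" warning."""
    try:
        return hashlib.sha256(pickle.dumps(fn)).hexdigest()
    except Exception:
        return format(random.getrandbits(128), "032x")

# A picklable function fingerprints deterministically, so cache lookups
# keyed on it are safe.
assert fingerprint(len) == fingerprint(len)

# An unpicklable function gets a fresh random fingerprint on every call,
# so cache keys stop meaning anything.
unpicklable = lambda x: x
assert fingerprint(unpicklable) != fingerprint(unpicklable)
```

Once the fingerprint is unreliable, trusting previously written cache-*.arrow files is unsafe, which is why forcing `load_from_cache_file=False` (always recompute) sidesteps the stale cache.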

---

This leaves question 1 answered and question 2 temporarily worked around, though further inspection is needed to understand why the preprocess_function loads the wrong cached predict dataset.

eladsegal commented 1 year ago

Hi Daniele, sorry for the delay in the response!

You understood the removal of duplicate inputs correctly.
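
Conceptually, that deduplication keeps the first example for each distinct input, which is how 1,726 validation examples can collapse to 984 unique ones. A minimal sketch (the field names are illustrative, not the exact ones in the SCROLLS preprocessing code):

```python
def drop_duplicate_inputs(examples):
    """Keep only the first example for each distinct input string."""
    seen = set()
    unique = []
    for ex in examples:
        if ex["input"] not in seen:
            seen.add(ex["input"])
            unique.append(ex)
    return unique

examples = [
    {"id": "a", "input": "Q1?"},
    {"id": "b", "input": "Q1?"},  # duplicate input -> dropped
    {"id": "c", "input": "Q2?"},
]
deduped = drop_duplicate_inputs(examples)
```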

Regarding the warning and error you got, I found the issue:
Version 1.17.0 of datasets patches dill in a way that works with dill 0.3.4 but fails with newer versions. This breaks cache fingerprinting and, as a result, causes issues with the cache.

The fix is to explicitly install the following dependencies:

```
dill==0.3.4
multiprocess==0.70.12.2  # newer versions require dill>0.3.4
```

I've also updated the repository accordingly. Thank you for bringing this to our attention!

grndnl commented 1 year ago

Thanks for the help!