texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Receiving a `JSONDecodeError` when running `tevatron.driver.encode` on WQ dataset #53

Open xhluca opened 1 year ago

xhluca commented 1 year ago

I first used Tevatron to train DPR from bert-base-uncased:

python -m torch.distributed.launch --nproc_per_node=1 -m tevatron.driver.train \
  --output_dir model_wq \
  --dataset_name Tevatron/wikipedia-wq \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --fp16 \
  --per_device_train_batch_size 128 \
  --train_n_passages 2 \
  --learning_rate 1e-5 \
  --q_max_len 32 \
  --p_max_len 156 \
  --num_train_epochs 40 \
  --negatives_x_device \
  --overwrite_output_dir

After the model was saved to model_wq/ (see footnote), I continued to follow the instructions to encode the passages:

export ENCODE_DIR="wq_corpus_encoded"

mkdir $ENCODE_DIR
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_wq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-wq-corpus \
  --encoded_save_path corpus_emb.$s.pkl \
  --encode_num_shard 20 \
  --encode_shard_index $s
done

I saved that inside a bash file and ran it, but I got multiple JSONDecodeErrors along the way, which did not seem expected (which is why I stopped the process):

$ bash encode_wq_corpus.sh 
mkdir: cannot create directory ‘wq_corpus_encoded’: File exists
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder -   try loading tied weight
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder -   loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 4573.94it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 429.92it/s]
Traceback (most recent call last):                                   
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
    main()
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
    cache_dir=data_args.data_cache_dir or model_args.cache_dir)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
    data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1210, in _prepare_split
    desc=f"Generating {split_info.name} split",
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/tmp/.cache/huggingface/modules/datasets_modules/datasets/Tevatron--wikipedia-wq-corpus/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033/wikipedia-wq-corpus.py", line 82, in _generate_examples
    data = json.loads(line)
  File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 30 (char 29)
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder -   try loading tied weight
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder -   loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 5849.80it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 517.24it/s]
Traceback (most recent call last):                                  ^C
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
    main()
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
    cache_dir=data_args.data_cache_dir or model_args.cache_dir)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
    data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1212, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1579, in encode_example
    return encode_nested_example(self, example)
  File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1136, in encode_nested_example
    def encode_nested_example(schema, obj, level=0):
KeyboardInterrupt

Is this normal?
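For what it's worth, the "Unterminated string" error looks like a truncated or corrupted line in the extracted corpus file. A minimal sketch to locate any malformed lines (the path below is hypothetical; the actual extracted JSONL sits somewhere under the HF datasets cache):

import json

# Hypothetical path; point this at the actual extracted corpus file,
# e.g. found via: find /tmp/.cache/huggingface -name '*.jsonl'
path = "/tmp/.cache/huggingface/datasets/downloads/extracted/<hash>/docs.jsonl"

with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f"line {lineno}: {e!r}")

If this flags a line, the download itself was probably truncated rather than the json library misbehaving.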

Libraries

This is my requirements file:

git+https://github.com/texttron/tevatron@b8f33900895930f9886012580e85464a5c1f7e9a
torch==1.12.*
faiss-cpu==1.7.2
transformers==4.15.0
datasets==1.17.0
pyserini

Footnote

MXueguang commented 1 year ago

Hi @xhluca, sorry for the late reply. Is the issue specific to Tevatron/wikipedia-wq-corpus, or does Tevatron/wikipedia-nq-corpus fail too? It seems like an issue caused by the JSON-parsing environment:

    data = json.loads(line)
  File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)

Let me know if you're still having the issue.

Xueguang

xhluca commented 1 year ago

I'm not sure what "json environment" means here. I'm using the standard Python 3.7 json library in a fresh virtualenv.

xhluca commented 1 year ago

I tried different datasets and the problem is still present.

MXueguang commented 1 year ago

Could you check whether a simple JSONL file can be read in your environment? Or could you try a conda environment? Mine is Python 3.8 with conda.

xhluca commented 1 year ago

Yes, I tried the following example: https://stackoverflow.com/questions/50475635/loading-jsonl-file-as-json-objects

And it works fine in my environment.
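
Concretely, the check I ran was along these lines (a reconstruction of the linked example, not the exact snippet):

import json

# Write a small JSONL file, then read it back line by line.
with open("test.jsonl", "w") as f:
    f.write('{"a": 1}\n')
    f.write('{"b": 2}\n')

with open("test.jsonl") as f:
    records = [json.loads(line) for line in f]

print(records)  # [{'a': 1}, {'b': 2}]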

xhluca commented 1 year ago

@MXueguang My bad, I was indeed using conda. However, do you think it should make a difference whether I'm using conda or virtualenv, since the libraries were installed with pip and there's no conda-specific dependency?