neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte #1

Closed: echan00 closed this issue 3 years ago

echan00 commented 3 years ago

Hello,

I tried to run alignments using the provided model (the one trained without train_co) and the example data (zhen.src-tgt), but I am getting the error shown below:

DATA_FILE=./examples/zhen.src-tgt
MODEL_NAME_OR_PATH=./model_without_co/pytorch_model.bin
OUTPUT_FILE=./output/zhen.awesome-align.out

CUDA_VISIBLE_DEVICES=0 python3 run_align.py \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --batch_size 32

Traceback (most recent call last):
  File "run_align.py", line 194, in <module>
    main()
  File "run_align.py", line 167, in main
    config = config_class.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
  File "/Users/xxx/Downloads/awesome-align-master/configuration_utils.py", line 175, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/Users/xxx/Downloads/awesome-align-master/configuration_utils.py", line 227, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/Users/xxx/Downloads/awesome-align-master/configuration_utils.py", line 313, in _dict_from_json_file
    text = reader.read()
  File "/Users/xxx/.pyenv/versions/3.7.9/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

echan00 commented 3 years ago

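Sorry, this is resolved now.

MODEL_NAME_OR_PATH should point to the model directory, not to the .bin file inside it. As the traceback shows, the config loader was trying to read the binary weights file as JSON text:

MODEL_NAME_OR_PATH=./model_without_co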
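
For anyone hitting the same error, the check below is a minimal sketch of how the mistake could be caught before calling run_align.py. The helper name resolve_model_dir is hypothetical and not part of awesome-align, and the sketch assumes the model directory contains a config.json next to pytorch_model.bin, which is the usual layout for models loaded with from_pretrained but is not stated in this thread.

```python
import tempfile
from pathlib import Path


def resolve_model_dir(model_name_or_path: str) -> str:
    """Return a value suitable for --model_name_or_path.

    Hypothetical helper, not part of awesome-align. If a weights file such as
    ./model_without_co/pytorch_model.bin is passed, fall back to its parent
    directory, then check that a config.json is present there (assumed layout).
    """
    path = Path(model_name_or_path)
    if path.is_file():
        # A file was given; the config loader would try to parse it as JSON text.
        path = path.parent
    if not path.is_dir():
        # Not a local path (for example a model name): leave it untouched.
        return model_name_or_path
    if not (path / "config.json").is_file():
        raise FileNotFoundError(f"no config.json found in {path}")
    return str(path)


if __name__ == "__main__":
    # Build a throwaway directory that mimics the layout assumed above.
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = Path(tmp) / "model_without_co"
        model_dir.mkdir()
        (model_dir / "config.json").write_text("{}", encoding="utf-8")
        # 0x80 is not a valid first byte in UTF-8, so reading this as text fails.
        weights = model_dir / "pytorch_model.bin"
        weights.write_bytes(b"\x80\x02")

        try:
            weights.read_text(encoding="utf-8")
        except UnicodeDecodeError as err:
            print("reading the weights file as text fails:", err)

        # The helper maps the weights file back to its directory.
        print(resolve_model_dir(str(weights)) == str(model_dir))
```

The resolved directory can then be passed as --model_name_or_path in place of the .bin path.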