Unclear which transformers version should be used when testing Tatoeba

vrmer commented 2 years ago

I installed all the necessary dependencies and tried running the Tatoeba task using bash scripts/train.sh "bert-base-multilingual-cased" tatoeba.

However, I immediately ran into an ImportError:

Traceback (most recent call last):
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/evaluate_retrieval.py", line 39, in <module>
    from bert import BertForRetrieval
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/bert.py", line 4, in <module>
    from transformers.modeling_bert import BertModel, BertPreTrainedModel
ModuleNotFoundError: No module named 'transformers.modeling_bert'

This is with transformers-4.17.0.

I tried downgrading transformers to versions 3.5 and 2.0, but then I ran into other issues.

Traceback (most recent call last):
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/evaluate_retrieval.py", line 43, in <module>
    from xlm_roberta import XLMRobertaConfig, XLMRobertaForRetrieval, XLMRobertaModel
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/xlm_roberta.py", line 24, in <module>
    from roberta import (
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/roberta.py", line 27, in <module>
    from transformers.modeling_bert import BertEmbeddings, BertLayerNorm, BertModel, BertPreTrainedModel, gelu
ImportError: cannot import name 'BertLayerNorm' from 'transformers.modeling_bert' (/Users/marcellfekete/miniforge3/envs/rosetta/lib/python3.8/site-packages/transformers/modeling_bert.py)

This is with transformers 3.5.

Traceback (most recent call last):
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/evaluate_retrieval.py", line 57, in <module>
    "xlmr": (XLMRobertaConfig, XLMRobertaModel, XLMRobertaTokenizer),
NameError: name 'XLMRobertaTokenizer' is not defined

This is with transformers 2.0.

Do you have any advice? Which transformers version is recommended to run the tests?

I don't know if it matters but I am trying to run on Apple Silicon using the Rosetta layer (due to faiss not installing natively).

Thank you!

sebastianruder commented 2 years ago

Hi @vrmer, we used transformers==2.3.0 as far as I am aware. Did you try running the install_tools.sh script? This should install the correct transformers version (see this line).
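A quick way to confirm which version is actually active in the environment is to ask the package metadata directly. The following is only a minimal sketch: the helper names installed_version and matches_pin are hypothetical (they are not part of xtreme or transformers), and the pin 2.3.0 is taken from the comment above and assumed to match what install_tools.sh installs.

```python
from importlib import metadata  # standard library since Python 3.8

# Assumption: 2.3.0 is the version named in this thread; adjust if the
# install script pins something else.
PINNED = "2.3.0"


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, pinned):
    """True only if `package` is installed at exactly the `pinned` version."""
    return installed_version(package) == pinned


if __name__ == "__main__":
    found = installed_version("transformers")
    if found is None:
        print("transformers is not installed in this environment")
    elif not matches_pin("transformers", PINNED):
        print(f"transformers {found} is installed, expected {PINNED}")
    else:
        print(f"transformers {PINNED} is installed")
```

Running this inside the same conda environment that runs evaluate_retrieval.py avoids checking a different interpreter by mistake.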

vrmer commented 2 years ago

Thanks for the response! I ran into issues running the install_tools.sh script when I first started using the library but I don't have the output for that at the moment.

Nevertheless, I followed the lines you pointed to and installed transformers==2.3.0. However, I still get the following error:

Traceback (most recent call last):
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/evaluate_retrieval.py", line 57, in <module>
    "xlmr": (XLMRobertaConfig, XLMRobertaModel, XLMRobertaTokenizer),
NameError: name 'XLMRobertaTokenizer' is not defined

sebastianruder commented 2 years ago

Hmm, that's strange. I just checked the HuggingFace Transformers repo and XLMRobertaTokenizer should be available in v2.3.0 (see here). Could you double-check that you have the correct version and that the file tokenization_xlm_roberta.py is present in the transformers version you are using?
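One way to run that check without browsing site-packages by hand is to ask the import system whether the submodule can be located. This is a minimal sketch; module_available is a hypothetical helper, not something provided by xtreme or transformers.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the import system can locate `dotted_name`.

    For a dotted name, find_spec imports the parent package first, so a
    missing parent raises ModuleNotFoundError (a subclass of ImportError),
    which is treated here as "not available".
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        return False


if __name__ == "__main__":
    # The module name comes from the file mentioned above.
    name = "transformers.tokenization_xlm_roberta"
    print(name, "available:", module_available(name))
```

If this reports that the module is available but the NameError persists, the problem is more likely in the importing script itself than in the installed package.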

vrmer commented 2 years ago

Apologies, apparently I commented out the import statement when I was trying to make the code run and forgot to restore it!

Now the code starts running with this message:

03/16/2022 15:57:00 - INFO - root -   Input args: Namespace(batch_size=100, cache_dir='', candidate_prefix='candidates', concate_layers=False, config_name='', data_dir='/Users/marcellfekete/PycharmProjects/xtreme/download//tatoeba/', dist='cosine', do_lower_case=False, embed_size=768, encoding='utf-8', extract_embeds=False, gold=None, init_checkpoint=None, local_rank=-1, log_file='embed-cosine', max_answer_length=92, max_query_length=64, max_seq_length=512, mine_bitext=False, model_name_or_path='/mnt/disk-1/models/squad/xlm-roberta-large_LR3e-5_EPOCH2.0_maxlen384_batchsize2_gradacc16', model_type='bert', no_cuda=False, num_layers=12, output_dir='/Users/marcellfekete/PycharmProjects/xtreme/outputs-temp//tatoeba//mnt/disk-1/models/squad/xlm-roberta-large_LR3e-5_EPOCH2.0_maxlen384_batchsize2_gradacc16_512/', overwrite_cache=False, overwrite_output_dir=False, pool_skip_special_token=False, pool_type='mean', predict_dir=None, specific_layer=7, split='training', src_embed_file=None, src_file=None, src_id_file=None, src_language='ar', src_text_file=None, src_tok_file=None, task_name='tatoeba', tgt_embed_file=None, tgt_file=None, tgt_id_file=None, tgt_language='en', tgt_text_file=None, tgt_tok_file=None, threshold=-1, tokenizer_name='', unify=False, use_shift_embeds=False)

But then it gives me this error message:

Traceback (most recent call last):
  File "/Users/marcellfekete/miniforge3/envs/rosetta/lib/python3.8/site-packages/transformers/configuration_utils.py", line 204, in get_config_dict
    raise EnvironmentError
OSError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/evaluate_retrieval.py", line 748, in <module>
    main()
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/evaluate_retrieval.py", line 733, in main
    all_src_embeds = extract_embeddings(args, src_text_file, src_tok_file, None, lang=src_lang2, pool_type=args.pool_type)
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/evaluate_retrieval.py", line 173, in extract_embeddings
    config, model, tokenizer, langid = load_model(args, lang,
  File "/Users/marcellfekete/PycharmProjects/xtreme/third_party/evaluate_retrieval.py", line 150, in load_model
    config = config_class.from_pretrained(args.model_name_or_path)
  File "/Users/marcellfekete/miniforge3/envs/rosetta/lib/python3.8/site-packages/transformers/configuration_utils.py", line 160, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/Users/marcellfekete/miniforge3/envs/rosetta/lib/python3.8/site-packages/transformers/configuration_utils.py", line 220, in get_config_dict
    raise EnvironmentError(msg)
OSError: Model name '/mnt/disk-1/models/squad/xlm-roberta-large_LR3e-5_EPOCH2.0_maxlen384_batchsize2_gradacc16' was not found in model name list. We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert//mnt/disk-1/models/squad/xlm-roberta-large_LR3e-5_EPOCH2.0_maxlen384_batchsize2_gradacc16/config.json' was a path, a model identifier, or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.

I'm not even sure why it is trying to use XLM-RoBERTa when I explicitly tried using multilingual BERT.

sebastianruder commented 2 years ago

I assume you're running the run_tatoeba.sh script? We now recommend using a model fine-tuned on SQuAD for retrieval, rather than using the representations of the pre-trained model directly. In the run_tatoeba.sh script, you can replace the path to the fine-tuned model here. If you prefer not to use a fine-tuned model, you can simply comment out that line and things should run as expected.

Edit: Running scripts/train.sh "bert-base-multilingual-cased" tatoeba calls the run_tatoeba.sh script.
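The OSError above appears because the configured path does not exist on the local machine, so the loader falls back to treating it as a model identifier. A small pre-flight check can make that visible before the model is loaded. This is only a sketch under assumptions: both helper names are hypothetical, and the only convention relied on is the one stated in the error message, namely that a local model directory contains a file named config.json.

```python
import os


def looks_like_local_model_dir(path):
    """True if `path` is an existing directory that contains config.json."""
    return os.path.isdir(path) and os.path.isfile(os.path.join(path, "config.json"))


def describe_model_argument(model_name_or_path):
    """Classify a model_name_or_path value before handing it to a loader."""
    if looks_like_local_model_dir(model_name_or_path):
        return "local directory with config.json"
    if os.path.isabs(model_name_or_path):
        # e.g. a path that only exists on another machine
        return "absolute path that does not exist here or lacks config.json"
    return "not a local model directory; will be treated as a model identifier"


if __name__ == "__main__":
    for candidate in (
        "bert-base-multilingual-cased",
        "/mnt/disk-1/models/squad/xlm-roberta-large_LR3e-5_EPOCH2.0_maxlen384_batchsize2_gradacc16",
    ):
        print(candidate, "->", describe_model_argument(candidate))
```

Checking the value this way separates "the path is wrong for this machine" from "the model name is unknown", which the combined error message does not distinguish.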

vrmer commented 2 years ago

Oh thank you, that was actually really helpful! Now the code seems to be running without issues.

I am closing the issue because it has been sorted.

google-research / xtreme, issue #85