gmichalo / UmlsBERT


run_language_modeling.py #4

Closed jayachaturvedi closed 3 years ago

jayachaturvedi commented 3 years ago

Hello, I am facing issues running the run_language_modeling.py script when following the example for pretraining Bio_ClinicalBERT, using this command:

python3 run_language_modeling.py --output_dir ./models/clinicalBert-v1 --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --mlm --do_train --learning_rate 5e-5 --max_steps 150000 --block_size 128 --save_steps 1000 --per_gpu_train_batch_size 32 --seed 42 --line_by_line --train_data_file mimic_string.txt --umls --config_name config.json --med_document ./voc/vocab_updated.txt

Issue 1: the tokenizer did not have an attribute called max_len. The error was: AttributeError: 'BertTokenizerFast' object has no attribute 'max_len'. Based on advice online, I updated tokenizer.max_len to tokenizer.model_max_length, which seems to have resolved this issue.
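For reference, here is a minimal sketch of that change (the surrounding code is illustrative, not the exact script contents; the attribute was renamed when transformers moved to v4):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

    # Old (transformers v3): tokenizer.max_len
    # New (transformers v4+): tokenizer.model_max_length
    block_size = min(128, tokenizer.model_max_length)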

Issue 2: the current error message I am getting is: TypeError: __init__() got an unexpected keyword argument 'tui_ids'

While looking for answers online, I came across a comment on the Hugging Face transformers issue tracker at https://github.com/huggingface/transformers/issues/8739. It said: "It is actually due to #8604, where we removed several deprecated arguments. The run_language_modeling.py script is deprecated in favor of language-modeling/run_{clm, plm, mlm}.py."
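If the UmlsBERT script likewise predates that change, one possible workaround (my assumption, not something confirmed in this thread) is to pin transformers to a pre-4.0 release before running it, substituting whatever exact version the repo's requirements.txt specifies:

    # Hypothetical pin: check UmlsBERT's requirements.txt for the exact version
    pip install "transformers<4.0"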

Does this apply to the script for UmlsBERT as well? If so, how can I access the updated script? If not, how can I resolve the tui_ids issue?

I am running the scripts on Google Colab. This is the complete output I get when I run the command above.

output:

2021-07-05 09:47:55.207129: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
07/05/2021 09:47:57 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
07/05/2021 09:47:57 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1/runs/Jul05_09-47-57_d7624bb0fdc5,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=150000,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=clinicalBert-v1,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1,
save_on_each_node=False,
save_steps=1000,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
/usr/local/lib/python3.7/dist-packages/transformers/models/auto/modeling_auto.py:847: FutureWarning: The class AutoModelWithLMHead is deprecated and will be removed in a future version. Please use AutoModelForCausalLM for causal language models, AutoModelForMaskedLM for masked language models and AutoModelForSeq2SeqLM for encoder-decoder models.
  FutureWarning,
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']

P.S. I am not an expert programmer, so do let me know if I should provide any further information, as this is the first time I'm submitting an issue.

Thank you.

Best, Jaya

jayachaturvedi commented 3 years ago

Update: I have found the replacement files (run_clm.py, run_mlm.py, run_plm.py). When running run_clm_no_trainer.py (since I'm using MIMIC data to train), I get this error:

ModuleNotFoundError: No module named 'datasets_modules.datasets.mimic_string'

Running on Colab. This is the code:

!python3 'gdrive/My Drive/UmlsBERT-master/language-modeling/run_clm_no_trainer.py' --output_dir 'gdrive/My Drive/UmlsBERT-master/language-modeling/models/clinicalBert-v1' --model_name_or_path emilyalsentzer/Bio_ClinicalBERT --learning_rate 5e-5 --block_size 128 --seed 42 --dataset_config_name 'gdrive/My Drive/UmlsBERT-master/language-modeling/config.json' --dataset_name 'gdrive/My Drive/UmlsBERT-master/language-modeling/mimic_string.txt'

Here is the full output:

2021-07-06 10:08:00.087779: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
07/06/2021 10:08:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Use FP16 precision: False

Traceback (most recent call last):
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_clm_no_trainer.py", line 472, in <module>
    main()
  File "gdrive/My Drive/UmlsBERT-master/language-modeling/run_clm_no_trainer.py", line 241, in main
    raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 838, in load_dataset
    **config_kwargs,
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 687, in load_dataset_builder
    builder_cls = import_main_class(module_path, dataset=True)
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 91, in import_main_class
    module = importlib.import_module(module_path)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'datasets_modules.datasets.mimic_string'
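For what it's worth, the traceback suggests the problem is that mimic_string.txt was passed as --dataset_name, which load_dataset interprets as the name of a Hub dataset or a dataset loading script (hence the attempted import of datasets_modules.datasets.mimic_string). Assuming this run_clm_no_trainer.py mirrors the upstream transformers example, a local plain-text file should instead go through --train_file; the equivalent datasets call would be something like:

    from datasets import load_dataset

    # Load a local plain-text corpus with the generic "text" builder,
    # rather than treating the file path as a dataset name.
    raw_datasets = load_dataset("text", data_files={"train": "mimic_string.txt"})
    print(raw_datasets["train"][0])

(Also, since Bio_ClinicalBERT is a masked language model and the original command used --mlm, run_mlm.py / run_mlm_no_trainer.py may be the closer replacement than the clm variant.)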