huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Stuck on tokenization before training when using 3 GPUs, but not when using 2 GPUs #24473

Closed higopires closed 1 year ago

higopires commented 1 year ago

System Info

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe           Off| 00000000:52:00.0 Off |                    0 |
| N/A   55C    P0               80W / 300W|  59735MiB / 81920MiB |     12%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe           Off| 00000000:CE:00.0 Off |                    0 |
| N/A   56C    P0               87W / 300W|  40933MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe           Off| 00000000:D1:00.0 Off |                    0 |
| N/A   34C    P0               44W / 300W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Who can help?

@ArthurZucker @sgugger

Reproduction

I intend to use run_mlm.py to train RoBERTa from scratch. For training, I'm using data I created myself, and I ran the following command:

CUDA_VISIBLE_DEVICES=0,1,2 python run_mlm.py \
    --model_type roberta \
    --config_overrides="num_hidden_layers=6,max_position_embeddings=514" \
    --tokenizer_name MyModel \
    --train_file ./data/corpus_dedup.txt \
    --max_seq_length 512 \
    --line_by_line True \
    --per_device_train_batch_size 64 \
    --do_train \
    --overwrite_output_dir True \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 40 \
    --fp16 True \
    --output_dir MyModel \
    --save_total_limit 1

When I try to train using a 3-GPU configuration, the run gets stuck for dozens of hours at the tokenization step before training, with the following message:

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

Additionally, when I train with only 2 GPUs (CUDA_VISIBLE_DEVICES=0,1, followed by the same parameters), training runs normally.

Expected behavior

The model starts training from scratch on a 3-GPU configuration.

ArthurZucker commented 1 year ago

cc @Narsil, you know more than me about potential problems here (I remember there is a flag for tokenizer parallelism that might need to be set).

Narsil commented 1 year ago

This is very odd, since tokenizers doesn't use the GPU at all.

You could try using TOKENIZERS_PARALLELISM=0 CUDA_VISIBLE_DEVICES=... to disable the parallelism in tokenizers itself. There are ways to trigger a deadlock when using multithreading/multiprocessing with tokenizers from Python, but most of those should be caught. Note that this will considerably slow down the tokenizer training (that might already be what's occurring), since you're now only using 1 core instead of all the CPUs.
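
Concretely, something like this (a sketch that reuses the command from your issue description, untested):

TOKENIZERS_PARALLELISM=0 CUDA_VISIBLE_DEVICES=0,1,2 python run_mlm.py \
    --model_type roberta \
    --tokenizer_name MyModel \
    --train_file ./data/corpus_dedup.txt \
    ...   # keep the remaining arguments exactly as in your original command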

And most importantly, the GPU settings shouldn't have any impact here, so it looks like a bug in the run_mlm.py parallelization strategy, or something wrong with the hardware.

Is it possible to isolate the tokenizer training from the rest of the code to sanity-check things and see where the deadlock is coming from?
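
For example, something along these lines should run just the dataset tokenization step on its own (a rough sketch based on what run_mlm.py does; the paths and names come from your command, everything else is an assumption):

from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer and data file taken from the command in the issue description.
tokenizer = AutoTokenizer.from_pretrained("MyModel")
raw_datasets = load_dataset("text", data_files={"train": "./data/corpus_dedup.txt"})

def tokenize_function(examples):
    # Roughly what run_mlm.py does in line_by_line mode, without padding.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        return_special_tokens_mask=True,
    )

tokenized = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])
print(tokenized)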

higopires commented 1 year ago

This is very odd, since tokenizers doesn't use the GPU at all.

My bad: that's nvidia-smi output taken while the 2-GPU training was already running. My intent was to show my hardware configuration and CUDA version.

You could try using TOKENIZERS_PARALLELISM=0 CUDA_VISIBLE_DEVICE.... to disable the parallelism in tokenizers itself.

Gonna try it right now.

Is it possible to isolate the tokenizer training from the rest of the code to sanity-check things and see where the deadlock is coming from?

I'm using a tokenizer that I trained beforehand (merges.txt and vocab.json files), so it seems to me that the process is already isolated, isn't it?

Narsil commented 1 year ago

Then it should load instantly and not even retrain a tokenizer, no?

I'm not sure the message you shared is the cause of your issue: the warning is probably there, but it's just a hint that there's a faster way to encode data, not necessarily an indication of what is making your process stuck.

higopires commented 1 year ago

Gonna try it right now.

Just did it and came back here after a while: same issue:

[INFO|trainer.py:1680] 2023-06-26 13:43:56,492 >> ***** Running training *****
[INFO|trainer.py:1681] 2023-06-26 13:43:56,492 >>   Num examples = 2,353,535
[INFO|trainer.py:1682] 2023-06-26 13:43:56,492 >>   Num Epochs = 40
[INFO|trainer.py:1683] 2023-06-26 13:43:56,492 >>   Instantaneous batch size per device = 192
[INFO|trainer.py:1684] 2023-06-26 13:43:56,492 >>   Total train batch size (w. parallel, distributed & accumulation) = 768
[INFO|trainer.py:1685] 2023-06-26 13:43:56,493 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1686] 2023-06-26 13:43:56,493 >>   Total optimization steps = 122,560
[INFO|trainer.py:1687] 2023-06-26 13:43:56,493 >>   Number of trainable parameters = 82,170,969
[INFO|integrations.py:727] 2023-06-26 13:43:56,493 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: <USER>. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in /cfs/home/u021274/higo/wandb/run-20230626_134359-d7jhdqpd
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fluent-forest-46
wandb: ⭐️ View project at <URL>
wandb: 🚀 View run at <URL>

  0%|                                                  | 0/122560 [00:00<?, ?it/s][WARNING|logging.py:280] 2023-06-26 13:44:08,940 >> You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

nvidia-smi returns the following:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe           Off| 00000000:52:00.0 Off |                    0 |
| N/A   37C    P0               71W / 300W|   1885MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe           Off| 00000000:CE:00.0 Off |                    0 |
| N/A   39C    P0               69W / 300W|   1863MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe           Off| 00000000:D1:00.0 Off |                    0 |
| N/A   43C    P0               71W / 300W|   1863MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     62822      C   python                                     1882MiB |
|    1   N/A  N/A     62822      C   python                                     1860MiB |
|    2   N/A  N/A     62822      C   python                                     1860MiB |
+---------------------------------------------------------------------------------------+

It seems it's not the tokenization, because the GPUs are (barely) used, but the message I'm stuck at remains the same.

Narsil commented 1 year ago

I would try putting a debugger in your session and stepping through it to figure out where the script hangs.
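
For example (a sketch, keeping the rest of the arguments exactly as before), you can run the script under pdb and step through it:

CUDA_VISIBLE_DEVICES=0,1,2 python -m pdb run_mlm.py \
    --model_type roberta \
    ...   # same arguments as in the original command

(Pdb) n    # execute the next line
(Pdb) s    # step into a function call
(Pdb) p x  # print the value of a variable x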

higopires commented 1 year ago

> /cfs/home/u021274/higo/run_mlm.py(234)main()
-> parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(235)main()
-> if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(240)main()
-> model_args, data_args, training_args = parser.parse_args_into_dataclasses()
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(244)main()
-> send_example_telemetry("run_mlm", model_args, data_args)
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(247)main()
-> logging.basicConfig(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(248)main()
-> format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(249)main()
-> datefmt="%m/%d/%Y %H:%M:%S",
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(250)main()
-> handlers=[logging.StreamHandler(sys.stdout)],
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(247)main()
-> logging.basicConfig(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(253)main()
-> if training_args.should_log:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(255)main()
-> transformers.utils.logging.set_verbosity_info()
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(257)main()
-> log_level = training_args.get_process_log_level()
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(258)main()
-> logger.setLevel(log_level)
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(259)main()
-> datasets.utils.logging.set_verbosity(log_level)
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(260)main()
-> transformers.utils.logging.set_verbosity(log_level)
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(261)main()
-> transformers.utils.logging.enable_default_handler()
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(262)main()
-> transformers.utils.logging.enable_explicit_format()
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(265)main()
-> logger.warning(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(266)main()
-> f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(267)main()
-> + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(266)main()
-> f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(265)main()
-> logger.warning(
(Pdb) n
06/26/2023 19:45:08 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 3distributed training: True, 16-bits training: True
> /cfs/home/u021274/higo/run_mlm.py(270)main()
-> logger.info(f"Training/evaluation parameters {training_args}")
(Pdb) n
06/26/2023 19:45:09 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=3,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=MyModel/runs/Jun26_19-44-10_g07,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=40.0,
optim=adamw_hf,
optim_args=None,
output_dir=MyModel,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=64,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=MyModel,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=steps,
save_total_limit=1,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
> /cfs/home/u021274/higo/run_mlm.py(273)main()
-> last_checkpoint = None
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(274)main()
-> if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(288)main()
-> set_seed(training_args.seed)
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(299)main()
-> if data_args.dataset_name is not None:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(326)main()
-> data_files = {}
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(327)main()
-> if data_args.train_file is not None:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(328)main()
-> data_files["train"] = data_args.train_file
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(329)main()
-> extension = data_args.train_file.split(".")[-1]
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(330)main()
-> if data_args.validation_file is not None:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(333)main()
-> if extension == "txt":
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(334)main()
-> extension = "text"
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(335)main()
-> raw_datasets = load_dataset(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(336)main()
-> extension,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(337)main()
-> data_files=data_files,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(338)main()
-> cache_dir=model_args.cache_dir,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(339)main()
-> use_auth_token=True if model_args.use_auth_token else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(335)main()
-> raw_datasets = load_dataset(
(Pdb) n
06/26/2023 19:45:33 - INFO - datasets.builder - Using custom data configuration default-2df3a67ae9ac7743
06/26/2023 19:45:33 - INFO - datasets.info - Loading Dataset Infos from /cfs/home/u021274/higo/myenv/lib64/python3.10/site-packages/datasets/packaged_modules/text
06/26/2023 19:45:33 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
06/26/2023 19:45:33 - INFO - datasets.info - Loading Dataset info from /cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2
06/26/2023 19:45:34 - WARNING - datasets.builder - Found cached dataset text (/cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)
06/26/2023 19:45:34 - INFO - datasets.info - Loading Dataset info from /cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 16.00it/s]
> /cfs/home/u021274/higo/run_mlm.py(343)main()
-> if "validation" not in raw_datasets.keys():
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(344)main()
-> raw_datasets["validation"] = load_dataset(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(345)main()
-> extension,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(346)main()
-> data_files=data_files,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(347)main()
-> split=f"train[:{data_args.validation_split_percentage}%]",
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(348)main()
-> cache_dir=model_args.cache_dir,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(349)main()
-> use_auth_token=True if model_args.use_auth_token else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(344)main()
-> raw_datasets["validation"] = load_dataset(
(Pdb) n
06/26/2023 19:45:52 - INFO - datasets.builder - Using custom data configuration default-2df3a67ae9ac7743
06/26/2023 19:45:52 - INFO - datasets.info - Loading Dataset Infos from /cfs/home/u021274/higo/myenv/lib64/python3.10/site-packages/datasets/packaged_modules/text
06/26/2023 19:45:52 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
06/26/2023 19:45:52 - INFO - datasets.info - Loading Dataset info from /cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2
06/26/2023 19:45:52 - WARNING - datasets.builder - Found cached dataset text (/cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)
06/26/2023 19:45:52 - INFO - datasets.info - Loading Dataset info from /cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2
> /cfs/home/u021274/higo/run_mlm.py(351)main()
-> raw_datasets["train"] = load_dataset(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(352)main()
-> extension,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(353)main()
-> data_files=data_files,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(354)main()
-> split=f"train[{data_args.validation_split_percentage}%:]",
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(355)main()
-> cache_dir=model_args.cache_dir,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(356)main()
-> use_auth_token=True if model_args.use_auth_token else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(351)main()
-> raw_datasets["train"] = load_dataset(
(Pdb) n
06/26/2023 19:46:02 - INFO - datasets.builder - Using custom data configuration default-2df3a67ae9ac7743
06/26/2023 19:46:02 - INFO - datasets.info - Loading Dataset Infos from /cfs/home/u021274/higo/myenv/lib64/python3.10/site-packages/datasets/packaged_modules/text
06/26/2023 19:46:02 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists.
06/26/2023 19:46:02 - INFO - datasets.info - Loading Dataset info from /cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2
06/26/2023 19:46:02 - WARNING - datasets.builder - Found cached dataset text (/cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)
06/26/2023 19:46:02 - INFO - datasets.info - Loading Dataset info from /cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2
> /cfs/home/u021274/higo/run_mlm.py(368)main()
-> "cache_dir": model_args.cache_dir,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(369)main()
-> "revision": model_args.model_revision,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(370)main()
-> "use_auth_token": True if model_args.use_auth_token else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(367)main()
-> config_kwargs = {
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(372)main()
-> if model_args.config_name:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(374)main()
-> elif model_args.model_name_or_path:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(377)main()
-> config = CONFIG_MAPPING[model_args.model_type]()
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(378)main()
-> logger.warning("You are instantiating a new config instance from scratch.")
(Pdb) n
06/26/2023 19:46:14 - WARNING - __main__ - You are instantiating a new config instance from scratch.
> /cfs/home/u021274/higo/run_mlm.py(379)main()
-> if model_args.config_overrides is not None:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(380)main()
-> logger.info(f"Overriding config: {model_args.config_overrides}")
(Pdb) n
06/26/2023 19:46:17 - INFO - __main__ - Overriding config: num_hidden_layers=6,max_position_embeddings=514
> /cfs/home/u021274/higo/run_mlm.py(381)main()
-> config.update_from_string(model_args.config_overrides)
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(382)main()
-> logger.info(f"New config: {config}")
(Pdb) n
06/26/2023 19:46:19 - INFO - __main__ - New config: RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50265
}

> /cfs/home/u021274/higo/run_mlm.py(385)main()
-> "cache_dir": model_args.cache_dir,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(386)main()
-> "use_fast": model_args.use_fast_tokenizer,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(387)main()
-> "revision": model_args.model_revision,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(388)main()
-> "use_auth_token": True if model_args.use_auth_token else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(384)main()
-> tokenizer_kwargs = {
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(390)main()
-> if model_args.tokenizer_name:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(391)main()
-> tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
(Pdb) n
[INFO|tokenization_auto.py:503] 2023-06-26 19:47:10,919 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:710] 2023-06-26 19:47:10,922 >> loading configuration file MyModel/config.json
[INFO|configuration_utils.py:768] 2023-06-26 19:47:10,932 >> Model config RobertaConfig {
  "_name_or_path": "MyModel",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

[INFO|tokenization_utils_base.py:1842] 2023-06-26 19:47:10,946 >> loading file vocab.json
[INFO|tokenization_utils_base.py:1842] 2023-06-26 19:47:10,946 >> loading file merges.txt
[INFO|tokenization_utils_base.py:1842] 2023-06-26 19:47:10,946 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:1842] 2023-06-26 19:47:10,946 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1842] 2023-06-26 19:47:10,946 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1842] 2023-06-26 19:47:10,946 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:710] 2023-06-26 19:47:10,947 >> loading configuration file MyModel/config.json
[INFO|configuration_utils.py:768] 2023-06-26 19:47:10,950 >> Model config RobertaConfig {
  "_name_or_path": "MyModel",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

[INFO|configuration_utils.py:710] 2023-06-26 19:47:11,024 >> loading configuration file MyModel/config.json
[INFO|configuration_utils.py:768] 2023-06-26 19:47:11,027 >> Model config RobertaConfig {
  "_name_or_path": "MyModel",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

> /cfs/home/u021274/higo/run_mlm.py(400)main()
-> if model_args.model_name_or_path:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(411)main()
-> logger.info("Training new model from scratch")
(Pdb) n
06/26/2023 19:47:14 - INFO - __main__ - Training new model from scratch
> /cfs/home/u021274/higo/run_mlm.py(412)main()
-> model = AutoModelForMaskedLM.from_config(config)
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(416)main()
-> embedding_size = model.get_input_embeddings().weight.shape[0]
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(417)main()
-> if len(tokenizer) > embedding_size:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(422)main()
-> if training_args.do_train:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(423)main()
-> column_names = list(raw_datasets["train"].features)
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(426)main()
-> text_column_name = "text" if "text" in column_names else column_names[0]
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(428)main()
-> if data_args.max_seq_length is None:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(438)main()
-> if data_args.max_seq_length > tokenizer.model_max_length:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(443)main()
-> max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length)
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(445)main()
-> if data_args.line_by_line:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(447)main()
-> padding = "max_length" if data_args.pad_to_max_length else False
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(449)main()
-> def tokenize_function(examples):
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(464)main()
-> with training_args.main_process_first(desc="dataset map tokenization"):
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(465)main()
-> if not data_args.streaming:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(466)main()
-> tokenized_datasets = raw_datasets.map(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(467)main()
-> tokenize_function,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(468)main()
-> batched=True,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(469)main()
-> num_proc=data_args.preprocessing_num_workers,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(470)main()
-> remove_columns=[text_column_name],
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(471)main()
-> load_from_cache_file=not data_args.overwrite_cache,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(472)main()
-> desc="Running tokenizer on dataset line_by_line",
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(466)main()
-> tokenized_datasets = raw_datasets.map(
(Pdb) n
06/26/2023 19:47:51 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-c8ae7ecb92d28874.arrow
06/26/2023 19:47:51 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /cfs/home/u021274/.cache/huggingface/datasets/text/default-2df3a67ae9ac7743/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2/cache-20fc928d1e2a7f3b.arrow
> /cfs/home/u021274/higo/run_mlm.py(464)main()
-> with training_args.main_process_first(desc="dataset map tokenization"):
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(542)main()
-> if training_args.do_train:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(543)main()
-> if "train" not in tokenized_datasets:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(545)main()
-> train_dataset = tokenized_datasets["train"]
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(546)main()
-> if data_args.max_train_samples is not None:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(550)main()
-> if training_args.do_eval:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(580)main()
-> pad_to_multiple_of_8 = data_args.line_by_line and training_args.fp16 and not data_args.pad_to_max_length
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(581)main()
-> data_collator = DataCollatorForLanguageModeling(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(582)main()
-> tokenizer=tokenizer,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(583)main()
-> mlm_probability=data_args.mlm_probability,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(584)main()
-> pad_to_multiple_of=8 if pad_to_multiple_of_8 else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(581)main()
-> data_collator = DataCollatorForLanguageModeling(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(588)main()
-> trainer = Trainer(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(589)main()
-> model=model,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(590)main()
-> args=training_args,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(591)main()
-> train_dataset=train_dataset if training_args.do_train else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(592)main()
-> eval_dataset=eval_dataset if training_args.do_eval else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(593)main()
-> tokenizer=tokenizer,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(594)main()
-> data_collator=data_collator,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(595)main()
-> compute_metrics=compute_metrics if training_args.do_eval and not is_torch_tpu_available() else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(597)main()
-> if training_args.do_eval and not is_torch_tpu_available()
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(596)main()
-> preprocess_logits_for_metrics=preprocess_logits_for_metrics
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(598)main()
-> else None,
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(588)main()
-> trainer = Trainer(
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(602)main()
-> if training_args.do_train:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(603)main()
-> checkpoint = None
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(604)main()
-> if training_args.resume_from_checkpoint is not None:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(606)main()
-> elif last_checkpoint is not None:
(Pdb) n
> /cfs/home/u021274/higo/run_mlm.py(608)main()
-> train_result = trainer.train(resume_from_checkpoint=checkpoint)
(Pdb) n
[INFO|trainer.py:769] 2023-06-26 19:48:46,054 >> The following columns in the training set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: special_tokens_mask. If special_tokens_mask are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
/cfs/home/u021274/higo/myenv/lib64/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[INFO|trainer.py:1680] 2023-06-26 19:48:46,071 >> ***** Running training *****
[INFO|trainer.py:1681] 2023-06-26 19:48:46,071 >>   Num examples = 2,353,535
[INFO|trainer.py:1682] 2023-06-26 19:48:46,071 >>   Num Epochs = 40
[INFO|trainer.py:1683] 2023-06-26 19:48:46,071 >>   Instantaneous batch size per device = 192
[INFO|trainer.py:1684] 2023-06-26 19:48:46,071 >>   Total train batch size (w. parallel, distributed & accumulation) = 768
[INFO|trainer.py:1685] 2023-06-26 19:48:46,071 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1686] 2023-06-26 19:48:46,071 >>   Total optimization steps = 122,560
[INFO|trainer.py:1687] 2023-06-26 19:48:46,074 >>   Number of trainable parameters = 82,170,969
[INFO|integrations.py:727] 2023-06-26 19:48:46,077 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: <USER>. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in /cfs/home/u021274/higo/wandb/run-20230626_194847-vr14588a
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run fragrant-universe-48
wandb: ⭐️ View project at <URL>
wandb: 🚀 View run at <URL>
  0%|                                                                                                                                                                           | 0/122560 [00:00<?, ?it/s][WARNING|logging.py:280] 2023-06-26 19:49:01,837 >> You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Narsil commented 1 year ago

Then it doesn't seem linked in any way to the tokenization; you would need to step into the train function to know more.

higopires commented 1 year ago

I see. How can I do this? Any suggestions? I'm kind of new to this, and I don't know how to start looking for the real problem inside the train function.

Narsil commented 1 year ago

Ask around on Discord https://discuss.huggingface.co/t/join-the-hugging-face-discord/11263 or the forum https://discuss.huggingface.co/

You might be able to find better help there for such things.
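
One thing you could try in the meantime (a sketch using only the Python standard library, nothing Transformers-specific): register a faulthandler signal handler near the top of main() in run_mlm.py, so that when the run hangs you can dump the stack of every thread and see where trainer.train() is blocked.

import faulthandler
import signal

# While the run is hung, `kill -USR1 <pid>` from another shell makes Python
# print the current traceback of every thread to stderr, showing exactly
# where the training process is blocked.
faulthandler.register(signal.SIGUSR1, all_threads=True)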

I'm closing this issue; feel free to reopen one when you have narrowed down what's going on.