huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

How to AutoTrain Seq2Seq? #376

Closed Lachkar-Ahmed-Salim closed 7 months ago

Lachkar-Ahmed-Salim commented 7 months ago

Hi everyone, I'm trying to fine-tune Helsinki-NLP/opus-mt-tc-big-ar-en on the local Arabic of Morocco, which is called Darija. The problem is that I'm unable to use AutoTrain: I keep getting a 500 error code. Attachments: Screenshot 2023-12-07 011848, Screenshot 2023-12-07 011912, output.csv. FYI: I didn't modify the "Training Parameters (find params to copy-paste [here])" area, so I don't know if that's necessary.

Lachkar-Ahmed-Salim commented 7 months ago

===== Application Startup at 2023-12-06 22:31:45 =====

========== == CUDA ==

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available. Use the NVIDIA Container Toolkit to start this container with GPU support; see https://docs.nvidia.com/datacenter/cloud-native/ .

INFO: Will watch for changes in these directories: ['/app']
WARNING: "workers" flag is ignored when reloading is enabled.
INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)
INFO: Started reloader process [35] using StatReload

INFO Authenticating user...
WARNING Parameters not supplied by user and set to default: merge_adapter, token, repo_id, lr, evaluation_strategy, logging_steps, save_total_limit, weight_decay, model_max_length, model_ref, batch_size, trainer, max_grad_norm, project_name, username, lora_alpha, lora_r, valid_split, warmup_ratio, train_split, lora_dropout, data_path, optimizer, dpo_beta, save_strategy, model, disable_gradient_checkpointing, seed, auto_find_batch_size, text_column, gradient_accumulation, use_flash_attention_2, prompt_text_column, push_to_hub, add_eos_token, rejected_text_column, scheduler
WARNING Parameters not supplied by user and set to default: scheduler, train_split, token, target_column, data_path, optimizer, repo_id, lr, max_seq_length, fp16, save_total_limit, logging_steps, evaluation_strategy, epochs, weight_decay, save_strategy, log, batch_size, model, seed, auto_find_batch_size, max_grad_norm, text_column, gradient_accumulation, username, project_name, push_to_hub, valid_split, warmup_ratio
WARNING Parameters not supplied by user and set to default: scheduler, train_split, token, target_column, data_path, optimizer, repo_id, lr, fp16, save_total_limit, logging_steps, evaluation_strategy, epochs, weight_decay, save_strategy, log, batch_size, model, seed, auto_find_batch_size, max_grad_norm, gradient_accumulation, project_name, image_column, push_to_hub, valid_split, warmup_ratio
WARNING Parameters not supplied by user and set to default: token, repo_id, validation_prompt, xformers, epochs, rank, resume_from_checkpoint, train_text_encoder, class_image_path, lr_power, adam_weight_decay, max_grad_norm, num_validation_images, tokenizer, adam_beta1, resolution, allow_tf32, local_rank, username, project_name, tokenizer_max_length, prior_generation_precision, pre_compute_text_embeddings, num_cycles, num_class_images, dataloader_num_workers, text_encoder_use_attention_mask, center_crop, adam_beta2, revision, warmup_steps, checkpoints_total_limit, validation_images, bf16, prior_loss_weight, image_path, model, sample_batch_size, seed, class_labels_conditioning, xl, checkpointing_steps, validation_epochs, logging, class_prompt, prior_preservation, push_to_hub, scale_lr, adam_epsilon, scheduler, use_8bit_adam
WARNING Parameters not supplied by user and set to default: token, repo_id, lr, max_seq_length, fp16, evaluation_strategy, logging_steps, save_total_limit, epochs, weight_decay, batch_size, max_grad_norm, username, project_name, lora_alpha, lora_r, valid_split, warmup_ratio, train_split, target_column, max_target_length, use_int8, data_path, optimizer, lora_dropout, save_strategy, use_peft, target_modules, model, seed, auto_find_batch_size, text_column, gradient_accumulation, push_to_hub, scheduler
WARNING Parameters not supplied by user and set to default: train_split, token, id_column, categorical_imputer, data_path, repo_id, time_limit, numerical_columns, model, seed, task, categorical_columns, numerical_imputer, numeric_scaler, target_columns, username, project_name, num_trials, push_to_hub, valid_split
INFO: Started server process [37]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 10.16.20.172:53634 - "GET / HTTP/1.1" 200 OK
INFO: 10.16.20.172:53634 - "GET /logo.png HTTP/1.1" 200 OK
INFO Task: llm:sft
INFO: 10.16.18.44:39542 - "GET /params/llm%3Asft HTTP/1.1" 200 OK
INFO hardware: A10G Large
INFO: 10.16.18.44:25742 - "POST /create_project HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app( # type: ignore[func-returns-value]
  File "/app/env/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/src/autotrain/app.py", line 248, in handle_form
    dset = AutoTrainDataset(
  File "", line 12, in __init__
  File "/app/src/autotrain/dataset.py", line 198, in __post_init__
    self.train_df, self.valid_df = self._preprocess_data()
  File "/app/src/autotrain/dataset.py", line 207, in _preprocess_data
    train_df.append(pd.read_csv(file))
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 617, in _read
    return parser.read(nrows)
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1748, in read
    ) = self._engine.read( # type: ignore[attr-defined]
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 21, saw 3

INFO Task: seq2seq
INFO: 10.16.18.44:16455 - "GET /params/seq2seq HTTP/1.1" 200 OK
INFO hardware: A10G Large
INFO: 10.16.20.172:47809 - "POST /create_project HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app( # type: ignore[func-returns-value]
  File "/app/env/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/src/autotrain/app.py", line 273, in handle_form
    dset = AutoTrainDataset(
  File "", line 12, in __init__
  File "/app/src/autotrain/dataset.py", line 198, in __post_init__
    self.train_df, self.valid_df = self._preprocess_data()
  File "/app/src/autotrain/dataset.py", line 207, in _preprocess_data
    train_df.append(pd.read_csv(file))
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 617, in _read
    return parser.read(nrows)
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1748, in read
    ) = self._engine.read( # type: ignore[attr-defined]
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 21, saw 3

INFO: 10.16.20.172:17946 - "GET / HTTP/1.1" 200 OK INFO: 10.16.20.172:17946 - "GET /logo.png HTTP/1.1" 200 OK

INFO Task: llm:sft
INFO: 10.16.20.172:17946 - "GET /params/llm%3Asft HTTP/1.1" 200 OK
INFO hardware: A10G Large
INFO: 10.16.20.172:46007 - "POST /create_project HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app( # type: ignore[func-returns-value]
  File "/app/env/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/src/autotrain/app.py", line 273, in handle_form
    dset = AutoTrainDataset(
  File "", line 12, in __init__
  File "/app/src/autotrain/dataset.py", line 198, in __post_init__
    self.train_df, self.valid_df = self._preprocess_data()
  File "/app/src/autotrain/dataset.py", line 207, in _preprocess_data
    train_df.append(pd.read_csv(file))
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 617, in _read
    return parser.read(nrows)
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1748, in read
    ) = self._engine.read( # type: ignore[attr-defined]
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 21, saw 3

INFO: 10.16.20.172:57414 - "GET /robots.txt HTTP/1.1" 404 Not Found INFO: 10.16.18.44:7777 - "GET / HTTP/1.1" 200 OK INFO: 10.16.20.172:1615 - "GET /logo.png HTTP/1.1" 200 OK INFO: 10.16.20.172:49738 - "GET / HTTP/1.1" 200 OK INFO: 10.16.34.18:1883 - "GET /logo.png HTTP/1.1" 200 OK

INFO Task: llm:sft INFO: 10.16.34.18:1883 - "GET /params/llm%3Asft HTTP/1.1" 200 OK

Lachkar-Ahmed-Salim commented 7 months ago

===== Build Queued at 2023-12-06 22:31:36 / Commit SHA: 71571d6 =====

--> FROM docker.io/huggingface/autotrain-advanced:latest@sha256:a160d4e1549a5ab6b98ecac68a9f67cd5d6108c70f121ff13022f92191a5f68b DONE 0.0s

--> Pushing image DONE 2.3s

--> Exporting cache DONE 0.1s

abhishekkrthakur commented 7 months ago

Let's keep the question to a single channel to avoid confusion :) Your file doesn't look like a proper CSV. Some rows have commas inside the sentences, which makes the parser believe there are more than two columns. You can fix it by quoting the fields properly.
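As an illustration of the quoting fix described above, here is a minimal sketch using Python's standard `csv` module. The column names `text` and `target`, the sample sentence pairs, and the helper name `write_quoted_csv` are all assumptions made for the example; they are not taken from the original output.csv or from AutoTrain itself.

```python
import csv
import io

# Hypothetical sentence pairs; the second source sentence contains a comma,
# which is the kind of row that breaks an unquoted two-column CSV.
pairs = [
    ("source one", "target one"),
    ("hello, how are you", "reply"),
]


def write_quoted_csv(rows, header=("text", "target")):
    """Serialize rows to CSV text, wrapping every field in double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = write_quoted_csv(pairs)

# Reading the text back shows that each row still has exactly two fields.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed)
```

Quoting every field is more than strictly needed (only fields containing commas, quotes, or newlines require it), but it keeps the file unambiguous for any parser.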

Lachkar-Ahmed-Salim commented 7 months ago

Ohh, I see. I'm going to clean that up, thank you. May I ask how you figured out from the log report that it's a CSV problem? I'm kind of new to this and would like to learn to read and understand these reports.

abhishekkrthakur commented 7 months ago

It's just a guess from the last line of the traceback: pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 21, saw 3 :)
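The message "Expected 2 fields in line 21, saw 3" says that one row split into more fields than the header has. A rough way to locate such rows before uploading is sketched below with the standard `csv` module. The function name `find_bad_rows` and the sample text are hypothetical, and the check is only an approximation of what the pandas parser does.

```python
import csv
import io


def find_bad_rows(csv_text, expected_fields=2):
    """Return (line_number, field_count) for rows with an unexpected field count."""
    bad = []
    reader = csv.reader(io.StringIO(csv_text))
    for row in reader:
        # Skip blank lines; report any row whose field count differs.
        if row and len(row) != expected_fields:
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical sample: line 3 has an unquoted comma inside the sentence,
# line 4 is the same sentence with proper quoting.
sample = 'text,target\nhello,hi\nhello, friend,hi\n"hello, friend",hi\n'
print(find_bad_rows(sample))  # [(3, 3)]
```

For a real file, the same function could be called on the file's contents, for example `find_bad_rows(open("output.csv", encoding="utf-8").read())`, assuming the file is UTF-8 encoded.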

Lachkar-Ahmed-Salim commented 7 months ago

It worked after correcting the CSV file, but now I get this error while the model is training:

===== Application Startup at 2023-12-07 12:07:38 =====

========== == CUDA ==

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

INFO AUTOTRAIN_USERNAME: shelvin94 INFO PROJECT_NAME: lfi0-zf3s-aiqd-0 INFO TASK_ID: 28 INFO DATA_PATH: shelvin94/autotrain-data-lfi0-zf3s-aiqd INFO MODEL: Helsinki-NLP/opus-mt-tc-big-ar-en INFO OUTPUT_MODEL_REPO: shelvin94/lfi0-zf3s-aiqd-0 INFO: Started server process [34] INFO: Waiting for application startup. INFO {'data_path': 'shelvin94/autotrain-data-lfi0-zf3s-aiqd', 'model': 'Helsinki-NLP/opus-mt-tc-big-ar-en', 'username': 'shelvin94', 'seed': 42, 'train_split': 'train', 'valid_split': 'validation', 'projectname': 'lfi0-zf3s-aiqd-0', 'token': 'hf**', 'push_to_hub': True, 'text_column': 'autotrain_text', 'target_column': 'autotrain_label', 'repo_id': 'shelvin94/lfi0-zf3s-aiqd-0', 'lr': 5e-05, 'epochs': 3, 'max_seq_length': 128, 'max_target_length': 128, 'batch_size': 8, 'warmup_ratio': 0.1, 'gradient_accumulation': 1, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'auto_find_batch_size': False, 'fp16': False, 'save_total_limit': 1, 'save_strategy': 'epoch', 'use_peft': False, 'use_int8': False, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'target_modules': []} INFO ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'no', '-m', 'autotrain.trainers.seq2seq', '--training_config', '/tmp/model/training_params.json'] INFO Started training with PID 85 INFO Process status: running INFO: Application startup complete. 
INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit) INFO: 10.16.20.172:24688 - "GET /?logs=container&__sign=eyJhbGciOiJFZERTQSJ9.eyJyZWFkIjp0cnVlLCJvbkJlaGFsZk9mIjp7Il9pZCI6IjYzYzJhYWEyOGNjODdjZjBjMDU5ZDUxMCIsInVzZXIiOiJzaGVsdmluOTQifSwiaWF0IjoxNzAxOTUwODY0LCJzdWIiOiIvc3BhY2VzL3NoZWx2aW45NC9hdXRvdHJhaW4tbGZpMC16ZjNzLWFpcWQtMCIsImV4cCI6MTcwMjAzNzI2NCwiaXNzIjoiaHR0cHM6Ly9odWdnaW5nZmFjZS5jbyJ9.WdXQ8vUAYIo-1RGqtxXwmRmg8LUyR0v7uVKGlqBB0hz50YKaCzoU3CWLVih3gS5j8Nt-KPoC6RHVTZ8edezWAQ HTTP/1.1" 200 OK The following values were not passed to accelerate launch and had defaults used instead: --dynamo_backend was set to a value of 'no' To avoid this warning pass in values for each of the problematic parameters or run accelerate config. INFO Process status: sleeping Downloading builder script: 0%| | 0.00/6.27k [00:00<?, ?B/s] Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<00:00, 38.5MB/s] INFO Starting training... INFO Training config: {'data_path': 'shelvin94/autotrain-data-lfi0-zf3s-aiqd', 'model': 'Helsinki-NLP/opus-mt-tc-big-ar-en', 'username': 'shelvin94', 'seed': 42, 'train_split': 'train', 'valid_split': 'validation', 'project_name': '/tmp/model', 'token': '*****', 'push_to_hub': True, 'text_column': 'autotrain_text', 'target_column': 'autotrain_label', 'repo_id': 'shelvin94/lfi0-zf3s-aiqd-0', 'lr': 5e-05, 'epochs': 3, 'max_seq_length': 128, 'max_target_length': 128, 'batch_size': 8, 'warmup_ratio': 0.1, 'gradient_accumulation': 1, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'auto_find_batch_size': False, 'fp16': False, 'save_total_limit': 1, 'save_strategy': 'epoch', 'use_peft': False, 'use_int8': False, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'target_modules': []} Downloading readme: 0%| | 0.00/617 [00:00<?, ?B/s] Downloading readme: 100%|██████████| 617/617 [00:00<00:00, 10.7MB/s] Downloading data files: 0%| | 0/2 
[00:00<?, ?it/s] Downloading data: 0%| | 0.00/16.8k [00:00<?, ?B/s] Downloading data: 100%|██████████| 16.8k/16.8k [00:00<00:00, 105kB/s] Downloading data: 100%|██████████| 16.8k/16.8k [00:00<00:00, 105kB/s] Downloading data files: 50%|█████ | 1/2 [00:00<00:00, 6.21it/s] Downloading data: 0%| | 0.00/5.97k [00:00<?, ?B/s] Downloading data: 100%|██████████| 5.97k/5.97k [00:00<00:00, 103kB/s] Downloading data files: 100%|██████████| 2/2 [00:00<00:00, 9.00it/s] Extracting data files: 0%| | 0/2 [00:00<?, ?it/s] Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 2506.31it/s] Generating train split: 0%| | 0/702 [00:00<?, ? examples/s] Generating train split: 100%|██████████| 702/702 [00:00<00:00, 290747.65 examples/s] Generating validation split: 0%| | 0/176 [00:00<?, ? examples/s] Generating validation split: 100%|██████████| 176/176 [00:00<00:00, 184872.90 examples/s] config.json: 0%| | 0.00/1.14k [00:00<?, ?B/s] config.json: 100%|██████████| 1.14k/1.14k [00:00<00:00, 13.7MB/s] pytorch_model.bin: 0%| | 0.00/603M [00:00<?, ?B/s] pytorch_model.bin: 2%|▏ | 10.5M/603M [00:00<00:18, 31.8MB/s] pytorch_model.bin: 3%|▎ | 21.0M/603M [00:00<00:13, 43.5MB/s] pytorch_model.bin: 10%|█ | 62.9M/603M [00:00<00:03, 137MB/s] pytorch_model.bin: 24%|██▍ | 147M/603M [00:00<00:01, 317MB/s] pytorch_model.bin: 35%|███▍ | 210M/603M [00:00<00:01, 378MB/s] pytorch_model.bin: 43%|████▎ | 262M/603M [00:00<00:00, 392MB/s] pytorch_model.bin: 59%|█████▉ | 357M/603M [00:01<00:00, 534MB/s] pytorch_model.bin: 75%|███████▍ | 451M/603M [00:01<00:00, 621MB/s] pytorch_model.bin: 95%|█████████▍| 572M/603M [00:01<00:00, 780MB/s] pytorch_model.bin: 100%|█████████▉| 603M/603M [00:01<00:00, 452MB/s] INFO Process status: sleeping /app/env/lib/python3.10/site-packages/transformers/utils/hub.py:374: FutureWarning: The use_auth_token argument is deprecated and will be removed in v5 of Transformers. Please use token instead. 
warnings.warn( generation_config.json: 0%| | 0.00/301 [00:00<?, ?B/s] generation_config.json: 100%|██████████| 301/301 [00:00<00:00, 3.91MB/s] tokenizer_config.json: 0%| | 0.00/337 [00:00<?, ?B/s] tokenizer_config.json: 100%|██████████| 337/337 [00:00<00:00, 4.31MB/s] source.spm: 0%| | 0.00/915k [00:00<?, ?B/s] source.spm: 100%|██████████| 915k/915k [00:00<00:00, 291MB/s] INFO Process status: sleeping target.spm: 0%| | 0.00/804k [00:00<?, ?B/s] target.spm: 100%|██████████| 804k/804k [00:00<00:00, 231MB/s] vocab.json: 0%| | 0.00/2.20M [00:00<?, ?B/s] vocab.json: 100%|██████████| 2.20M/2.20M [00:00<00:00, 29.9MB/s] special_tokens_map.json: 0%| | 0.00/65.0 [00:00<?, ?B/s] special_tokens_map.json: 100%|██████████| 65.0/65.0 [00:00<00:00, 903kB/s] 0%| | 0/264 [00:00<?, ?it/s] 0%| | 1/264 [00:00<02:31, 1.73it/s] 1%| | 3/264 [00:00<00:51, 5.07it/s]

[Progress-bar output condensed. Training runs from step 4/264 to 264/264 at roughly 13 it/s, with an 11-batch evaluation pass after steps 88, 176, and 264. "INFO Process status: sleeping" is logged periodically throughout.]

There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.encoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', 'model.decoder.embed_positions.weight', 'lm_head.weight'].

100%|██████████| 264/264 [00:36<00:00, 7.26it/s]

INFO Finished training, saving model...
100%|██████████| 11/11 [00:02<00:00, 4.18it/s]
INFO Pushing model to hub...
ERROR train has failed due to an exception:
ERROR Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
    response.raise_for_status()
  File "/app/env/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/src/autotrain/utils.py", line 280, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/seq2seq/main.py", line 233, in train
    api.create_repo(repo_id=config.repo_id, repo_type="model", private=True)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2816, in create_repo
    hf_raise_for_status(r)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 330, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-6571b5c7-3f9d90bf61c52c8820150f94;6dabf0b9-9920-4e52-8562-59a0cb19e525)

You already created this model repo

INFO Pausing space...
INFO Process status: zombie
INFO Training process finished. Shutting down the server.
INFO Process 34 or one of its children has not terminated in time
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [34]

abhishekkrthakur commented 7 months ago

Please make sure your project name is unique and that no other repository in your Hugging Face account has the same name: no datasets, models, or Spaces. I generally use the randomly generated name and then rename the model later if needed.
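For anyone who wants to pick their own name but still avoid the 409 conflict above, here is a minimal sketch of generating a project name with a random suffix. The helper `make_project_name` and the prefix `darija ar-en` are hypothetical, not part of AutoTrain, and the character filter assumes repo names are limited to letters, digits, `-`, `_`, and `.`.

```python
import re
import secrets


def make_project_name(prefix: str = "autotrain", n_random_chars: int = 10) -> str:
    """Return a project name with a random suffix (hypothetical helper, not an AutoTrain API)."""
    # Assumption: only letters, digits, '-', '_' and '.' are safe in a repo name,
    # so every other run of characters is collapsed into a single '-'.
    safe_prefix = re.sub(r"[^A-Za-z0-9._-]+", "-", prefix).strip("-.") or "autotrain"
    # token_hex(k) yields 2*k hex characters; request enough bytes, then trim.
    suffix = secrets.token_hex((n_random_chars + 1) // 2)[:n_random_chars]
    return f"{safe_prefix}-{suffix}"


name = make_project_name("darija ar-en")
print(name)  # the suffix differs on every run
```

The random suffix makes a clash with an existing model, dataset, or Space very unlikely. If you call `huggingface_hub` directly rather than through AutoTrain, `create_repo` also accepts `exist_ok=True`, which skips the error when the repo already exists. The call shown in the traceback does not pass it.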

Lachkar-Ahmed-Salim commented 7 months ago

I used a unique name, and now I get:

lm_logits = self.lm_head(outputs[0]) + self.final_logits_bias
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.48 GiB. GPU 0 has a total capacty of 21.99 GiB of which 6.47 GiB is free. Process 43024 has 15.51 GiB memory in use. Of the allocated memory 13.89 GiB is allocated by PyTorch, and 1.31 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is it because I am using a free GPU?

abhishekkrthakur commented 7 months ago

Yes. Big models need a bigger GPU.
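To see how close the run came to fitting, the figures in the error message above can be pulled out and compared. This is only an illustrative sketch: `summarize_oom` is a hypothetical helper, and the regular expressions assume the exact message wording quoted in this thread (including PyTorch's "capacty" spelling).

```python
import re

# Message text copied from the error reported above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 6.48 GiB. "
    "GPU 0 has a total capacty of 21.99 GiB of which 6.47 GiB is free. "
    "Process 43024 has 15.51 GiB memory in use. "
    "Of the allocated memory 13.89 GiB is allocated by PyTorch, "
    "and 1.31 GiB is reserved by PyTorch but unallocated."
)


def summarize_oom(message: str) -> dict:
    """Extract the GiB figures from an OOM message (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        # \w+ tolerates both "capacity" and the "capacty" spelling seen in the log.
        "total": r"total \w+ of ([\d.]+) GiB",
        "free": r"of which ([\d.]+) GiB is free",
        "reserved_unallocated": r"and ([\d.]+) GiB is reserved by PyTorch but unallocated",
    }
    figures = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {key!r} in message")
        figures[key] = float(match.group(1))
    # How much more memory the failed allocation needed than was free.
    figures["shortfall"] = round(figures["requested"] - figures["free"], 2)
    return figures


summary = summarize_oom(OOM_MESSAGE)
print(summary)
```

Here a single 6.48 GiB allocation was requested while 6.47 GiB was free, on a card where 15.51 GiB was already in use. Lowering `batch_size` or `max_seq_length` (both appear in the parameter list in the startup log) is a common way to shrink that allocation, though the thread does not confirm whether that would be enough for this model.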