huggingface / autotrain-advanced

πŸ€— AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] LLM-ORPO training always failing due to 'Tokenizer' argument #802

Closed jmparejaz closed 1 week ago

jmparejaz commented 1 week ago

Prerequisites

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

(two screenshots of the UI parameters attached)

Error Logs

INFO: Started server process [3553] INFO: Waiting for application startup. INFO | 2024-11-09 16:40:30 | autotrain.commands:launch_command:523 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-ihh1b-19lgj/training_params.json'] INFO | 2024-11-09 16:40:30 | autotrain.commands:launch_command:524 - {'model': 'Qwen/Qwen2.5-7B', 'project_name': 'autotrain-ihh1b-19lgj', 'data_path': 'mlabonne/orpo-dpo-mix-40k', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 2048, 'model_max_length': 32768, 'padding': 'right', 'trainer': 'orpo', 'use_flash_attention_2': True, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 0.0003, 'epochs': 1, 'batch_size': 5, 'warmup_ratio': 0.1, 'gradient_accumulation': 4, 'optimizer': 'adamw_bnb_8bit', 'scheduler': 'cosine', 'weight_decay': 0.05, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': '{%- if tools %}\n {{- \'<|im_start|>system\n\' }}\n {%- if messages[0][\'role\'] == \'system\' %}\n {{- messages[0][\'content\'] }}\n {%- else %}\n {{- \'You are a helpful assistant.\' }}\n {%- endif %}\n {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n" }}\n {%- for tool in tools %}\n {{- "\n" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- "\n\n\nFor each function call, return a json object with function name and arguments within XML tags:\n\n{\"name\": , \"arguments\": }\n<|im_end|>\n" }}\n{%- else %}\n {%- if messages[0][\'role\'] == \'system\' %}\n {{- \'<|im_start|>system\n\' + messages[0][\'content\'] + \'<|im_end|>\n\' }}\n {%- else %}\n {{- \'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n\' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}\n {{- \'<|im_start|>\' + message.role + \'\n\' + message.content + \'<|im_end|>\' + \'\n\' }}\n {%- elif message.role == "assistant" %}\n {{- \'<|im_start|>\' + message.role }}\n {%- if message.content %}\n {{- \'\n\' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- \'\n\n{"name": "\' }}\n {{- tool_call.name }}\n {{- \'", "arguments": \' }}\n {{- tool_call.arguments | tojson }}\n {{- \'}\n\' }}\n {%- endfor %}\n {{- \'<|im_end|>\n\' }}\n {%- elif message.role == "tool" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}\n {{- \'<|im_start|>user\' }}\n {%- endif %}\n {{- \'\n\n\' }}\n {{- message.content }}\n {{- \'\n\' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n {{- \'<|im_end|>\n\' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- \'<|im_start|>assistant\n\' }}\n{%- endif %}\n', 'quantization': 'int4', 'target_modules': 'all-linear', 'merge_adapter': True, 'peft': True, 'lora_r': 16, 'lora_alpha': 64, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 8192, 'max_completion_length': 4096, 'prompt_text_column': 'prompt', 'text_column': 'chosen', 'rejected_text_column': 'rejected', 'push_to_hub': True, 'username': 
'xxxxxx', 'token': '*****', 'unsloth': False, 'distributed_backend': 'ddp'}
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api:lifespan:82 - Started training with PID 3619
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)
The following values were not passed to accelerate launch and had defaults used instead: --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[2024-11-09 16:40:34,843] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO | 2024-11-09 16:40:35 | autotrain.trainers.clm.train_clm_orpo:train:11 - Starting ORPO training...

Generating train split: 100%|██████████| 44245/44245 [00:00<00:00, 78070.03 examples/s]
INFO | 2024-11-09 16:40:41 | autotrain.trainers.clm.utils:process_input_data:546 - Train data: Dataset({ features: ['source', 'chosen', 'rejected', 'prompt', 'question'], num_rows: 44245 })
INFO | 2024-11-09 16:40:41 | autotrain.trainers.clm.utils:process_input_data:547 - Valid data: None
INFO | 2024-11-09 16:40:53 | autotrain.trainers.clm.utils:configure_logging_steps:667 - configuring logging steps
INFO | 2024-11-09 16:40:53 | autotrain.trainers.clm.utils:configure_logging_steps:680 - Logging steps: 25
INFO | 2024-11-09 16:40:53 | autotrain.trainers.clm.utils:configure_training_args:719 - configuring training args
INFO | 2024-11-09 16:40:53 | autotrain.trainers.clm.utils:configure_block_size:797 - Using block size 2048
INFO | 2024-11-09 16:40:54 | autotrain.trainers.clm.utils:get_model:873 - Can use unsloth: False
WARNING | 2024-11-09 16:40:54 | autotrain.trainers.clm.utils:get_model:915 - Unsloth not available, continuing without it...
INFO | 2024-11-09 16:40:54 | autotrain.trainers.clm.utils:get_model:917 - loading model config...
INFO | 2024-11-09 16:40:54 | autotrain.trainers.clm.utils:get_model:925 - loading model...
low_cpu_mem_usage was None, now default to True since model is quantized.
INFO | 2024-11-09 16:41:52 | autotrain.backends.nvcf:_poll_nvcf:119 - xxxxx-autotrain-ihh1b-19lgj: GET - 202 - Polling reqId for completion
INFO | 2024-11-09 16:42:27 | autotrain.backends.nvcf:_poll_nvcf:119 - xxxxxx-autotrain-ihh1b-19lgj: GET - 202 - Polling reqId for completion
INFO | 2024-11-09 16:42:47 | autotrain.backends.nvcf:_poll_nvcf:119 - xxxxxx-autotrain-ihh1b-19lgj: GET - 200 - Polling completed
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::95 - AUTOTRAIN_USERNAME: XXXXXX
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::96 - PROJECT_NAME: autotrain-ihh1b-19lgj
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::97 - TASK_ID: 9
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::98 - DATA_PATH: mlabonne/orpo-dpo-mix-40k
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::99 - MODEL: Qwen/Qwen2.5-7B

Downloading shards: 100%|██████████| 4/4 [01:12<00:00, 18.09s/it]
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use attn_implementation="flash_attention_2" instead.

Loading checkpoint shards: 100%|██████████| 4/4 [00:12<00:00, 3.03s/it]
INFO | 2024-11-09 16:42:21 | autotrain.trainers.clm.utils:get_model:956 - model dtype: torch.float16
INFO | 2024-11-09 16:42:22 | autotrain.trainers.clm.train_clm_orpo:train:39 - creating trainer
ERROR | 2024-11-09 16:42:22 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/src/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/clm/__main__.py", line 43, in train
    train_orpo(config)
  File "/app/src/autotrain/trainers/clm/train_clm_orpo.py", line 47, in train
    trainer = ORPOTrainer(
TypeError: ORPOTrainer.__init__() got an unexpected keyword argument 'tokenizer'
ERROR | 2024-11-09 16:42:22 | autotrain.trainers.common:wrapper:216 - ORPOTrainer.__init__() got an unexpected keyword argument 'tokenizer'
INFO | 2024-11-09 16:42:30 | autotrain.app.utils:get_running_jobs:40 - Killing PID: 3619
INFO | 2024-11-09 16:42:30 | autotrain.app.utils:kill_process_by_pid:90 - Sent SIGTERM to process with PID 3619
INFO | 2024-11-09 16:42:30 | autotrain.app.training_api:run_main:56 - No running jobs found. Shutting down the server.
INFO | 2024-11-09 16:42:30 | autotrain.app.training_api:graceful_exit:35 - SIGTERM received. Performing cleanup...
ERROR: Traceback (most recent call last):
  File "/app/env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/app/env/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/app/src/autotrain/app/training_api.py", line 57, in run_main
    kill_process_by_pid(os.getpid())
  File "/app/src/autotrain/app/utils.py", line 89, in kill_process_by_pid
    os.kill(pid, signal.SIGTERM)
  File "/app/src/autotrain/app/training_api.py", line 36, in graceful_exit
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 700, in lifespan
    await receive()
  File "/app/env/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
  File "/app/env/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

Task exception was never retrieved
future: <Task finished name='Task-3' coro=<BackgroundRunner.run_main() done, defined at /app/src/autotrain/app/training_api.py:52> exception=SystemExit(0)>
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/uvicorn/main.py", line 412, in main
    run(
  File "/app/env/lib/python3.10/site-packages/uvicorn/main.py", line 579, in run
    server.run()
  File "/app/env/lib/python3.10/site-packages/uvicorn/server.py", line 65, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/app/env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/app/env/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/app/src/autotrain/app/training_api.py", line 57, in run_main
    kill_process_by_pid(os.getpid())
  File "/app/src/autotrain/app/utils.py", line 89, in kill_process_by_pid
    os.kill(pid, signal.SIGTERM)
  File "/app/src/autotrain/app/training_api.py", line 36, in graceful_exit
    sys.exit(0)
SystemExit: 0

Additional Information

It initially worked: the trainer ran for over 9 hours and reached 95% progress across all 7 epochs, but then it stopped pushing new commits to the repo. I waited 4 hours and then decided to restart the Space.

After that, I tried to replicate my initial training with the same parameters and it never worked, always failing with the same error about the tokenizer argument. Since then I have been trying to make it work by tweaking hyperparameters, models, datasets, GPUs, etc., and it never worked.

abhishekkrthakur commented 1 week ago

fixed in version 0.8.30 and above. thank you for reporting it.

jmparejaz commented 1 week ago

It is breaking with the same error :'( (screenshots attached: autotrain_error1, autotrain_error2)

abhishekkrthakur commented 1 week ago

you need to factory reset your space to make sure you are on the latest version

jmparejaz commented 1 week ago

Yeah, I did that, and tested it multiple times before posting here. I even changed my browser to avoid any cache or cookies or whatever... I highlighted in the first image of my previous message that it is installing version 0.8.31.dev8 of AutoTrain Advanced. But anyway, I am going to factory reset again.
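(For anyone else double-checking which build actually got installed after a reset, a quick sketch of how to read the installed version from inside the Space; this assumes the PyPI distribution name is autotrain-advanced:)

```python
# Print the installed AutoTrain Advanced version (distribution name assumed).
from importlib.metadata import version

print(version("autotrain-advanced"))
```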

abhishekkrthakur commented 1 week ago

trying on my end too as we speak.

abhishekkrthakur commented 1 week ago

oh wait, you are running on DGX Cloud?

jmparejaz commented 1 week ago

AutoTrain with DGX Cloud, using the HF Space image.

abhishekkrthakur commented 1 week ago

ahh... something happened and the latest version was not deployed on DGX Cloud. i've created another deployment just now and it should start working after ~45-60 minutes. it seems i overlooked the screenshot you posted in the OP. apologies for the inconvenience. it will work after the aforementioned time. i've tested it in spaces.

jmparejaz commented 1 week ago

Don't worry Abhishek, thanks for your support! AutoTrain with DGX is amazing!

jmparejaz commented 1 week ago

It is working now with DGX, but metrics are not being reported to TensorBoard (screenshot attached).

It finished the training, but without metrics logging it is like walking in a dark forest :/ (screenshots attached)

abhishekkrthakur commented 1 week ago

sometimes the tensorboard tab takes a while to load. i see you have a "runs" folder so metrics are available. i'll let the team know if this persists.

jmparejaz commented 1 week ago

It persists... TensorBoard is not loading any metrics. I think the metrics are not being written to the runs folder; please check the event file: events.out.tfevents.1731348305.0-sr-06e8e06b-d870-4c1a-b787-fa215ea35067.1973.0.zip
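In case it helps to reproduce, this is how the event file can be checked for scalar entries (a sketch; the path is a placeholder for the unzipped events.out.tfevents.* file from the runs folder):

```python
# Sketch: list the scalar tags stored in a TensorBoard event file.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

event_file = "runs/events.out.tfevents.1731348305.placeholder"  # placeholder path
acc = EventAccumulator(event_file)
acc.Reload()

scalar_tags = acc.Tags().get("scalars", [])
print("scalar tags:", scalar_tags)  # an empty list means no metrics were ever written
for tag in scalar_tags:
    for event in acc.Scalars(tag)[:3]:
        print(tag, "step", event.step, "value", event.value)
```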

In the past I tried to use the argument "report_to": "wandb" but it didn't work. I set the WANDB_API_KEY, WANDB_LOG_MODEL and WANDB_PROJECT secret vars. Is there any way to use the AutoTrain UI with W&B?
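(A hedged sketch of what I mean, not a confirmed feature: the training config logged above has a 'log' field set to 'tensorboard', so I assume switching that field to 'wandb' plus the usual W&B environment variables would be the intended route, but I have not been able to verify it.)

```python
# Hedged sketch, not a confirmed AutoTrain feature: flip the 'log' field of the
# generated training config to 'wandb' and provide the usual W&B credentials.
# The file path mirrors the launch command in the logs above; treating 'wandb'
# as an accepted value for 'log' is an assumption on my side.
import json
import os

os.environ["WANDB_API_KEY"] = "<your-api-key>"   # same secrets mentioned above
os.environ["WANDB_PROJECT"] = "autotrain-orpo"

config_path = "autotrain-ihh1b-19lgj/training_params.json"
with open(config_path) as f:
    params = json.load(f)

params["log"] = "wandb"  # assumption: maps through to the trainer's report_to

with open(config_path, "w") as f:
    json.dump(params, f, indent=2)
```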

jmparejaz commented 1 week ago

I am closing this issue and creating a new one for the TensorBoard problem.