modal-labs / llm-finetuning

Guide for fine-tuning Llama/Mistral/CodeLlama models and more
MIT License

Loop attempt failed and parallelism errors #10

Closed: priamai closed this issue 9 months ago

priamai commented 1 year ago

Hi there, I just ran:

modal run train.py --dataset sql_dataset.py --base chat7 --run-id chat7-sql

I have about 15 USD of credits left.

And I get a plethora of red errors in the console:

To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

And training never finishes:

Training Epoch: 10/10, step 0/9 completed (loss: 0.6004008054733276):  11%|█         | 1/9 [00:47<06:16, 47.11s/it]
Training Epoch: 10/10, step 0/9 completed (loss: 0.6142641305923462):  11%|█         | 1/9 [00:47<06:17, 47.15s/it]
Training Epoch: 10/10, step 0/9 completed (loss: 0.5926907062530518):  11%|█         | 1/9 [00:47<06:17, 47.15s/it]
Training Epoch: 10/10, step 0/9 completed (loss: 0.5915755033493042):  11%|█         | 1/9 [00:47<06:17, 47.13s/it]
Traceback (most recent call last):
  File "/pkg/modal/_container_entrypoint.py", line 366, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 484, in run_inputs
    res = imp_fun.fun(*args, **kwargs)
  File "/root/train.py", line 47, in train
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 877, in _invoke_run
    time.sleep(monitor_interval)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2 got signal: 2
Runner terminated, in-progress inputs will be re-scheduled

And then this is the last message:

grpclib.exceptions.GRPCError: (<Status.FAILED_PRECONDITION: 9>, 'App state is APP_STATE_STOPPED', None)

Any idea what that means?

gongy commented 1 year ago

Hi, did you run this with --detach?

If the terminal you launched the run from disconnects (for example, because your laptop lid is shut), the app is stopped as soon as your client stops heartbeating; signal 2 in your traceback is SIGINT, which is sent to the training processes when that happens. To avoid this, pass --detach, which means "keep the app running even if I disconnect". It is off by default: we want continuation after a disconnect to be intentional rather than accidental.
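For example, your original command becomes:

modal run --detach train.py --dataset sql_dataset.py --base chat7 --run-id chat7-sql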

For follow-up support, feel free to join our Slack so we can resolve things more quickly; I can help you with your credits there as well.

The red tokenizers warnings are benign; they show up with the official Meta repo as well. If you want to silence them, one option is to set the environment variable yourself.
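A minimal sketch (assuming it runs at the top of train.py, before any tokenizer is loaded or worker processes are forked):

import os

# Disable the tokenizers library's internal thread parallelism so that
# forked worker processes don't trigger the fork warning or risk deadlocks.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Per the warning text, explicitly setting the variable to either value suppresses it: "false" trades away the Rust-side parallel tokenization, while "true" keeps parallelism enabled across forks.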