modal-labs / llm-finetuning

Guide for fine-tuning Llama/Mistral/CodeLlama models and more
MIT License

Loop attempt failed and parallelism errors #10

Closed: priamai closed this issue 9 months ago

priamai commented 1 year ago

Hi there, I just ran:

modal run train.py --dataset sql_dataset.py --base chat7 --run-id chat7-sql

I have about 15 USD of credits left.

And I get a plethora of red errors in the console:

To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

And training never finishes:

Training Epoch: 10/10, step 0/9 completed (loss: 0.6004008054733276):  11%|█         | 1/9 [00:47<06:16, 47.11s/it]
Training Epoch: 10/10, step 0/9 completed (loss: 0.6142641305923462):  11%|█         | 1/9 [00:47<06:17, 47.15s/it]
Training Epoch: 10/10, step 0/9 completed (loss: 0.5926907062530518):  11%|█         | 1/9 [00:47<06:17, 47.15s/it]
Training Epoch: 10/10, step 0/9 completed (loss: 0.5915755033493042):  11%|█         | 1/9 [00:47<06:17, 47.13s/it]
Traceback (most recent call last):
  File "/pkg/modal/_container_entrypoint.py", line 366, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 484, in run_inputs
    res = imp_fun.fun(*args, **kwargs)
  File "/root/train.py", line 47, in train
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 877, in _invoke_run
    time.sleep(monitor_interval)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2 got signal: 2
Runner terminated, in-progress inputs will be re-scheduled

And then this is the last message:

grpclib.exceptions.GRPCError: (<Status.FAILED_PRECONDITION: 9>, 'App state is APP_STATE_STOPPED', None)

Any idea what that means?

gongy commented 1 year ago

Hi, did you run this with --detach?

If the terminal you launched the run from disconnects (for example, because your laptop lid is shut), the app is stopped as soon as your client stops heartbeating; signal 2 in your traceback is SIGINT, which is sent to the training processes when that happens. To avoid this, pass --detach, which means "keep the app running even if I disconnect". It is off by default: we want continuation after a disconnect to be intentional rather than accidental.
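For example, your original command becomes:

modal run --detach train.py --dataset sql_dataset.py --base chat7 --run-id chat7-sql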

For follow-up support, feel free to join our Slack so we can resolve things more quickly; I can help you with your credits there as well.

The red tokenizers warnings are benign; they show up with the official Meta repo as well. If you want to silence them, one option is to set the environment variable yourself.
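A minimal sketch (assuming it runs at the top of train.py, before any tokenizer is loaded or worker processes are forked):

import os

# Disable the tokenizers library's internal thread parallelism so that
# forked worker processes don't trigger the fork warning or risk deadlocks.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Per the warning text, explicitly setting the variable to either value suppresses it: "false" trades away the Rust-side parallel tokenization, while "true" keeps parallelism enabled across forks.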