Training jobs stuck at saving a checkpoint - likely due to failure at 1-2 workers

Ivan-Zhou commented 7 months ago

After saving a checkpoint, the model training may stuck and could not recover from it until restarting:

There is no error message in the log. The instance did not crash/restart;
The behavior is not deterministic, it does not always happen, but it happens frequently;
Once it stuck, there is increase in the process CPU threads and drop in the non-swap process memory.

After watching this issue for a few times on V3-64 and V3-128, I noticed that the the log Saved checkpoint for step xxx does not print for all nodes, meaning that it might be due to the root cause that checkpoint saving in one or two worker fails (silently) to complete, thus pausing the entire training job.

Examples:

Example 1

This training job stuck when saving a checkpoint at 3 AM on Feb 18. It stuck until I terminated the job at 4 PM on the same day. (I restarted this job one day later and it is able to continue the training) https://wandb.ai/stanford-mercury/levanter/runs/we9felvkd0elo30217/

Example 2

This training job has seen this issue for multiple times. Every time, I terminated and restarted the job, and it was able to resume from the saved checkpoint.

https://wandb.ai/stanford-mercury/levanter/runs/we9felvkd0elo30212

At step 643, it printed Saved checkpoint for 7 times, but it is expected to print 8 times (I was using v3-64)

At step 9485, it printed for 13 times. I was using v3-128, so it is expected to print for 16 times.

24-02-24T13:38:44 - 14 - levanter.checkpoint - checkpoint.py:221 - INFO :: Saving checkpoint at step 9485 to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485
2024-02-24T13:38:44 - 3 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 1 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 8 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 7 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 2 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 11 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 12 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 4 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 13 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 9 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 15 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 10 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 5 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 0 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 6 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:44 - 14 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485
2024-02-24T13:38:48 - 5 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:38:48 - 3 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:38:48 - 10 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:38:49 - 0 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:44:47 - 14 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:44:48 - 11 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:44:48 - 6 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:44:49 - 9 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:50:48 - 12 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:50:48 - 7 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:50:48 - 15 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:50:48 - 4 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485
2024-02-24T13:50:48 - 13 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485

Same as the step 10K, it printed for 14 times:

024-02-25T04:56:11 - 1 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 4 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 15 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 10 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 9 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 2 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 5 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 11 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 3 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 6 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 13 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 0 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 8 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 14 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 12 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:11 - 7 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000
2024-02-25T04:56:15 - 12 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T04:56:15 - 14 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T04:56:15 - 10 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T04:56:15 - 3 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T04:56:15 - 5 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T04:56:15 - 9 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T04:56:15 - 0 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T05:02:15 - 15 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T05:02:15 - 7 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T05:02:16 - 4 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T05:08:14 - 1 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T05:08:14 - 13 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T05:08:15 - 8 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000
2024-02-25T05:08:15 - 11 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000

dlwh commented 7 months ago

ugh. This is going to be a bear to sort out. Maybe I'm using asyncio wrong?

On Sun, Feb 25, 2024 at 10:53 AM Ivan Zhou @.***> wrote:

After saving a checkpoint, the model training may stuck and could not recover from it until restarting:

There is no error message in the log. The instance did not crash/restart;

The behavior is not deterministic, it does not always happen, but it happens frequently;

Once it stuck, there is increase in the process CPU threads and drop in the non-swap process memory.

After watching this issue for a few times on V3-64 and V3-128, I noticed that the the log Saved checkpoint for step xxx does not print for all nodes, meaning that it might be due to the root cause that checkpoint saving in one or two worker fails (silently) to complete, thus pausing the entire training job. Examples: Example 1

This training job stuck when saving a checkpoint at 3 AM on Feb 18. It stuck until I terminated the job at 4 PM on the same day. (I restarted this job one day later and it is able to continue the training) https://wandb.ai/stanford-mercury/levanter/runs/we9felvkd0elo30217/ Screenshot.2024-02-25.at.10.45.07.AM.png (view on web) https://github.com/stanford-crfm/levanter/assets/8097756/4116622e-7902-4f94-8703-287d1f5ab57e Screenshot.2024-02-25.at.10.45.20.AM.png (view on web) https://github.com/stanford-crfm/levanter/assets/8097756/969c2be6-f36c-4bfc-8b19-c4d452e7a17b Example 2

This training job has seen this issue for multiple times. Every time, I terminated and restarted the job, and it was able to resume from the saved checkpoint.

https://wandb.ai/stanford-mercury/levanter/runs/we9felvkd0elo30212

At step 643, it printed Saved checkpoint for 7 times, but it is expected to print 8 times (I was using v3-64) image.png (view on web) https://github.com/stanford-crfm/levanter/assets/8097756/31801dbb-0a65-4de9-a905-147c36d8778e

At step 9485, it printed for 13 times. I was using v3-128, so it is expected to print for 16 times.

24-02-24T13:38:44 - 14 - levanter.checkpoint - checkpoint.py:221 - INFO :: Saving checkpoint at step 9485 to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 2024-02-24T13:38:44 - 3 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 1 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 8 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 7 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 2 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 11 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 12 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 4 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 13 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 9 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 15 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 10 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 5 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 0 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 6 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:44 - 14 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-9485 for step 9485 2024-02-24T13:38:48 - 5 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:38:48 - 3 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:38:48 - 10 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:38:49 - 0 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:44:47 - 14 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:44:48 - 11 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:44:48 - 6 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:44:49 - 9 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:50:48 - 12 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:50:48 - 7 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:50:48 - 15 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:50:48 - 4 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485 2024-02-24T13:50:48 - 13 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 9485

Same as the step 10K, it printed for 14 times:

024-02-25T04:56:11 - 1 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 4 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 15 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 10 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 9 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 2 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 5 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 11 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 3 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 6 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 13 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 0 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 8 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 14 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 12 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:11 - 7 - levanter.checkpoint - checkpoint.py:246 - INFO :: Saving checkpoint to gs://levanter-checkpoints/llama2/background-mixture-all-data-llama2-3b-rpv1-run_0212/we9felvkd0elo30212/step-10000 for step 10000 2024-02-25T04:56:15 - 12 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T04:56:15 - 14 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T04:56:15 - 10 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T04:56:15 - 3 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T04:56:15 - 5 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T04:56:15 - 9 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T04:56:15 - 0 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T05:02:15 - 15 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T05:02:15 - 7 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T05:02:16 - 4 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T05:08:14 - 1 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T05:08:14 - 13 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T05:08:15 - 8 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000 2024-02-25T05:08:15 - 11 - levanter.checkpoint - checkpoint.py:258 - INFO :: Saved checkpoint for step 10000

— Reply to this email directly, view it on GitHub https://github.com/stanford-crfm/levanter/issues/486, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAACLILSUKKLHTVC35JEQNTYVOCCRAVCNFSM6AAAAABDZCOLMKVHI2DSMVQWIX3LMV43ASLTON2WKOZSGE2TEOBZGM3DENQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

Ivan-Zhou commented 7 months ago

More observation on this issue:

As I move from V3-64 to V3-128, this issue happens almost deterministically. This is likely correlated with more workers in the cluster.
After getting stuck and restarting the job, I consistently see a Ray and RPC error on 2 or more workers, as shown below. I have to restart the cluster in order to resume training (which always works).

4-02-29T17:06:36 - 9 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 7 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 6 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 10 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 15 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 5 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 13 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 2 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 11 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 14 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 12 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 4 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 1 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
2024-02-29T17:06:36 - 0 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter
[2024-02-29 17:06:44,539 E 68977 69929] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.142.0.36:40949: Failed to connect to remote host: Connection refused; RPC Error details:  .Please see `raylet.out` for more details.
[2024-02-29 17:06:44,573 E 66434 67383] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.142.0.97:40949: Failed to connect to remote host: Connection refused; RPC Error details:  .Please see `raylet.out` for more details.
##### Command execution on worker 13 failed with exit status 1. Continuing.
##### Command execution on worker 7 failed with exit status 1. Continuing.

dlwh commented 7 months ago

Ok try disabling ray then

On Thu, Feb 29, 2024 at 9:33 AM Ivan Zhou @.***> wrote:

More observation on this issue:

As I move from V3-64 to V3-128, this issue happens almost deterministically. This is likely correlated with more workers in the cluster.

After getting stuck and restarting the job, I consistently see a Ray and RPC error on 2 or more workers, as shown below. I have to restart the cluster in order to resume training (which always works).

4-02-29T17:06:36 - 9 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 7 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 6 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 10 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 15 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 5 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 13 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 2 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 11 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 14 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 12 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 4 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 1 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter 2024-02-29T17:06:36 - 0 - levanter.tracker.wandb - wandb.py:189 - INFO :: Setting wandb code_dir to /home/ivan/levanter [2024-02-29 17:06:44,539 E 68977 69929] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.142.0.36:40949: Failed to connect to remote host: Connection refused; RPC Error details: .Please see raylet.out for more details. [2024-02-29 17:06:44,573 E 66434 67383] core_worker_process.cc:216: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.142.0.97:40949: Failed to connect to remote host: Connection refused; RPC Error details: .Please see raylet.out for more details.

Command execution on worker 13 failed with exit status 1. Continuing.

Command execution on worker 7 failed with exit status 1. Continuing.

— Reply to this email directly, view it on GitHub https://github.com/stanford-crfm/levanter/issues/486#issuecomment-1971627589, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAACLINHS23DTU3SL3KDOFDYV5S57AVCNFSM6AAAAABDZCOLMKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNZRGYZDONJYHE . You are receiving this because you commented.Message ID: @.***>

Ivan-Zhou commented 7 months ago

Here is the script to reproduce the issue. You can ssh to the v3-128, kill the running job, and debug there:

TYPE="v3-128"
NAME="background-run"
ZONE="us-east1-d"
GS_BUCKET="gs://levanter-checkpoints"
BRANCH="main"
CONFIG_PATH="levanter/config/llama2_3b_pretrain.yaml"
export RUN_ID="xxxx"
WANDB_API_KEY="bexxxx"
HF_HUB_TOKEN="hf_xxxx"
WANDB_RUN_ID=$RUN_ID
WANDB_RUN_NAME="debugxxxx"
CMD="WANDB_API_KEY=$WANDB_API_KEY \
    HUGGING_FACE_HUB_TOKEN=$HF_HUB_TOKEN \
    RUN_ID=$RUN_ID \
    levanter/infra/run.sh python3 levanter/src/levanter/main/train_lm.py \
    --config_path $CONFIG_PATH \
    --trainer.wandb.resume true \
    --trainer.wandb.id $WANDB_RUN_ID \
    --trainer.wandb.name $WANDB_RUN_NAME \
    --trainer.max_eval_batches 1 \
    --trainer.load_checkpoint_path $GS_BUCKET/llama2/$WANDB_RUN_NAME/$RUN_ID/  \
    --trainer.checkpointer.base_path $GS_BUCKET/llama2/$WANDB_RUN_NAME \
    --trainer.checkpointer.save_interval 30m"

# echo $CMD
# babysit
infra/babysit-tpu-vm.sh $NAME -b $BRANCH -z $ZONE -t $TYPE -- $CMD

# launch directly
# gcloud compute tpus tpu-vm ssh $NAME --zone $ZONE --worker=all --command "$CMD"

dlwh commented 7 months ago

so close to fixing this.

peregilk commented 5 months ago

I am seeing the same when training on a v5e. Was this fixed? @dlwh

dlwh commented 5 months ago

We recently ran into some issues again. Can you try setting these env variables:

TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=60 TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=256

Also make sure the GCS bucket is in the same region

peregilk commented 5 months ago

Works perfectly for me. Thanks.

stanford-crfm / levanter