wandb / wandb

The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
https://wandb.ai
MIT License

[CLI]: ERROR Dropped streaming file chunk (see wandb/debug-internal.log) #5948


ankur-gupta commented 1 year ago

Describe the bug

I ran a long-running training job for 201 epochs. Training completed all 201 epochs and then showed the error below, and I also got an email saying that my run had failed. The web UI only shows data through epoch 117, even though training ran for all 201 epochs and only failed because of wandb. I believe I lost the data from roughly epochs 118 to 201.

While the number of epochs is higher than usual, I would still like to be able to run this many epochs. This training took a day to finish, so this is a serious issue. Please let me know whether I made a mistake somewhere or whether this is a CLI issue; I need to decide whether to continue using wandb or go back to logging things to disk manually.

I am also concerned that the error only surfaced at the end of training instead of alerting me around the epoch 117 mark, when the web UI stopped updating.

Epoch 201/201: 100%|█████| 810/810 [03:25<00:00,  3.94it/s], masked: loss=0.8966, accuracy_frac=0.7707 | unmasked: loss=18.4359, accuracy_frac=0.2010
validation: 100%|██████████| 19/19 [00:02<00:00,  7.83it/s], masked: loss=2.8257, accuracy_frac=0.5700 | unmasked: loss=18.8670, accuracy_frac=0.1540
wandb: Waiting for W&B process to finish... (success).
wandb: Network error (TransientError), entering retry loop.
wandb: ERROR Dropped streaming file chunk (see wandb/debug-internal.log)

# I had to press Control+C because it was stuck. 
^Cwandb: ERROR Control-C detected -- Run data was not synced

The file wandb/debug-internal.log is a symlink to the actual log file, which is ~32 MB in size. About 70% of the log consists of the same error: OSError: [Errno 24] Too many open files. Here is a summarized snippet of the log file.

2023-07-24 01:05:47,087 INFO    StreamThr :8980 [internal.py:wandb_internal():86] W&B internal server running at pid: 8980, started at: 2023-07-24 01:05:47.086829
2023-07-24 01:05:47,087 DEBUG   HandlerThread:8980 [handler.py:handle_request():144] handle_request: status
2023-07-24 01:05:47,089 INFO    WriterThread:8980 [datastore.py:open_for_write():85] open: ./.wb/wandb/run-20230724_010547-9ds95wx7/run-9ds95wx7.wandb
2023-07-24 01:05:47,089 DEBUG   SenderThread:8980 [sender.py:send():369] send: header
2023-07-24 01:05:47,103 DEBUG   SenderThread:8980 [sender.py:send():369] send: run
...
2023-07-24 08:12:29,657 DEBUG   SenderThread:8980 [sender.py:send_request():396] send_request: summary_record
2023-07-24 08:12:29,658 INFO    SenderThread:8980 [sender.py:_save_file():1354] saving file wandb-summary.json with policy end
2023-07-24 08:12:29,742 INFO    Thread-12 :8980 [dir_watcher.py:_on_file_modified():289] file/dir modified: ./.wb/wandb/run-20230724_010547-9ds95wx7/files/wandb-summary.json
2023-07-24 08:12:31,377 DEBUG   HandlerThread:8980 [handler.py:handle_request():144] handle_request: partial_history
2023-07-24 08:12:31,377 DEBUG   SenderThread:8980 [sender.py:send():369] send: history
2023-07-24 08:12:31,377 DEBUG   SenderThread:8980 [sender.py:send_request():396] send_request: summary_record
2023-07-24 08:12:31,377 INFO    SenderThread:8980 [sender.py:_save_file():1354] saving file wandb-summary.json with policy end
2023-07-24 08:12:31,662 DEBUG   HandlerThread:8980 [handler.py:handle_request():144] handle_request: log_artifact
2023-07-24 08:12:31,662 DEBUG   SenderThread:8980 [sender.py:send_request():396] send_request: log_artifact
2023-07-24 08:12:31,753 INFO    Thread-12 :8980 [dir_watcher.py:_on_file_modified():289] file/dir modified: ./.wb/wandb/run-20230724_010547-9ds95wx7/files/wandb-summary.json
2023-07-24 08:12:33,501 DEBUG   HandlerThread:8980 [handler.py:handle_request():144] handle_request: keepalive
2023-07-24 08:12:33,764 INFO    Thread-12 :8980 [dir_watcher.py:_on_file_modified():289] file/dir modified: ./.wb/wandb/run-20230724_010547-9ds95wx7/files/output.log
2023-07-24 08:12:35,019 ERROR   gpuapple  :8980 [gpu_apple.py:sample():81] GPU stats error: [Errno 24] Too many open files
Traceback (most recent call last):
  File "/Users/username/.virtualenvs/pytorch/lib/python3.11/site-packages/wandb/sdk/internal/system/assets/gpu_apple.py", line 60, in sample
    subprocess.check_output(command, universal_newlines=True)
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1849, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
                                  ^^^^^^^^^
OSError: [Errno 24] Too many open files
2023-07-24 08:12:37,032 ERROR   gpuapple  :8980 [gpu_apple.py:sample():81] GPU stats error: [Errno 24] Too many open files
Traceback (most recent call last):
  File "/Users/username/.virtualenvs/pytorch/lib/python3.11/site-packages/wandb/sdk/internal/system/assets/gpu_apple.py", line 60, in sample
    subprocess.check_output(command, universal_newlines=True)
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1849, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
                                  ^^^^^^^^^
OSError: [Errno 24] Too many open files
2023-07-24 08:12:38,551 DEBUG   HandlerThread:8980 [handler.py:handle_request():144] handle_request: keepalive
2023-07-24 08:12:39,038 ERROR   gpuapple  :8980 [gpu_apple.py:sample():81] GPU stats error: [Errno 24] Too many open files
Traceback (most recent call last):
  File "/Users/username/.virtualenvs/pytorch/lib/python3.11/site-packages/wandb/sdk/internal/system/assets/gpu_apple.py", line 60, in sample
    subprocess.check_output(command, universal_newlines=True)
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1849, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
                                  ^^^^^^^^^
OSError: [Errno 24] Too many open files
2023-07-24 08:12:40,379 DEBUG   HandlerThread:8980 [handler.py:handle_request():144] handle_request: log_artifact
2023-07-24 08:12:40,395 DEBUG   HandlerThread:8980 [handler.py:handle_request():144] handle_request: partial_history
2023-07-24 08:12:41,049 ERROR   gpuapple  :8980 [gpu_apple.py:sample():81] GPU stats error: [Errno 24] Too many open files
Traceback (most recent call last):
  File "/Users/username/.virtualenvs/pytorch/lib/python3.11/site-packages/wandb/sdk/internal/system/assets/gpu_apple.py", line 60, in sample
    subprocess.check_output(command, universal_newlines=True)
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/subprocess.py", line 1849, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
                                  ^^^^^^^^^
OSError: [Errno 24] Too many open files
...
(same error again 20888 times total)
...

Here is the error count:

$ grep "OSError: \[Errno 24\] Too many open files" debug-internal.log | wc -l
   20888
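
(In case it is relevant: the per-process open-file limit can be checked as below. On macOS the default soft limit is quite low; I am not including my exact numbers here.)

import resource

# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)  # the macOS default soft limit is often 256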

I already tried wandb sync, but it didn't sync any more data to the web UI. Is my data lost?

$ wandb sync
wandb: No runs to be synced.
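
I am not sure whether passing the run directory explicitly would behave any differently; for reference, the directory from the log above would be:

$ wandb sync ./.wb/wandb/run-20230724_010547-9ds95wx7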

Additional Files

No response

Environment

WandB version: 0.15.5

OS: macOS Ventura 13.4.1 (c); Machine: Apple MacBook Pro 14-inch (2023), Apple M2 Max, 32 GB

Python version: 3.11.4

Versions of relevant libraries: ipython 8.13.2, torch 2.0.1, torchtext 0.15.2, transformers 4.30.2, datasets 2.12.0, numpy 1.24.3, pandas 2.0.1

Additional Context

There are similar open issues, but their underlying causes were different, so they don't seem to help in my case.

rsanandres-wandb commented 1 year ago

Hello @ankur-gupta !

Thank you for sending all the relevant information. Could you send us the full debug logs and a link to the run for a deeper investigation?

What may have been happening is that the process was taking a long time to .finish() and that the run data was in fact all there at the end (although it may have taken a while to upload). Depending on the resources available during the run, wandb may defer the upload to a later time in order not to interfere with training speed.

ankur-gupta commented 1 year ago

Hi @rsanandres-wandb, thank you for your comment. Is there a way to send you the full debug logs (~32MB) and the link to the project without making them public? Also, I have now reproduced the exact same error with another run of the same code.

In general, here is what I was uploading:

  1. Train & Validation metrics (loss, accuracy) every epoch
  2. Trained model every epoch (~18MB/model checkpoint)
  3. wandb.Table containing 4 columns and 6 * current_epoch_number rows every epoch.

I suspect that the wandb.Table might be the issue. I have a tiny custom validation dataset of only 6 examples that I use as a smoke test after every epoch. I want to monitor the improvement in model performance on these 6 examples as training progresses; I don't want to wait until the end of all 201 epochs to see the model's predictions on them.

However, wandb does not allow me to append rows to a table after I have executed run.log({'check/preds': table}), where table is a wandb.Table object. So I create a new table = wandb.Table(columns=['epoch', 'input', 'true_output', 'pred_output']) object for every epoch, roughly as sketched below.
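
Roughly, the per-epoch logging looks like this (a simplified, self-contained sketch with dummy values; the project name and table contents are placeholders, not my exact code):

import wandb

run = wandb.init(project="my-project")  # placeholder project name

columns = ["epoch", "input", "true_output", "pred_output"]
for epoch in range(1, 202):  # 201 epochs
    # ... training and validation for this epoch ...
    run.log({"train/loss": 0.0, "val/loss": 0.0}, step=epoch)  # dummy values

    # A new table is created every epoch, since rows cannot be appended to a
    # table that has already been logged.
    table = wandb.Table(columns=columns)
    for i in range(6):  # the 6 smoke-test examples
        table.add_data(epoch, f"input {i}", f"target {i}", f"prediction {i}")
    run.log({"check/preds": table})

run.finish()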

I got this workaround from one of the related issues.

ankur-gupta commented 1 year ago

In case someone else encounters the same error: I have verified that this error is caused by logging the wandb.Table every epoch. Once I commented out that code, the error went away.
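
One possible alternative, if someone still wants the per-epoch predictions (just a sketch, I have not benchmarked it): accumulate the rows in a plain Python list and only build and log the wandb.Table once at the end of training. Dummy values again:

import wandb

run = wandb.init(project="my-project")  # placeholder project name

rows = []
for epoch in range(1, 202):
    # ... training, validation, and metric logging as before ...
    for i in range(6):  # the 6 smoke-test examples
        rows.append([epoch, f"input {i}", f"target {i}", f"prediction {i}"])

# Build and log the table once, instead of creating a new Table every epoch.
table = wandb.Table(columns=["epoch", "input", "true_output", "pred_output"], data=rows)
run.log({"check/preds": table})
run.finish()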

Lily-Le commented 4 months ago

To avoid the network problem, I ran the run offline and then uploaded the offline run files. Why does this problem still occur? I did not log many files such as Tables. Is there any solution? Thanks!