ankur-gupta opened this issue 1 year ago (status: Open)
Hello @ankur-gupta!
Thank you for sending all the relevant information. Could you send us the `debug-internal.log` and the `debug.log` for the run so we can investigate more deeply? I want to look into whether there are other reasons why the run hung while `wandb` was trying to upload. What may be happening is that the process was taking a long time to `.finish()` and that the run did in fact reach the end (albeit that the upload took a while). Depending on the resources available during the run, `wandb` may defer the upload to a later time so as not to interfere with training speed.
Hi @rsanandres-wandb, thank you for your comment. Is there a way to send you the full debug logs (~32MB) and the link to the project without it being public? Also, I've now reproduced the exact same error with another run of the same code.
In general, here is what I was uploading: a `wandb.Table` containing 4 columns and `6 * current_epoch_number` rows every epoch. I feel that the `wandb.Table` might be the issue. I have a tiny, custom validation dataset of only 6 examples which I use as a smoke test after every epoch. I want to monitor the improvement in model performance over these 6 examples after every epoch; I don't want to wait until the end of all 201 epochs to see the model predictions on these 6 examples.
However, `wandb` does not allow me to append rows to the table after I execute `run.log({'check/preds': table})`, where `table` is a `wandb.Table` object. So, I create a new `table = wandb.Table(columns=['epoch', 'input', 'true_output', 'pred_output'])` object for every epoch.
I got this solution from one of these issues:
In case someone else encounters the same error: I have verified that this error is caused by logging the `wandb.Table` every epoch. Once I commented out that code, the error went away.
To avoid the network problem, I ran the run offline and then uploaded the offline run files, but the problem still exists. Why? I didn't log many files or Tables. Is there any solution? Thanks!
Describe the bug
I ran a long-running training job for 201 epochs. Training completed all 201 epochs and then showed me the error below. I also got an email saying that my run had failed. The web UI only shows data up to epoch 117 even though training ran for 201 epochs and only failed because of `wandb`. I think I lost the data from roughly epoch 118 to 201.
While the number of epochs is higher than usual, I would still like to be able to run this many epochs. This training took me a day to finish, so this is a serious issue. Please let me know if there is a mistake in something I did or if this is a CLI issue. I need to decide whether to continue using `wandb` or start logging things manually to disk again. I am also concerned that the error only surfaced at the end of training, instead of alerting me at the epoch-117 mark when the web UI stopped updating.
This was in `wandb/debug-internal.log`, which is a symlink to the actual log file (~32MB in size). About 70% of this log file is the same error: `OSError: [Errno 24] Too many open files`. Here is a summarized snippet of this log file. Here is the error count:
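For reference, `Errno 24` means the process hit its soft open-file-descriptor limit. On POSIX systems (including macOS) the limit can be inspected, and raised toward the hard limit as a workaround, from Python's standard library. This is only a diagnostic sketch; it does not fix whatever is leaking descriptors:

```python
import resource

# Query the current per-process file-descriptor limits (POSIX only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Workaround: raise the soft limit toward the hard limit for this process.
# If the hard limit is unlimited, pick an arbitrary larger target instead.
target = hard if hard != resource.RLIM_INFINITY else max(soft, 4096)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"new soft limit: {new_soft}")
```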
I already tried `wandb sync` but it didn't sync any more data to the web UI. Is my data lost?
Additional Files
No response
Environment
WandB version: 0.15.5
OS: Apple MacBook Pro 14-inch (2023), Apple M2 Max, 32 GB, macOS Ventura 13.4.1 (c)
Python version: 3.11.4
Versions of relevant libraries: ipython 8.13.2, torch 2.0.1, torchtext 0.15.2, transformers 4.30.2, datasets 2.12.0, numpy 1.24.3, pandas 2.0.1
Additional Context
There are similar open issues, but their underlying causes were different, so they don't seem to help me.