Open simonsays1980 opened 8 months ago
@simonsays1980 Does this error get logged very consistently throughout the run? I'm thinking this might be a race condition of wandb deleting some file while that file is trying to be synced to cloud storage.
Does the run eventually succeed? The syncing should get retried throughout the run.
Hi @justinvyu, thanks for taking a look. It does get logged consistently during the run. I later had to stop because many `_QueueActor`s and `WandBActor`s started and stopped again; it looked like a mess. Regrettably, I have no worker logs stored. I think it should be reproducible given the cluster YAML and the run script. The runs did not succeed because of these stopping and restarting actions: they took up the memory (156GB) and left no space for the runs, so I got an OOM.
I then ran another experiment with double the resources (24 -> 48 CPUs, 156GB -> 312GB, 2 NVIDIA T4 GPUs) and turned off `WandB`. This run did not succeed either: after ~100 iterations the GPUs were no longer detected (see #43866).
What happened + What you expected to happen
What happened
I ran a hyperparameter search with `PB2` using `tune` and `rllib`. I logged results with `wandb` using the `WandbLoggerCallback`. The experiment was run on GCP using Ray's autoscaler with the `2.10.0.d8b3d6-py39-gpu` image. While `PB2` worked seamlessly, the `wandb` logging errored out quite early in the experiment (iteration 14 of 500) with the following error:
Could this be related to `PB2` pausing and restarting runs? At least in `wandb` I see only 14 iterations in the runs. Is there maybe a config parameter for `wandb` that should be used when using a hyperparameter search scheduler in Ray (something like `resume="must"`)?
The file that is not found is actually there:
Is there maybe a very slow write process, such that the `wandb` read process does not have the file available yet when trying to read it?
What you expected to happen
That `wandb` logs each of the 10 runs with its iterations in sequence (from 1 to 500) and does not error out. Basically, that all information needed for logging is available to `wandb` when using a scheduler in Ray `tune`, and that the user does not have to figure out whether a different `wandb` configuration needs to be used with a scheduler.
Versions / Dependencies
Ubuntu 20.04, Python 3.9, Ray image `2.10.0.d8b3d6-py39-gpu`
Reproduction script
Here is the cluster YAML used:
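The run script itself is not shown above; as a stand-in, here is a minimal, untested sketch of the setup described under "What happened" (PB2 over an RLlib PPO trainable, logged to W&B via `WandbLoggerCallback`). The environment, hyperparameter bounds, and project name are placeholders, not the original values:

```python
# Minimal, untested sketch (not the original script): PB2 scheduling a PPO
# trainable with tune, logging to W&B through WandbLoggerCallback.
from ray import tune
from ray.air.integrations.wandb import WandbLoggerCallback
from ray.train import RunConfig
from ray.tune.schedulers.pb2 import PB2

pb2 = PB2(
    time_attr="training_iteration",
    perturbation_interval=5,                     # placeholder interval
    hyperparam_bounds={"lr": [1e-5, 1e-3]},      # placeholder bounds
)

tuner = tune.Tuner(
    "PPO",
    param_space={"env": "CartPole-v1", "lr": tune.uniform(1e-5, 1e-3)},
    tune_config=tune.TuneConfig(
        scheduler=pb2,
        metric="episode_reward_mean",
        mode="max",
        num_samples=10,
    ),
    run_config=RunConfig(
        stop={"training_iteration": 500},
        callbacks=[WandbLoggerCallback(project="pb2-rllib-search")],
    ),
)
tuner.fit()
```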
Issue Severity
Medium: It is a significant difficulty but I can work around it.