nv-tlabs / NKSR

[CVPR 2023 Highlight] Neural Kernel Surface Reconstruction
https://research.nvidia.com/labs/toronto-ai/NKSR

wandb: Waiting for W&B process to finish... (failed -1) #47

Open jmulsy opened 11 months ago

jmulsy commented 11 months ago

Hello, and thank you for your outstanding work. I have run into a problem and need your help. Whenever I train the model with Wandb, whether online or offline, the run always ends with this message: wandb: Waiting for W&B process to finish... (failed -1). I don't know what causes it. Could you give some suggestions?

heiwang1997 commented 11 months ago

Thanks for your interest in our work. Could you please paste a more complete log of the code run?

jmulsy commented 11 months ago

Thank you for your reply. Here are the details:

First, to use Wandb offline, I wrote this code in train.py:

image-20230929192209125
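The screenshot above is not reproduced in this text, so the exact change made to train.py is unknown. As a minimal sketch only, and not necessarily what the poster wrote, one common way to force offline mode is to set the WANDB_MODE environment variable before the run is created. The helper name force_wandb_offline is hypothetical.

```python
import os


def force_wandb_offline() -> None:
    """Request offline logging by setting WANDB_MODE before wandb.init() runs."""
    # wandb reads WANDB_MODE when a run is initialised; "offline" keeps the
    # run data on local disk so it can be uploaded later with `wandb sync`.
    os.environ["WANDB_MODE"] = "offline"


# Call this near the top of the training script, before any run is created.
force_wandb_offline()

# Alternative, assuming the script calls wandb.init() directly (an assumption
# about train.py, which may create the run through a logger wrapper instead):
#   wandb.init(mode="offline")
```

Either approach should produce the "W&B syncing is set to `offline`" notice that appears in the log below.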

When I ran the following command to start training:

python train.py configs/shapenet/train_3k_noise.yaml

the terminal showed:

09-29 18:58:24 (train.py:72) [INFO] Intelligent GPU selection: 0 
wandb: WARNING `resume` will be ignored since W&B syncing is set to `offline`. Starting a new run with run id 6zddnqd7.
wandb: Tracking run with wandb version 0.15.10
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
Global seed set to 0
/home/lisy/anaconda3/envs/nksr/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:447: LightningDeprecationWarning: Setting `Trainer(gpus=1)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=1)` instead.
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
 >>>> ======= MODEL HYPER-PARAMETERS ======= <<<< 
exec: null
include: null
visualize: false
test_set_shuffle: false
...
...
...
...
...
...
  random_seed: fixed
_shapenet_transforms:
- name: PointcloudNoise
  args:
    stddev: 0.005
- name: SubsamplePointcloud
  args:
    'N': 3000

 >>>> ====================================== <<<< 
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
09-29 18:58:38 (train.py:316) [INFO] 

Wandb Run of nkfw-shapenet/6zddnqd7 (with name noise_3k/0929-big-vehicle) marked to be cleared.

wandb: Waiting for W&B process to finish... (failed -1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/lisy/NKSR/wandb/offline-run-20230929_185825-6zddnqd7
wandb: Find logs at: ./wandb/offline-run-20230929_185825-6zddnqd7/logs

Then I ran 'wandb sync', and the terminal showed:

Find logs at: /home/lisy/NKSR/wandb/debug-cli.lisy.log
Syncing: https://wandb.ai/lisy0408/nkfw-shapenet/runs/6zddnqd7 ... done.

Finally, I opened the Wandb link, and the result was as follows:

image-20230929191018653

When I disable offline mode, the same failure appears:

- name: PointcloudNoise
  args:
    stddev: 0.005
- name: SubsamplePointcloud
  args:
    'N': 3000

 >>>> ====================================== <<<< 
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
09-29 19:12:25 (train.py:316) [INFO] 

Wandb Run of lisy0408/nkfw-shapenet/mud4kkdq (with name noise_3k/0929-grave-angle) marked to be cleared.

wandb: Waiting for W&B process to finish... (failed -1). Press Control-C to abort syncing.
wandb: 🚀 View run noise_3k/0929-grave-angle at: https://wandb.ai/lisy0408/nkfw-shapenet/runs/mud4kkdq
wandb: ️⚡ View job at https://wandb.ai/lisy0408/nkfw-shapenet/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEwMDY5MTM5OA==/version_details/v3
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230929_191209-mud4kkdq/logs
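Each run above ends with a "Find logs at" line pointing to a logs directory inside the run folder. The following is a minimal sketch of how the tail of every log file in that directory could be collected for pasting into a more complete report. The function name tail_wandb_logs and the default of 40 lines are hypothetical choices, and the sketch assumes the directory contains plain-text files ending in .log.

```python
from pathlib import Path


def tail_wandb_logs(run_dir, n_lines=40):
    """Return the last n_lines of every *.log file under <run_dir>/logs."""
    tails = {}
    log_dir = Path(run_dir) / "logs"
    if not log_dir.is_dir():
        # Nothing to read, for example when the run directory path is wrong.
        return tails
    for log_file in sorted(log_dir.glob("*.log")):
        # errors="replace" avoids a crash on any undecodable bytes.
        lines = log_file.read_text(errors="replace").splitlines()
        tails[log_file.name] = lines[-n_lines:]
    return tails


# Example usage with the run directory printed in the log above:
#   for name, lines in tail_wandb_logs("./wandb/run-20230929_191209-mud4kkdq").items():
#       print(f"==== {name} ====")
#       print("\n".join(lines))
```

The output of such a helper, together with the full terminal output of train.py, is the kind of information that usually helps narrow down why a run is marked as failed.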