ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Apache License 2.0

log is changed in the new version of pytorch lightning #170

Closed JiahaoYao closed 2 years ago

JiahaoYao commented 2 years ago

In pytorch-lightning 1.5, running this gives:

/home/ray/anaconda3/envs/wrong/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py:1580: UserWarning: GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`.
  "GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`."

  | Name | Type   | Params
--------------------------------
0 | lin1 | Linear | 24    
1 | lin2 | Linear | 9     
--------------------------------
33        Trainable params
0         Non-trainable params
33        Total params
0.000     Total estimated model params size (MB)
/home/ray/anaconda3/envs/wrong/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py:408: UserWarning: The number of training samples (10) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  f"The number of training samples ({self.num_training_batches}) is smaller than the logging interval"
{'val_loss_step': 1.9480565786361694, 'val_bar_step': 5.677999973297119, 'val_loss_epoch': 1.9022094011306763, 'val_bar_epoch': 5.677999973297119, 'avg_val_loss': 1.9022094011306763, 'val_foo': 1.2339999675750732}

But in the new version, the output no longer contains the validation metrics:

/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1933: PossibleUserWarning: The number of training batches (10) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
loss tensor(0.6672, grad_fn=<BinaryCrossEntropyBackward0>)
[W reducer.cpp:1251] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
loss tensor(0.6713, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.5819, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.5569, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.5124, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.4639, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.4529, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.3879, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.4012, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.3257, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.3559, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.2741, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.3155, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.2305, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.2790, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.1930, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.2459, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.1607, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.2157, grad_fn=<BinaryCrossEntropyBackward0>)
loss tensor(0.1330, grad_fn=<BinaryCrossEntropyBackward0>)
{'loss': tensor(0.1330), 'val_foo': tensor(1.2340)}

i.e., no validation is run there.
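
A quick way to tell whether the validation loop still runs at all in the new version (as opposed to the metrics merely no longer appearing in trainer.logged_metrics) would be a small callback that prints a marker at the end of every validation epoch. This is a diagnostic sketch, not part of the original report:

from pytorch_lightning.callbacks import Callback


class ValidationMarker(Callback):
    """Prints a marker every time a validation epoch finishes."""

    def on_validation_epoch_end(self, trainer, pl_module):
        # callback_metrics carries the epoch-level metrics Lightning has
        # accumulated, independently of trainer.logged_metrics
        print("validation epoch finished:", dict(trainer.callback_metrics))


# e.g. pass callbacks=[ValidationMarker()] to the Trainer built by get_trainer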

Reproduction script:

import torch

# Note: XORModel, XORDataModule and get_trainer are helpers from the
# ray_lightning test utilities (not shown here); RayStrategy is the Ray
# strategy that was swapped out for plain 'ddp' while debugging.


def test_metrics():
    """Tests that metrics are returned correctly."""
    tmpdir = '/home/ray/default/ray_lightning/ray_lightning/tests'
    model = XORModel()
    # strategy = RayStrategy(num_workers=1, find_unused_parameters=False)
    trainer = get_trainer(
        tmpdir,
        strategy='ddp',
        max_epochs=10,
        num_sanity_val_steps=10,
        reload_dataloaders_every_n_epochs=1)
    dataset = XORDataModule()
    trainer.fit(model, dataset)
    callback_metrics = trainer.callback_metrics
    logged_metrics = trainer.logged_metrics
    print(logged_metrics)
    assert callback_metrics["avg_val_loss"] == logged_metrics["avg_val_loss"]
    assert logged_metrics["val_foo"] == torch.tensor(1.234)
    assert callback_metrics["val_foo"] == torch.tensor(1.234)
    # forked names are used for on_step logged metrics
    forked_name_loss = "val_loss" + "_step"
    forked_name_bar = "val_bar" + "_step"
    assert forked_name_loss in logged_metrics.keys()
    assert logged_metrics[forked_name_bar] == torch.tensor(5.678)
    # callback_metrics doesn't record on_step metrics
    assert forked_name_loss not in callback_metrics.keys()
    assert forked_name_bar not in callback_metrics.keys()


test_metrics()
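
Since XORModel and XORDataModule are not shown above, here is a rough, self-contained approximation of the logging the assertions rely on. The metric names, the constants 1.234 / 5.678, and the PL-1.5-style validation_epoch_end hook are inferred from the asserted keys and the printed metrics dict, not copied from the actual test utilities:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class XORModel(pl.LightningModule):
    """Tiny XOR classifier logging the metrics asserted above (sketch)."""

    def __init__(self):
        super().__init__()
        # Linear(2, 8) / Linear(8, 1) match the 24 / 9 parameter counts
        # in the model summary above
        self.lin1 = torch.nn.Linear(2, 8)
        self.lin2 = torch.nn.Linear(8, 1)

    def forward(self, x):
        return torch.sigmoid(self.lin2(torch.relu(self.lin1(x))))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.binary_cross_entropy(self(x), y)
        print("loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.binary_cross_entropy(self(x), y)
        # on_step + on_epoch produces the forked "val_loss_step" /
        # "val_loss_epoch" (and "val_bar_step" / "val_bar_epoch") keys
        self.log("val_loss", loss, on_step=True, on_epoch=True)
        self.log("val_bar", 5.678, on_step=True, on_epoch=True)
        return loss

    def validation_epoch_end(self, outputs):
        # epoch-level metrics checked by the assertions
        self.log("avg_val_loss", torch.stack(outputs).mean())
        self.log("val_foo", 1.234)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


class XORDataModule(pl.LightningDataModule):
    """XOR truth table served for both training and validation (sketch)."""

    def _loader(self):
        x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
        y = torch.tensor([[0.], [1.], [1.], [0.]])
        return DataLoader(TensorDataset(x, y), batch_size=2)

    def train_dataloader(self):
        return self._loader()

    def val_dataloader(self):
        return self._loader()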
JiahaoYao commented 2 years ago

Related: https://github.com/Lightning-AI/lightning/issues/13188