longyunf / radiant


Training PGD but always showing Loss=nan #3

Closed · stanny880913 closed 1 year ago

stanny880913 commented 1 year ago

I'm running `CUDA_VISIBLE_DEVICES=0,1,2,3 python scripts/train_dwn.py --workers_per_gpu 2 --samples_per_gpu 256 --num_gpus 4 --epochs 200 --dir_data data/nuscenes/fusion_data/dwn_radiant_pgd`. When training starts, the first loss shows 1.xxx, but from the second step onward it becomes nan. Why is this happening? Thank you.

[Screenshot from 2023-05-12 14-22-06]

longyunf commented 1 year ago

I did not see this in my experiments. You can try using a single GPU or running the training multiple times to check whether this always happens. I also provide pretrained weights for DWN.

stanny880913 commented 1 year ago

> I did not see this in my experiments. You can try using a single GPU or running the training multiple times to check whether this always happens. I also provide pretrained weights for DWN.

I fixed it!! By using a bigger batch_size; maybe the batch was too small, which made gradient descent unstable. But when I run --do_eval on the val and test data, both show a similar error: `AssertionError: The length of results is not equal to the dataset len: 18024 != 36048`. Why is the dataset count wrong??? The code gets stuck at `prog_bar = mmcv.ProgressBar(len(dataset))`.
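For reference, the usual safeguards against a NaN loss in a plain PyTorch training step look roughly like the sketch below; the `train_step` helper, its arguments, and the clipping threshold are hypothetical, not code from this repo.

```python
import torch


def train_step(model, batch, optimizer, max_grad_norm=10.0):
    # Hypothetical helper, not from RADIANT: catch a non-finite loss
    # before it propagates into the weights.
    optimizer.zero_grad()
    loss = model(batch)  # assumed to return a scalar loss tensor
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss: {loss.item()}")
    loss.backward()
    # Gradient clipping bounds the update size, a common mitigation when
    # small batches make individual gradients too noisy.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```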

longyunf commented 1 year ago

Make sure that the arguments test_samples_per_gpu=1 and num_gpus=1 are set when running evaluation.

stanny880913 commented 1 year ago

> Make sure that the arguments test_samples_per_gpu=1 and num_gpus=1 are set when running evaluation.

```python
parser.add_argument('--num_gpus', type=int, default=1)
parser.add_argument('--samples_per_gpu', type=int, default=1)
parser.add_argument('--test_samples_per_gpu', type=int, default=1)
parser.add_argument('--workers_per_gpu', type=int, default=2)
```

I set these args like this, but it still raises the error `AssertionError: The length of results is not equal to the dataset len: 18024 != 36048`. Sorry, the error occurs in `fusion_dataset.py` at `format_results_all_cams`. How can I solve it!!

longyunf commented 1 year ago

This shows that detections were obtained for only 18024 images, but there are 36048 images in total. You may check the number of images in the dataloader via `len(dataloader.dataset)` in the function `single_gpu_test`.
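A quick way to run that check (a diagnostic sketch; `check_loader` is a made-up name, not a function in the repo):

```python
def check_loader(data_loader):
    # Compare what the evaluator expects with what the test loop
    # will actually produce.
    n_images = len(data_loader.dataset)  # results the evaluator expects
    n_batches = len(data_loader)         # iterations the loop will run
    print("dataset size:", n_images)
    print("num batches :", n_batches)
    print("batch size  :", data_loader.batch_size)
```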

stanny880913 commented 1 year ago

> This shows that detections were obtained for only 18024 images, but there are 36048 images in total. You may check the number of images in the dataloader via `len(dataloader.dataset)` in the function `single_gpu_test`.

I checked it!!! It shows `len(dataloader.dataset) = 36048`, which matches, but the loop only runs 18024 iterations, exactly half. val and test give the same result!

longyunf commented 1 year ago

In the single_gpu_test function, you can further examine whether the length of the list results increases by one after each loop iteration of running the model.

stanny880913 commented 1 year ago

> In the single_gpu_test function, you can further examine whether the length of the list results increases by one after each loop iteration of running the model.

I printed len(result) to test whether the length of the results list increases, but it always shows 1. How do I fix it!!

```python
import mmcv
import torch


def single_gpu_test(model, model_mlp, data_loader):
    """Test model with single gpu.

    Args:
        model (nn.Module): Model to be tested.
        data_loader (nn.Dataloader): Pytorch data loader.

    Returns:
        list[dict]: The prediction results.
    """
    print("Into single gpu test!!!!")
    model.eval()
    model_mlp.eval()
    results = []
    dataset = data_loader.dataset
    # dataset_len = 36114
    # BUG!!
    prog_bar = mmcv.ProgressBar(len(dataset))
    # prog_bar = mmcv.ProgressBar(18057)
    print("progbar = ", prog_bar)
    for i, data in enumerate(data_loader):
        with torch.no_grad():
            result = model(model_mlp=model_mlp,
                           return_loss=False, rescale=True, **data)

        results.extend(result)
        print("test result len :", len(result))
        batch_size = len(result)
        # batch_size = 8
        for _ in range(batch_size):
            prog_bar.update()
    print("results_len = ", len(results))
    return results
```

I added the print statements here!!

longyunf commented 1 year ago

The length of results will increase. result is the detection for a single image, so its length is 1.
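A toy illustration of the difference (not repo code):

```python
results = []
for i in range(3):
    result = [{"img_id": i}]           # detections for one image per batch
    results.extend(result)             # the aggregate list grows by one
    print(len(result), len(results))   # prints "1 1", "1 2", "1 3"
```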

stanny880913 commented 1 year ago

> The length of results will increase. result is the detection for a single image, so its length is 1.

Sorry!! I printed the wrong thing! results is always increasing. What should I do next?! I really don't know why it stops halfway ~

longyunf commented 1 year ago

`assert i == len(results)`

stanny880913 commented 1 year ago

> `assert i == len(results)`

Sorry, may I ask where to put it to fix the problem? Thx

```python
def single_gpu_test(model, model_mlp, data_loader):
    """Test model with single gpu.

    Args:
        model (nn.Module): Model to be tested.
        data_loader (nn.Dataloader): Pytorch data loader.

    Returns:
        list[dict]: The prediction results.
    """
    print("Into single gpu test!!!!")
    model.eval()
    model_mlp.eval()
    results = []
    print("init = ", len(results))
    dataset = data_loader.dataset
    # dataset_len = 36114
    # BUG!!
    prog_bar = mmcv.ProgressBar(len(dataset))
    # prog_bar = mmcv.ProgressBar(18057)
    for i, data in enumerate(data_loader):
        with torch.no_grad():
            result = model(model_mlp=model_mlp,
                           return_loss=False, rescale=True, **data)

        assert i == len(results), (
            'The length of results is not equal to i: {} != {}'.
            format(len(results), i))

        results.extend(result)

        print("\n")
        print("=======")
        print("test result len :", len(results))
        print("=======")
        batch_size = len(result)
        # batch_size = 8
        for _ in range(batch_size):
            prog_bar.update()
    print("results_len = ", len(results))
    return results
```


I put it here, is that right? Thx

longyunf commented 1 year ago

Put it on the line before `results.extend(result)`.

stanny880913 commented 1 year ago

> Put it on the line before `results.extend(result)`.

I added it, and it doesn't raise the assertion. But when i = 18057 the loop ends and jumps into the evaluate function, which then raises `AssertionError: The length of results is not equal to the dataset len: 18057 != 36114` when I run --do_eval on the val dataset. By the way, len(data_loader.dataset) is 36114.

longyunf commented 1 year ago

You may check why the loop terminates when i = 18057, which is unexpected if len(data_loader.dataset) is 36114.
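Since 18057 is exactly half of 36114, a likely culprit is the loader consuming two samples per iteration, or a sampler yielding only half the indices. A diagnostic sketch (`diagnose_loader` is a made-up name, not repo code):

```python
def diagnose_loader(data_loader):
    # If batch_size is 2, or the sampler only covers half the dataset
    # (e.g. a leftover distributed/group sampler from the training
    # config), the loop will stop at len(dataset) / 2.
    print("len(dataset)    :", len(data_loader.dataset))
    print("len(data_loader):", len(data_loader))  # expected iterations
    print("batch_size      :", data_loader.batch_size)
    print("sampler         :", type(data_loader.sampler).__name__)
```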

stanny880913 commented 1 year ago

> You may check why the loop terminates when i = 18057, which is unexpected if len(data_loader.dataset) is 36114.

Thx for your help!!!