Closed: Wulin-Tan closed this issue 5 months ago.
I haven't seen this error before. Were you by any chance able to see how much GPU memory was in use? One thing I spotted is that in your config file you have
model:
losses_to_use:
- temporal
- pca_singleview
though in the data portion of your config file, you have
data:
columns_for_singleview_pca: null
which might be causing some issue? So you should either remove the pca loss from the list, or list out the keypoint indices you want to use in columns_for_singleview_pca.
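For reference, a hypothetical example of listing keypoint indices (the actual indices depend on your own keypoint ordering; these values are just placeholders):

```yaml
data:
  # indices of the keypoints to include in the single-view PCA loss (example values)
  columns_for_singleview_pca: [0, 1, 2, 3, 4, 5]
```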
To test this, you can also try setting
training:
train_frames: 100
which will train a model but only with 100 frames. That way you can debug a bit faster, without training the model on all the labeled frames.
Finally, I would also suggest trying
data:
image_resize_dims:
height: 384
width: 384
to see if that provides adequate results; that might speed up training quite a bit too.
I'll note that we're currently working on some updates for situations like yours, where the original image size is very large but the animal only occupies a small portion of the frame. We'll have a two-step pipeline: an object detection network finds the animal and crops around it, then a second pose estimation network operates on the crop. This is still at least a month out, but I wanted to let you know it's in the pipeline.
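Conceptually, such a detect-then-crop-then-pose pipeline could look like the sketch below. This is purely illustrative; the function names and shapes are placeholders, not the Lightning Pose API:

```python
import numpy as np

def detect_animal(frame: np.ndarray) -> tuple:
    """Stage 1 (placeholder): return a bounding box (x, y, w, h) around the animal.
    A real pipeline would run an object-detection network here."""
    h, w = frame.shape[:2]
    # pretend the animal occupies the central quarter of the frame
    return w // 4, h // 4, w // 2, h // 2

def estimate_pose(crop: np.ndarray) -> np.ndarray:
    """Stage 2 (placeholder): return (num_keypoints, 2) keypoints in crop coordinates.
    A real pipeline would run the pose-estimation network on the crop."""
    return np.zeros((17, 2))

def two_stage_pose(frame: np.ndarray) -> np.ndarray:
    """Detect the animal, crop around it, estimate pose, map back to full-frame coords."""
    x, y, w, h = detect_animal(frame)
    crop = frame[y:y + h, x:x + w]
    keypoints = estimate_pose(crop)
    # shift keypoints back into original image coordinates
    return keypoints + np.array([x, y])

if __name__ == "__main__":
    big_frame = np.zeros((2048, 2048, 3), dtype=np.uint8)  # large frame, small animal
    print(two_stage_pose(big_frame).shape)  # (17, 2)
```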
Hi @themattinthehatt, thank you for your advice. I figured out the problem. The error seemed like a 'traffic jam' in the GPU. So I checked the config file and found that the issue might be the relationship between train_batch_size in the training section (I call it the training batch size) and context/train/batch_size in the dali section (I call it the dali loading batch size). I set them to the same number, still within GPU memory, and the problem was solved. So must these two numbers be the same?
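For reference, the two settings being discussed live in different sections of the config file. A hypothetical sketch of the key paths, as described above (the values are just examples and should be tuned to your GPU memory):

```yaml
training:
  train_batch_size: 8      # batch size used for training on labeled frames

dali:
  context:
    train:
      batch_size: 8        # batch size DALI uses when loading unlabeled video frames
```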
Heard that you are working on this update, that is super cool! I think this two-step idea is important. Sometimes we just need the animal's rough location/area, so we can stop at step 1. If we want more details about the body parts, we can move on to step 2. And after the step 1 localization of the animal, the step 2 prediction can be narrowed down to a smaller specific area, which would help efficiency a lot. I also hope this feature can support multiple views / multiple animals.
What did you change the numbers to? No, they do not need to be the same, but it's possible some memory issue was cleared up if you made one or both smaller.
@themattinthehatt As I showed in the config at the beginning of this issue: with training batch size 8 and dali loading batch size 16, it caused the problem I raised here (the error appeared at epoch 22). With training batch size 8 and dali loading batch size 8, training could continue through epoch 300, as set in the config.
Cool, good to hear. It seems like maybe there was an issue with the memory? I bet if you changed the dali batch size to 10 or so it would also be fine (but that's not necessary).
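For anyone hitting similar memory-related crashes, one simple way to monitor GPU memory from Python uses standard PyTorch calls (this is general PyTorch, not specific to Lightning Pose):

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print currently allocated / reserved / peak GPU memory in GiB."""
    if not torch.cuda.is_available():
        print("CUDA not available")
        return
    gib = 1024 ** 3
    print(
        f"[{tag}] allocated={torch.cuda.memory_allocated() / gib:.2f} GiB, "
        f"reserved={torch.cuda.memory_reserved() / gib:.2f} GiB, "
        f"peak={torch.cuda.max_memory_allocated() / gib:.2f} GiB"
    )

# e.g. call log_gpu_memory("after batch") periodically during training,
# or watch `nvidia-smi -l 1` in a separate terminal.
```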
Hi, lightning pose team: I have about 2,000 labeled images from each video (video_1 and video_2, about 4,000 images in total). Now I want to train the temporal model. For the first 21 epochs the program ran well, but at the 22nd epoch it threw the error below and terminated. Any suggestions for a fix?
error: (attachment not shown)
config file: (attachment not shown)
hydra_train.py: (attachment not shown)
hardware:
GPU: L20 (48GB) * 1; CPU: 20 vCPU Intel(R) Xeon(R) Platinum 8457C

So I ran the code below to check:
```python
import os
import torch
import torch.nn.functional as F

# Set environment variables for debugging
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['HYDRA_FULL_ERROR'] = '1'

# Example upsample function with debugging
def upsample(inputs):
    assert inputs.ndim == 4, "Expected 4D tensor for upsampling"
    print(f"Upsampling input shape: {inputs.shape}")
    return F.interpolate(inputs, scale_factor=2, mode='bilinear', align_corners=True)

# Example filter2d function with debugging
def filter2d(input, kernel):
    assert kernel.ndim == 4, "Expected 4D kernel for filtering"
    assert input.ndim == 4, "Expected 4D input for filtering"
    print(f"Filtering input shape: {input.shape}, kernel shape: {kernel.shape}")
    return F.conv2d(input, kernel, padding=1, stride=1)

# Your training code with enhanced debugging
try:
    # Initialize model, optimizer, data loaders, etc.
    for epoch in range(num_epochs):
        for batch_idx, batch in enumerate(train_loader):
            try:
                # Debugging data batch
                print(f"Batch {batch_idx} shape: {batch['images'].shape}")

                # Forward pass
                outputs = model(batch)

                # Compute loss
                loss = criterion(outputs, batch)

                # Backward pass and optimization
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                if batch_idx % 100 == 0:
                    print(f"Epoch [{epoch}/{num_epochs}], Step [{batch_idx}/{len(train_loader)}], Loss: {loss.item()}")
            except Exception as batch_error:
                print(f"Error in batch {batch_idx}: {batch_error}")
                import traceback
                traceback.print_exc()
except Exception as e:
    print(f"Error during training: {e}")
    import traceback
    traceback.print_exc()
```

and got this result:
num_epochs not defined:
```
Error during training: name 'num_epochs' is not defined
Traceback (most recent call last):
  File "/tmp/ipykernel_97824/1395678165.py", line 25, in
```
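Note that this NameError comes from the standalone debug script itself, not from Lightning Pose: num_epochs (and the model/optimizer/loader placeholders) are never defined. A minimal, hypothetical fix for this particular error would be to define it before the loop, for example to match the 300 epochs mentioned above for the config:

```python
# define before the training loop; 300 matches the max epochs referenced in the config
num_epochs = 300
```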