ultralytics / ultralytics

NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

YOLOv8 3D Object detection in WOD (Waymo Open Dataset) #3078

Closed · atanasko closed this issue 1 year ago

atanasko commented 1 year ago

Search before asking

Question

Hi,

I would like to play with WOD (Waymo Open Dataset) and detect objects in LIDAR data. I already found #1058 and #1765, but I still have some ambiguity. If I understand correctly, in order to train a model to detect objects in LIDAR data from the Waymo Open Dataset, I'll need to train it on BEV (Bird's Eye View) images. To do that, I'll need to write a new Dataset class that loads data from WOD, and then also find the right loss function. Is this the correct path? Am I missing something here?

I have already looked into the Complex-YOLO, Complex-YOLOv3 and Complex-YOLOv4-Pytorch repositories.

Will I need to modify something in the YOLOv8 structure?

Thanks in advance, Atanasko

Additional

No response

glenn-jocher commented 1 year ago

@atanasko hi Atanasko,

Yes, you are correct. In order to train a model to detect objects in LIDAR data from the Waymo Open Dataset, you will need to create a new Dataset class that loads data from the dataset and converts it to BEV (Bird's Eye View) images. You will also need to select an appropriate loss function to train the model.

There should be no need to modify the YOLOv8 structure itself, just create the appropriate Dataset class to preprocess the data for training.
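
For context, the BEV conversion itself is typically a projection of the point cloud onto a ground-plane grid, similar to what the Complex-YOLO family does. A rough numpy sketch (the ranges, resolution and channel encoding below are illustrative assumptions, not something this repository provides):

import numpy as np

def pointcloud_to_bev(points, x_range=(0.0, 75.0), y_range=(-37.5, 37.5), resolution=0.1):
    # Project an (N, 4) array of LIDAR points (x, y, z, intensity) onto a
    # 3-channel BEV image encoding max height, max intensity and point density.
    h = int((x_range[1] - x_range[0]) / resolution)
    w = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((h, w, 3), dtype=np.float32)

    # keep only points inside the chosen ground rectangle
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[keep]

    # discretise x/y into pixel indices
    xi = ((pts[:, 0] - x_range[0]) / resolution).astype(np.int64)
    yi = ((pts[:, 1] - y_range[0]) / resolution).astype(np.int64)

    for x, y, z, r in zip(xi, yi, pts[:, 2], pts[:, 3]):
        bev[x, y, 0] = max(bev[x, y, 0], z)  # height channel
        bev[x, y, 1] = max(bev[x, y, 1], r)  # intensity channel
        bev[x, y, 2] += 1.0                  # density channel
    bev[..., 2] = np.minimum(1.0, np.log1p(bev[..., 2]) / np.log(64.0))
    return bev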

Let us know if you have any further questions.

Best, Glenn

atanasko commented 1 year ago

@glenn-jocher hi Glenn,

Thank you! I'll try it and come back to you if I have more questions. I'll create a fork of the YOLOv8 repository so my work can be seen there.

Best regards, Atanasko

glenn-jocher commented 1 year ago

@atanasko You're welcome! Please feel free to create a fork of the YOLOv8 repository and work from there. If you have any questions or run into any issues, don't hesitate to ask for help. We're happy to assist you.

Good luck with your project!

Best regards, Glenn

atanasko commented 1 year ago

@glenn-jocher hi Glenn,

Is building the dataset configurable? From what I can find in the code of the DetectionTrainer class, dataset building is not configurable, so I will need to modify that part of the code, right?

Thanks in advance, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

Yes, you're right that the current implementation of the DetectionTrainer class in YOLOv8 does not allow for easy configuration of the dataset building process. In order to modify the dataset building process, you'll need to make modifications to the code of the DetectionTrainer class.

One way to accomplish this would be to create a subclass of the DetectionTrainer class and override its methods for building the dataset, such as the prepare_data() method. This will allow you to modify the dataset building process to suit your specific needs and use it in your training pipeline.
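
As a rough illustration of the subclassing idea (the import path and the exact hook name depend on the installed ultralytics version, and WodDataset is a hypothetical class you would write yourself):

from ultralytics.yolo.v8.detect import DetectionTrainer  # path differs in newer versions


class WodDetectionTrainer(DetectionTrainer):
    def build_dataset(self, img_path, mode='train', batch=None):
        # Return a dataset that yields BEV images and labels built from WOD;
        # build_dataset is the hook in recent versions (older notes mention prepare_data).
        return WodDataset(img_path, imgsz=self.args.imgsz, augment=(mode == 'train'))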

I hope this explanation helps. Let me know if you have any further questions or concerns.

Best regards, Glenn

atanasko commented 1 year ago

@glenn-jocher hi Glenn,

Thank you for your time and support! If I understand correctly, in the YOLO class TASK_MAP I should substitute the newly created DetectionTrainer (e.g. WodDetectionTrainer) for the 'detect' entry, and then in the new trainer I should build the new dataset, right? And afterwards dive into the code to see how the data should be appropriately loaded?

Thanks in advance, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

Yes, you are correct! In the YOLO class TASK_MAP, you should substitute the newly created DetectionTrainer (ex. WodDetectionTrainer) so that it is used in the 'detect' task.

Once you have created the new DetectionTrainer, you can build a new dataset within the new trainer by defining a new dataset class for the specific data that you want to use (in your case, WOD data). You can then update the prepare_data() method within your new DetectionTrainer class to use the new dataset and preprocess the data as needed for training.

Afterwards, you may need to modify the code to correctly load and use the data within the train_dataloader() and val_dataloader() methods as needed for your specific task.
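
As a hedged sketch of how the registration could look (this assumes TASK_MAP maps each task name to a list whose second element is the trainer class; please verify the exact layout and import location in your installed version, as it has moved between releases):

from ultralytics import YOLO
from ultralytics.yolo.engine.model import TASK_MAP  # location may differ by version

from wod_trainer import WodDetectionTrainer  # hypothetical module holding the custom trainer

TASK_MAP['detect'][1] = WodDetectionTrainer

model = YOLO('yolov8n.pt')
model.train(data='wod.yaml', epochs=100, imgsz=640)

Depending on the version, YOLO.train() may also accept a custom trainer class directly, which would avoid touching TASK_MAP at all.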

I hope this helps! Let us know if you have any further questions or concerns.

Best regards, Glenn

atanasko commented 1 year ago

@glenn-jocher hi Glenn,

Thank you! I'll try as advised.

Best regards, Atanasko

atanasko commented 1 year ago

@glenn-jocher hi Glenn,

I did not find any prepare_data() method within DetectionTrainer, so I only override the
def build_dataset(self, img_path, mode='train', batch=None): method. In the new WodDataset I override

    def get_img_files(self, img_path):
        pass

    def get_labels(self):
        pass

and I assume get_img_files should return a list of BEV images and get_labels should return a list of training labels, right?
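
Something along these lines is what I have in mind, assuming the label-dict convention used by YOLODataset (field names may differ between versions; wod_frame_ids and wod_annotations are placeholders I still need to fill in):

def get_img_files(self, img_path):
    # one identifier per BEV frame, e.g. '<segment>#<laser name>#<timestamp>'
    return self.wod_frame_ids

def get_labels(self):
    labels = []
    for frame_id, boxes, classes in self.wod_annotations:
        labels.append(dict(
            im_file=frame_id,
            shape=(640, 640),            # BEV image height, width
            cls=classes.reshape(-1, 1),  # (n, 1) class indices
            bboxes=boxes,                # (n, 4) xywh, normalized to [0, 1]
            segments=[],
            keypoints=None,
            normalized=True,
            bbox_format='xywh'))
    return labels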

Thanks in advance, Atanasko

Tottowich commented 1 year ago

Hi @atanasko Have you been able to make any progress on this feature?

atanasko commented 1 year ago

@Tottowich hi,

I have updated the code as Glenn advised and was able to debug it. I am still analysing the overall flow, and I have some code from another project that creates a BEV image from WOD data. So there is some progress, but a few pieces of understanding are still missing, e.g. I did not find the latest methods Glenn pointed me to, so I am still analysing that part of the code.

Best regards, Atanasko

atanasko commented 1 year ago

@glenn-jocher hi Glenn,

I have worked further on the code changes and done the following:

I modified the loading process so that I cache all labels and images in "ram", because WOD v2 stores its information in "dask" format, and initially I wanted to avoid extracting images from the "dask" frames and storing them as files in the file system. For simplicity I use the same files to test the code, and it looks like labels and images are read into "ram". Into the im_files list I put names following the <segment>#<laser name>#<timestamp> convention, and into the label dictionary's im_file field I put the same value. Into the ims list I put the BEV images. During the caching process I get the following WARNING

train: Caching images (0.7GB ram): 100%|██████████| 3/3 [01:05<00:00, 21.76s/it]
[ WARN:0@90.090] global loadsave.cpp:244 findDecoder imread_('10017090168044687777_6380_000_6400_000#1#1550083484548776'): can't open/read file: check file path/integrity
[ WARN:0@90.091] global loadsave.cpp:244 findDecoder imread_('10023947602400723454_1120_000_1140_000#1#1552440204762531'): can't open/read file: check file path/integrity
[ WARN:0@90.091] global loadsave.cpp:244 findDecoder imread_('10017090168044687777_6380_000_6400_000#1#1550083471746058'): can't open/read file: check file path/integrity
[ WARN:0@90.091] global loadsave.cpp:244 findDecoder imread_('10017090168044687777_6380_000_6400_000#1#1550083478848939'): can't open/read file: check file path/integrity
[ WARN:0@90.093] global loadsave.cpp:244 findDecoder imread_('10017090168044687777_6380_000_6400_000#1#1550083483248542'): can't open/read file: check file path/integrity
[ WARN:0@90.093] global loadsave.cpp:244 findDecoder imread_('10017090168044687777_6380_000_6400_000#1#1550083477248893'): can't open/read file: check file path/integrity
[ WARN:0@90.093] global loadsave.cpp:244 findDecoder imread_('10017090168044687777_6380_000_6400_000#1#1550083486448634'): can't open/read file: check file path/integrity
[ WARN:0@90.093] global loadsave.cpp:244 findDecoder imread_('10023947602400723454_1120_000_1140_000#1#1552440213362519'): can't open/read file: check file path/integrity
[ WARN:0@90.094] global loadsave.cpp:244 findDecoder imread_('10023947602400723454_1120_000_1140_000#1#1552440208062640'): can't open/read file: check file path/integrity

and afterwards I have an error

 File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/data/base.py", line 152, in load_image
    raise FileNotFoundError(f'Image Not Found {f}')
FileNotFoundError: Image Not Found 10017090168044687777_6380_000_6400_000#1#1550083484548776

Any idea where the problem can come from?

Thanks in advance, Atanasko

atanasko commented 1 year ago

@glenn-jocher hi Glenn,

I think I found my problem: not all images are correctly put into the ims list. Let me try to track it down.

glenn-jocher commented 1 year ago

Hi @atanasko,

It seems that the issue you're encountering is related to images not being correctly added to the ims list during the caching process. This can be the cause of the subsequent "Image Not Found" error.

To troubleshoot this issue, I would recommend checking the following:

  1. Verify that the images you are attempting to load actually exist in the expected file path. Double-check the file path and ensure that all the necessary images are present.

  2. Make sure that the image filenames match the convention you mentioned: <segment>#<laser name>#<timestamp microseconds>. Ensure that the filenames in the im_files list correspond correctly to the actual image files.

  3. Check the image loading process. Make sure that the images are being read and added to the ims list correctly. You can try printing out the filenames and checking if the images are successfully loaded.

  4. Inspect the specific images that are throwing the "can't open/read file" warning. It's possible that there may be some issues with those particular images, such as file corruption or incorrect file format.

By carefully reviewing these points, you should be able to identify where the problem lies and address it accordingly.
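
One quick way to narrow this down is a diagnostic check right after caching; this is only a sketch, relying on the ims / im_files attributes you already mentioned:

# inside the custom dataset, after the RAM caching step
missing = [f for im, f in zip(self.ims, self.im_files) if im is None]
print(f'{len(missing)} of {len(self.im_files)} BEV images were not cached')
for f in missing[:10]:
    print('  ', f)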

Best regards, Glenn

atanasko commented 1 year ago

@glenn-jocher hi Glenn,

Thank you! I'm investigating the problem.

Best regards, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

Thank you for investigating the issue. In order to troubleshoot the problem further, here are a few suggestions you can try:

  1. Verify that the images you are attempting to load actually exist in the expected file path.
  2. Check that the image filenames match the convention of <segment>#<laser name>#<timestamp microseconds>.
  3. Double-check the image loading process to ensure that the images are being read and added to the appropriate list correctly.
  4. Look into the specific images that are throwing the "can't open/read file" warning and check for any file corruption or incorrect file format.

By carefully reviewing these steps, you should be able to identify the root cause of the issue and address it accordingly.

Best regards, Glenn

atanasko commented 1 year ago

hi @glenn-jocher,

I'm stuck with the next error:

Transferred 319/355 items from pretrained weights
TensorBoard: Start with 'tensorboard --logdir /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/runs/detect/train5', view at http://localhost:6006/
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
train: Scanning /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/wod/data/training/labels.cache... 596 images, 0 backgrounds, 0 corrupt: 100%|██████████| 3/3 [00:00<?, ?it/s]
train: Caching images (0.7GB ram): 100%|██████████| 3/3 [01:05<00:00, 21.86s/it]
val: Scanning /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/wod/data/training/labels.cache... 596 images, 0 backgrounds, 0 corrupt: 100%|██████████| 3/3 [00:00<?, ?it/s]
val: Caching images (0.7GB ram): 100%|██████████| 3/3 [01:06<00:00, 22.28s/it]
Plotting labels to /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/runs/detect/train5/labels.jpg... 

optimizer: AdamW(lr=0.001111, momentum=0.9) with parameter groups 57 weight(decay=0.0), 64 weight(decay=0.0005), 63 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/runs/detect/train5
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  0%|          | 0/38 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/train.py", line 8, in <module>
    results = model.train(data="wod.yaml", epochs=100, imgsz=640, cache='ram')
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/engine/model.py", line 374, in train
    self.trainer.train()
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/engine/trainer.py", line 192, in train
    self._do_train(world_size)
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/engine/trainer.py", line 313, in _do_train
    for i, batch in pbar:
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/venv/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/data/build.py", line 38, in __iter__
    yield next(self.iterator)
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/venv/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/data/base.py", line 239, in __getitem__
    return self.transforms(self.get_image_and_label(index))
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/data/augment.py", line 56, in __call__
    data = t(data)
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/data/augment.py", line 56, in __call__
    data = t(data)
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/data/augment.py", line 91, in __call__
    indexes = self.get_indexes()
  File "/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/yolo/data/augment.py", line 144, in get_indexes
    return random.choices(list(self.dataset.buffer), k=self.n - 1)
  File "/usr/lib/python3.10/random.py", line 519, in choices
    return [population[floor(random() * n)] for i in _repeat(None, k)]
  File "/usr/lib/python3.10/random.py", line 519, in <listcomp>
    return [population[floor(random() * n)] for i in _repeat(None, k)]
IndexError: list index out of range

Any advice on where the problem might come from?

Thanks in advance, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

It seems that you are encountering an error during the training process of YOLOv8. The specific error message indicates an IndexError with the following traceback:

IndexError: list index out of range

This error typically occurs when accessing an index that is outside the range of a list or array. In your case, it appears that this error is happening in the DataLoader worker process 0. The exact line causing the issue is:

return random.choices(list(self.dataset.buffer), k=self.n - 1)

A possible explanation for this error could be that the self.dataset.buffer list is empty or does not contain enough elements to satisfy the condition k=self.n - 1. This can happen if the dataset or buffer is not correctly initialized or if there is an issue with the indexing logic in the dataset code.

To troubleshoot this issue, I would suggest checking the following:

  1. Verify that the dataset and buffer are correctly initialized and populated with the necessary data.
  2. Double-check the indexing logic in the dataset code to ensure that it is correctly handling the buffer.
  3. Review any recent changes made to the code or the dataset that could have caused this issue.
  4. Consider printing out relevant variables such as the length of the buffer and the indices being accessed to further investigate the issue.

By carefully reviewing these points, you should be able to identify the root cause of the IndexError and resolve the problem.

Best regards, Glenn

atanasko commented 1 year ago

hi @glenn-jocher,

Thanks! Let me try to find the problem.

Best regards, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

Thank you for your response and for looking into the issue. It's great to see your proactive approach in troubleshooting the problem.

To further investigate the error you're encountering, I suggest focusing on the IndexError message and traceback that you shared. This error typically occurs when trying to access an index that is outside the range of a list or array.

In your case, the error is happening in the DataLoader worker process 0. The specific line causing the issue is related to accessing elements from a buffer list. It seems that the buffer list may not be properly initialized or there might be an issue with the indexing logic in the dataset code.

To resolve the issue, I recommend checking the following:

  1. Ensure that the dataset and buffer are correctly initialized and populated with the required data.
  2. Review the indexing logic in the dataset code to ensure it correctly handles the buffer list.
  3. Double-check any recent changes made to the code or the dataset that might have caused this issue.
  4. Investigate the length of the buffer list and the indices being accessed to identify any discrepancies.

By thoroughly examining these points, you should be able to identify and address the root cause of the IndexError.

Please let me know if you have any further questions or need additional assistance.

Best regards, Glenn

atanasko commented 1 year ago

hi @glenn-jocher,

Thank you so much for your support! I appreciate it!

I think the problem was that I was caching images in "ram" while "augmentation" was enabled, so in the load_image method of BaseDataset the buffer was not correctly initialized (I am not sure whether that would also be the case with another dataset). I have disabled "augmentation" for the moment and moved on. There are some other errors now, but I'll try to resolve them.

Best regards, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

You're welcome! I'm glad to hear that you were able to identify the issue and make progress.

It seems that the problem was related to caching images in "ram" while the "augmentation" option was enabled. This combination caused the buffer in the load_image method of the BaseDataset to not be properly initialized. Disabling the "augmentation" option resolved the issue for now.
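
If you would like to keep the rest of the augmentation pipeline, a lighter-touch alternative might be to disable only the mosaic transform, since the failing random.choices call comes from mosaic sampling the dataset buffer (an untested sketch, using standard training arguments):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
# mosaic=0.0 turns off the mosaic transform while leaving other augmentations active
model.train(data='wod.yaml', epochs=100, imgsz=640, cache='ram', mosaic=0.0)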

Now that you've resolved one issue, it's common to encounter additional errors as you continue with the training process. It's important to carefully review any new errors and their corresponding error messages to identify the underlying causes and address them accordingly.

If you have any further questions or need assistance with the new errors, feel free to ask. I'm here to help.

Best regards, Glenn

atanasko commented 1 year ago

hi @glenn-jocher,

I think I have resolved all the other problems and the training process starts, but now I am facing an out-of-memory error

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.79 GiB total capacity; 873.77 MiB already allocated; 18.75 MiB free; 892.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception in thread Thread-38 (_pin_memory_loop):

My system has an Nvidia Quadro RTX 4000 GPU with 8GB of RAM and 64GB of system RAM. I try to cache images in "ram", and I have read in the YOLOv8 requirements that a GPU with 8GB of RAM is required. My question is: are images cached in system RAM or in GPU RAM, and can that be the root cause of this error? I trigger the training with

train: Scanning /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/wod/data/training/lidar... 596 images, 0 backgrounds, 0 corrupt: 100%|██████████| 3/3 [00:37<00:00, 12.59s/it]
train: New cache created: /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/wod/data/training/labels.cache
train: Caching images (0.7GB ram): 100%|██████████| 3/3 [01:07<00:00, 22.40s/it]
val: Scanning /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/wod/data/training/labels.cache... 596 images, 0 backgrounds, 0 corrupt: 100%|██████████| 3/3 [00:00<?, ?it/s]
val: Caching images (0.7GB ram): 100%|██████████| 3/3 [01:06<00:00, 22.16s/it]

Thanks in advance, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

Great job on resolving the previous issues and getting the training process started! I understand that you are now encountering an out-of-memory error during training. The error message indicates a CUDA out-of-memory error, which suggests that the GPU memory is insufficient to allocate the required memory for the training process.

In your case, you mentioned that you have an Nvidia Quadro RTX 4000 GPU with 8GB of RAM. The YOLOv8 requirements state that a GPU with 8GB of RAM is necessary.

Regarding your question about image caching, by default, YOLOv8 does not cache images in the GPU RAM. Instead, it caches the images in the system RAM memory ("ram" mode). However, keep in mind that during training, the model and intermediate tensors are stored in the GPU memory, which can quickly consume the available GPU RAM.

To address the out-of-memory error, there are a few potential solutions you can try:

  1. Reduce batch size: Decrease the batch size passed to the training process. A smaller batch size reduces the amount of memory required by each iteration.
  2. Enable gradient accumulation: Instead of updating the model's weights after every batch, you can accumulate gradients over several batches and perform an update less frequently. This reduces the memory requirement for each update step.
  3. Use mixed-precision training: Switching to mixed-precision training can reduce the memory footprint of the model by utilizing lower-precision data types.

You can experiment with these techniques to see if they help alleviate the out-of-memory error. Remember to adjust other training parameters, such as learning rate and number of iterations, accordingly.
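
For example, points 1 and 3 above can be applied directly through the training arguments (the values below are illustrative):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
# A smaller batch size lowers per-iteration GPU memory; amp=True keeps
# Automatic Mixed Precision enabled (it is already the default).
model.train(data='wod.yaml', epochs=100, imgsz=640, batch=8, cache='ram', amp=True)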

I hope this explanation helps you resolve the issue. Let me know if you have any further questions. Good luck with your training process!

Best regards, Glenn

atanasko commented 1 year ago

hi @glenn-jocher ,

Thank you!

How can I "Enable gradient accumulation" and "Use mixed-precision training"?

Thanks in advance, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

You're welcome! I'm happy to help.

To "Enable gradient accumulation," you can accumulate gradients over several batches before updating the model's weights. This means that instead of updating the model after every batch, you update it after a certain number of batches. Accumulating gradients in this way reduces the memory required for each update step and can help alleviate the out-of-memory error you're experiencing. In your training code, you can set the gradient_accumulation_steps parameter to the desired number of batches for gradient accumulation. This will depend on your specific requirements and available memory.

To "Use mixed-precision training," you can utilize lower-precision data types for training. By default, models use single-precision floating-point numbers (FP32). With mixed-precision training, you can use a combination of lower-precision (FP16) and higher-precision (FP32) numbers, which reduces the memory footprint of the model. To enable mixed-precision training, you can use the Automatic Mixed Precision (AMP) feature provided by PyTorch. This feature automatically manages the use of lower-precision data types, optimizing memory usage during training.

Please note that the exact implementation details may vary depending on the training framework and codebase you are using. It's recommended to refer to the documentation or examples specific to your training framework (PyTorch or others) to get more detailed instructions on how to enable gradient accumulation and mixed-precision training.
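
As a generic PyTorch illustration of both ideas (this is not the Ultralytics trainer itself, which, as far as I recall, accumulates gradients internally based on its nominal batch size nbs and enables AMP by default):

import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(640, 5).to(device)                 # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))
accum_steps = 4                                      # effective batch = 4 x micro-batch

for step in range(16):
    x = torch.randn(8, 640, device=device)           # dummy micro-batch
    y = torch.randint(0, 5, (8,), device=device)
    with torch.cuda.amp.autocast(enabled=(device == 'cuda')):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss / accum_steps).backward()      # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:                # update only every accum_steps
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()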

I hope this explanation helps. Let me know if you have any further questions or need assistance. Good luck with your training!

Best regards, Glenn

atanasko commented 1 year ago

hi @glenn-jocher

I found that

[image]

so it looks to me that the problem is something with caching labels, not with the model or loading weights. I'll investigate this further.

Thanks, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

Thank you for bringing this to our attention. Based on the image you shared, it appears that you are encountering a problem with caching labels, rather than an issue with the model or loading weights.

To investigate further, I recommend focusing on the label caching process. Double-check the code responsible for caching labels and ensure that it is correctly set up to handle the labels for your specific dataset.

You may want to verify that the labels are being properly read and processed, and that they correspond to the correct images in the dataset. Additionally, check for any potential issues with the label file format or any inconsistencies in the parsing logic.

By thoroughly examining the label caching process, you should be able to identify and address the root cause of the problem.

If you require further assistance or have any other questions, please don't hesitate to ask.

Best regards, Glenn

atanasko commented 1 year ago

hi @glenn-jocher,

I found the root cause of the problem. It is the memory consumption when the WOD range image is converted to a Tensor. I'll look there and see how to resolve it. Thanks for your support!

Best regards, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

I'm glad to hear that you've identified the root cause of the memory consumption issue involving the conversion of the WOD range image to a Tensor. Depending on your specific code and the size of your range images, there may be different strategies to effectively manage memory usage during this process.

Please remember that the conversion to a Tensor is a crucial process, but it can indeed be memory-intensive, especially with large image sizes or when processing many images concurrently. You may want to consider techniques such as batch processing or on-the-fly image loading to help manage memory usage.

I'm confident that you'll be able to find an effective solution to this issue and make progress with your project. Please feel free to reach out if you have any other questions or need further assistance. Your perseverance and progress are admirable!

Best regards, Glenn

atanasko commented 1 year ago

hi @glenn-jocher,

I made another attempt, this time on another machine with 2 GPUs (NVIDIA GeForce RTX 3080 and NVIDIA Quadro RTX 4000). I also added the following parameters: -m torch.distributed.run --nproc_per_node 2 --device 0,1

but I still get an out-of-memory error

[image]

Any advice how to proceed?

Thanks in advance, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

Thank you for sharing this information. I can understand how out-of-memory errors could be very challenging particularly when you have multiple GPUs available and yet they still occur.

The error message indicates that your process is still running out of GPU memory during training. Using multiple GPUs should indeed allow you to train larger models or use larger batch sizes by distributing the load; however, it's also essential to ensure your setup correctly utilizes multi-GPU configuration.

The "--nproc_per_node 2 --device 0,1" option you're using is a good starting point as it indicates to PyTorch to use both of the available GPUs. However, you should keep in mind that the data and model will then be split across both devices, and each one will still require enough memory to handle its share.

There are a few approaches you can take to try and mitigate these memory-related issues.

  1. Try to decrease the batch size. This is often the most straightforward step to reduce memory usage, as smaller batches will require less memory. Even when spreading the load across multiple GPUs, each batch is typically divided amongst the available devices, so smaller batches will use less memory on each GPU.

  2. You can also try enabling gradient checkpointing, which is a method that trades compute for memory. It would lengthen your training time a bit but it might help with the memory issue.

  3. Another option is to try mixed precision training, which mostly uses half-precision floats, reducing both the memory requirements and the compute time on modern GPUs.

Remember to keep an eye on the GPU memory usage (for instance using nvidia-smi) during the training process to see how it changes and what might be causing any spikes that lead to out-of-memory errors.
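
If it is easier than watching nvidia-smi, you can also print memory statistics from inside the process using standard PyTorch calls:

import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024 ** 2
        reserved = torch.cuda.memory_reserved(i) / 1024 ** 2
        total = torch.cuda.get_device_properties(i).total_memory / 1024 ** 2
        print(f'GPU {i}: {alloc:.0f} MiB allocated, {reserved:.0f} MiB reserved, {total:.0f} MiB total')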

I hope this helps. Happy coding!

Best Regards, Glenn

atanasko commented 1 year ago

hi @glenn-jocher,

Thanks once again for your support!

My previous info was misleading, sorry! The problem was not in the model or the PyTorch configuration but in reading data from WOD. WOD uses TensorFlow, and TensorFlow's GPU memory allocation while reading from WOD was "greedy" and consumed a lot of memory. It took me a while to understand. So what I did was enable TensorFlow memory growth, which unblocked the training process:

import tensorflow as tf

# Allocate GPU memory on demand instead of letting TensorFlow grab it all up front.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
  except RuntimeError as e:
    # Memory growth must be set before the GPUs have been initialized
    print(e)

Anyhow, the process was not completely successful using two GPUs; it finished when I used only a single one (I'll investigate the multi-GPU problem later). So for the moment the training process finished successfully with a small amount of data. I will go over the code changes to verify the data are read correctly and will try to retrain the model with more data from WOD. I'll report my progress.

Best regards, Atanasko

glenn-jocher commented 1 year ago

Hi @atanasko,

Thank you for sharing this update. It's great to hear that you've identified and solved the issue related to memory consumption during the data reading process from the WOD using TensorFlow.

Setting memory growth in TensorFlow, so that it no longer greedily allocates all of the GPU memory up front, seems to have relieved some of the stress on your system and allowed the training process to proceed. This is a very good step in optimizing your implementation.
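
If the TensorFlow side is only needed for reading the WOD records, another option worth trying is to hide the GPUs from TensorFlow entirely, so that all GPU memory stays available to PyTorch (this must run before any TensorFlow op touches the GPU):

import tensorflow as tf

# TensorFlow will then read the WOD data on the CPU only.
tf.config.set_visible_devices([], 'GPU')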

It's also good that you've been able to complete the training process with a smaller dataset and a single GPU, although it's unfortunate to hear that the two-GPU setup still presents challenges. Perhaps there are synchronization issues or additional memory requirements when utilizing more than one GPU.

I agree with your approach to first verify the data reading process and then proceed to train the model with a larger dataset. This step-by-step process will allow you to zero in on any problems or inefficiencies in your implementation.

I'm looking forward to hearing about your progress. Don't hesitate to reach out if you have any more updates or require any assistance moving forward.

Best regards, Glenn

github-actions[bot] commented 1 year ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

atanasko commented 12 months ago

Hi @glenn-jocher,

I spent some time trying to train the model starting from "yolov8n.pt", and it looks like the image and label lists are filled correctly, but when I try to use the resulting "best.pt" model I get no detections. Below I add the training log and the detect script.

**** training log ****

```
/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/venv/bin/python /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/train_wod.py
WARNING ⚠️ 'ultralytics.yolo.cfg' is deprecated since '8.0.136' and will be removed in '8.1.0'. Please use 'ultralytics.cfg' instead.
New https://pypi.org/project/ultralytics/8.0.180 available 😃 Update with 'pip install -U ultralytics'
Ultralytics YOLOv8.0.176 🚀 Python-3.10.10 torch-2.0.1+cu117 CUDA:0 (Quadro RTX 5000, 16123MiB)
engine/trainer: task=detect, mode=train, model=yolov8n.pt, data=wod.yaml, epochs=30, patience=50, batch=32, imgsz=640, save=True, save_period=-1, cache=ram, device=None, workers=8, project=None, name=None, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, show=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, vid_stride=1, stream_buffer=False, line_width=None, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, boxes=True, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0, cfg=None, tracker=botsort.yaml, save_dir=/data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/runs/detect/train
Overriding model.yaml nc=80 with nc=5

                   from  n    params  module                                       arguments
  0                  -1  1       464  ultralytics.nn.modules.conv.Conv             [3, 16, 3, 2]
  1                  -1  1      4672  ultralytics.nn.modules.conv.Conv             [16, 32, 3, 2]
  2                  -1  1      7360  ultralytics.nn.modules.block.C2f             [32, 32, 1, True]
  3                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]
  4                  -1  2     49664  ultralytics.nn.modules.block.C2f             [64, 64, 2, True]
  5                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]
  6                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]
  7                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]
  8                  -1  1    460288  ultralytics.nn.modules.block.C2f             [256, 256, 1, True]
  9                  -1  1    164608  ultralytics.nn.modules.block.SPPF            [256, 256, 5]
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 12                  -1  1    148224  ultralytics.nn.modules.block.C2f             [384, 128, 1]
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 15                  -1  1     37248  ultralytics.nn.modules.block.C2f             [192, 64, 1]
 16                  -1  1     36992  ultralytics.nn.modules.conv.Conv             [64, 64, 3, 2]
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 18                  -1  1    123648  ultralytics.nn.modules.block.C2f             [192, 128, 1]
 19                  -1  1    147712  ultralytics.nn.modules.conv.Conv             [128, 128, 3, 2]
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 21                  -1  1    493056  ultralytics.nn.modules.block.C2f             [384, 256, 1]
 22        [15, 18, 21]  1    752287  ultralytics.nn.modules.head.Detect           [5, [64, 128, 256]]
Model summary: 225 layers, 3011823 parameters, 3011807 gradients

Transferred 319/355 items from pretrained weights
TensorBoard: Start with 'tensorboard --logdir /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/runs/detect/train', view at http://localhost:6006/
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
train: Scanning /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/datasets/wod/data/training/lidar... 199 images, 0 backgrounds, 0 corrupt: 100%|██████████| 1/1 [00:12<00:00, 12.22s/it]
train: New cache created: /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/datasets/wod/data/training/labels.cache
train: Caching images (0.2GB ram): 100%|██████████| 1/1 [00:22<00:00, 22.09s/it]
val: Scanning /data/DEVELOPMENT/AUTONOMOUS/project/ultralytics/ultralytics/datasets/wod/data/validation/labels.cache... 595 images, 0 backgrounds, 0 corrupt: 100%|██████████| 3/3 [00:00
```

and "train.py"

from ultralytics import YOLO

# Load a model
model = YOLO("yolov8n.pt")  # load a pretrained model (recommended for training)

# Train the model
results = model.train(data="wod.yaml", epochs=30, imgsz=640, batch=32, cache='ram', augment=False)  # device=[0, 1]

and "detect.py"

from PIL import Image
from ultralytics import YOLO

# Load a pretrained YOLOv8n model
model = YOLO('~/project/ultralytics/runs/detect/train/weights/best.pt')

# Run inference on 'bus.jpg'
results = model('~/project/ultralytics/img.png')  # run inference on a BEV image; results list

# Show the results
for r in results:
    im_array = r.plot()  # plot a BGR numpy array of predictions
    im = Image.fromarray(im_array[..., ::-1])  # RGB PIL image
    im.show()  # show image

Any advice on how to diagnose the problem?

Thanks in advance, Atanasko

atanasko commented 11 months ago

hi @glenn-jocher

Problem resolved. It was in the label normalization (boxes must be normalized to the range [0, 1]). With this fixed, all the problems are resolved and I successfully trained YOLOv8 to detect objects in point-cloud BEV images.
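
For anyone hitting the same issue, the fix boils down to converting the pixel-space BEV boxes into the normalized YOLO label format before writing the labels. A rough sketch of such a conversion (it assumes axis-aligned corner boxes in BEV pixel coordinates):

def to_yolo_label(cls_id, x1, y1, x2, y2, img_w, img_h):
    # class x_center y_center width height, all normalized to [0, 1]
    xc = (x1 + x2) / 2.0 / img_w
    yc = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f'{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}'

# e.g. to_yolo_label(0, 100, 220, 180, 300, 640, 640)
# -> '0 0.218750 0.406250 0.125000 0.125000'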

Thanks for your support! Atanasko

devendraswamy commented 9 months ago

@atanasko Can you please share the code.

atanasko commented 9 months ago

hi @devendraswamy,

Please have a look at WOD-3D-Object-Detection and my fork of ultralytics YOLOv8, specifically the wod and wod_convert_dataset branches. In the wod branch the data from the WOD dataset are loaded directly into memory, and in the wod_convert_dataset branch I convert the data into the YOLOv8 directory format, because loading into memory only works with a smaller subset.

Sriyab002 commented 9 months ago

hi @atanasko, I see that you have successfully trained the YOLOv8 model with BEV images of the Waymo dataset. I wanted to know how the results were. Did you need oriented-bounding-box information for the BEV images? I am working on the KITTI dataset, whose BEV images have oriented objects with an extra label field called yaw angle, but since YOLO does not take a yaw angle in the labels I am not obtaining accurate results after training.

atanasko commented 9 months ago

hi @Sriyab002,

Yes, I successfully trained a model in two different ways. Initially I modified the ultralytics YOLOv8 code to load labels and images into memory, and this works, but later I realized that the loading process does not implement any paging mechanism, so I decided to convert the data from WOD into the YOLOv8 preferred directory structure and trained the model on the full WOD dataset. I get good results where there is no object orientation, but where there is orientation the results are not good enough (same as in your case). I am also waiting for the OBB implementation in YOLOv8 to try training the model again to detect objects with orientation.

Sriyab002 commented 8 months ago

hi @atanasko, thank you for sharing your insights. Like I said, I am working with the KITTI dataset, and I am trying to modify the model architecture using the Complex-YOLOv4 architecture as a reference.