thiagoribeirodamotta closed this issue 1 year ago
After reading the links below, it seems my errors occurred because the dataloaders in my config file were missing a sampler. After adding sampler=dict(_delete_=True, type='DefaultSampler', shuffle=True)
to train_dataloader, val_dataloader and test_dataloader, and collate_fn=dict(_delete_=True, type='yolov5_collate')
to train_dataloader, the aforementioned error disappeared.
Link 1: https://github.com/FishAndWasabi/YOLO-MS/issues/4
Link 2: https://github.com/FishAndWasabi/YOLO-MS/issues/8
Prerequisite
Describe the bug
When running mmyolo's tools/train.py script with a custom config file, the following error appears:
Traceback (most recent call last):
  File "/project/src/mmyolo/tools/train.py", line 123, in <module>
    main()
  File "/project/src/mmyolo/tools/train.py", line 119, in main
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1745, in train
    model = self.train_loop.run()  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 96, in run
    self.run_epoch()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 111, in run_epoch
    for idx, data_batch in enumerate(self.dataloader):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1348, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1374, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 697, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
TypeError: yolov5_collate() got an unexpected keyword argument '_scope_'
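The last frame suggests that a leftover config key was bound to the collate function as a keyword argument. A minimal sketch of that failure mode is below. It assumes the runner pops type from the collate_fn config and binds the remaining keys with functools.partial; toy_collate is a hypothetical stand-in, not mmyolo's yolov5_collate.

```python
from functools import partial


def toy_collate(data_batch):
    """Hypothetical stand-in for a collate function that takes no extra kwargs."""
    return list(data_batch)


# Assumed state of a merged collate_fn config that still carries a stray key.
collate_cfg = dict(type='toy_collate', _scope_='mmyolo')
collate_cfg.pop('type')

# Binding the leftover keys succeeds; nothing is validated at this point.
bound_collate = partial(toy_collate, **collate_cfg)

# The failure only shows up when the DataLoader actually calls the function.
try:
    bound_collate([1, 2, 3])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

This would explain why the error surfaces inside a DataLoader worker during the first batch rather than when the config is loaded.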
Environment
I made a custom installation of mmyolo using a Dockerfile based on an NVIDIA image (23.08-py3), which ships PyTorch 2.1.0a0+29c30b1.
Since the PyTorch version in the NVIDIA container is newer than 2.0.0, there is a bug with the latest version of opencv-python, which had to be downgraded to 4.8.0.74.
Also because of the PyTorch version, I had to git clone MMCV instead of installing it with mim: the C++17 compiler fix was merged only about 2 weeks ago, whereas the latest MMCV version available through mim was released a few months ago.
The directory structure is currently as follows:
Installation was done with the following Dockerfile:
Additional information
Since MMYolo was installed as a third-party tool with the command
mim install "mmyolo"
I manually copied the tools/train.py script to a custom folder (no modifications were made to it). Besides that, the following config file is being used (I mainly changed the paths and dataset configs; the dataset consists of a single class, with separate files for training and validation; I also changed _base_ to use yolov7, as follows: _base_ = 'mmyolo::yolov7/yolov7_l_syncbn_fast_8x16b-300e_coco.py'):