voldemortX / pytorch-auto-drive

PytorchAutoDrive: Segmentation models (ERFNet, ENet, DeepLab, FCN...) and Lane detection models (SCNN, RESA, LSTR, LaneATT, BézierLaneNet...) based on PyTorch with fast training, visualization, benchmarking & deployment help
BSD 3-Clause "New" or "Revised" License

Running error #76

Closed liangjxiong closed 2 years ago

liangjxiong commented 2 years ago

Hello, I trained with my own dataset, and the following error occurred: Target size (torch.Size([20, 4])) must be the same as input size (torch.Size([20, 6])). How should I solve this problem? Thank you!

voldemortX commented 2 years ago

@liangjxiong Hi! Could you give some details, e.g., which model are you using and which layer/part of the network gives that error?

liangjxiong commented 2 years ago

Thank you! I have solved the above problem, but I have run into a new one. The command I ran is python main_landet.py --train --config=configs/lane_detection/scnn/resnet18_tusimple.py --mixed-precision. The following error occurred:

[1, 1] training loss: 1.9832
[1, 1] loss seg: 1.9147
[1, 1] loss exist: 0.6852
[1, 2] training loss: 1.9865
[1, 2] loss seg: 1.9177
[1, 2] loss exist: 0.6883
[1, 3] training loss: 1.3278
[1, 3] loss seg: 1.2590
[1, 3] loss exist: 0.6879
[1, 4] training loss: 0.6099
[1, 4] loss seg: 0.5421
[1, 4] loss exist: 0.6780
[1, 5] training loss: 0.9194
[1, 5] loss seg: 0.8511
[1, 5] loss exist: 0.6827
Traceback (most recent call last):
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/main_landet.py", line 65, in <module>
    runner.run()
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/runners/lane_det_trainer.py", line 35, in run
    for i, data in enumerate(self.dataloader, 0):
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 5.
Original Traceback (most recent call last):
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 84, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 84, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [6] at entry 0 and [7] at entry 7

My labels span rows 5 to 715, not rows 240 to 710. Can the model accept this change?

voldemortX commented 2 years ago

@liangjxiong It seems your dataset does not load tensors of the same dimensions for every image. As for 5-715, it should work correctly if you provide the start, ppl, end info correctly in your Dataset class.

liangjxiong commented 2 years ago

The resolution of image and label is 1280 * 720. But some pictures contain four lane lines, and some pictures have three lane lines. Will these have an impact?

voldemortX commented 2 years ago

The resolution of image and label is 1280 * 720. But some pictures contain four lane lines, and some pictures have three lane lines. Will these have an impact?

FYI, tensors of different shapes can't simply be batched. That is why all our non-segmentation methods use a dict to hold the labels, apply dict_collate_fn, and stack the GT in the loss class. However, if you are using SCNN, the segmentation labels should all have the same format. Does your lane_existence GT have different shapes across images?

liangjxiong commented 2 years ago

Sorry, I haven't studied this in depth. What is the format of the segmentation labels? What does "lane_existence GT have different shapes" mean? All of our lanes are straight lines. What should I do now?

voldemortX commented 2 years ago

@liangjxiong If you refer to the TuSimple/CULane/LLAMAS labeling, you will find that the labels have two parts for each image. 1. The segmentation mask, which is conceptually H x W x (C + 1): C lane classes (counted in pairs, we'll get to that later) plus 1 for background. In practice it is often stored as H x W x 1. 2. The lane existence classification label, which marks the existence of each lane class and has length C. C is the same for every image; if an image does not have C lanes, it simply has 0 in the labels of the missing lanes.

Note that C is not the maximum number of lanes visible in an image. Lanes are classified as ego-lanes (2 of them), left-right immediate lanes (another 2), and you can add more pairs (for instance, TuSimple considers 6). For your case, it is probably 4.
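As a rough illustration of the second label part, the sketch below derives a fixed-length existence vector from a class-index mask. It is not code from the repository: the function name existence_flags and the toy mask are assumptions, and the mask is a plain nested list rather than an image file.

```python
from typing import List


def existence_flags(mask: List[List[int]], num_lanes: int) -> List[int]:
    """Return a 0/1 vector of length num_lanes.

    Entry i is 1 if lane class i + 1 appears anywhere in the mask.
    Class 0 is treated as background (an assumption of this sketch).
    """
    present = {pixel for row in mask for pixel in row}
    return [1 if lane_id in present else 0 for lane_id in range(1, num_lanes + 1)]


# Toy 3 x 4 mask: 0 is background, 1..C are lane classes.
mask = [
    [0, 1, 0, 2],
    [0, 1, 0, 2],
    [1, 0, 0, 3],
]

flags = existence_flags(mask, num_lanes=4)
print(flags)  # [1, 1, 1, 0]
```

The vector always has length C, so an image with three lanes and an image with four lanes produce labels of the same shape and can be stacked into one batch.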

liangjxiong commented 2 years ago

train.txt file format: train / image / 345 1 1 1 0 0 0 Is that right?

voldemortX commented 2 years ago

@liangjxiong it does seem quite correct, as long as all your images have the same 6 existence flags at the end.

voldemortX commented 2 years ago

You can add some print or debug around here to see if some line in train.txt has parsing issues or 7 flags.
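A minimal sketch of such a check, assuming each line of the list file is one whitespace-free image path followed by 0/1 existence flags. The helper name find_bad_lines and the sample paths are hypothetical and not part of the repository.

```python
from typing import Iterable, List, Tuple


def find_bad_lines(lines: Iterable[str], num_flags: int) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line whose flags look wrong.

    Assumes the first token is the image path (no spaces) and that
    the remaining tokens are the existence flags.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        if not tokens:
            continue  # ignore blank lines
        flags = tokens[1:]
        wrong_count = len(flags) != num_flags
        wrong_value = any(flag not in ("0", "1") for flag in flags)
        if wrong_count or wrong_value:
            bad.append((number, line.rstrip("\n")))
    return bad


sample = [
    "train/image/345 1 1 1 0 0 0\n",
    "train/image/346 1 1 1 1 0 0 0\n",  # 7 flags instead of 6
]
print(find_bad_lines(sample, num_flags=6))
```

On a real list file, the same function can be called with an open file handle, for example `with open("train.txt") as f: print(find_bad_lines(f, num_flags=6))`, where the file name is whatever your dataset uses.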

liangjxiong commented 2 years ago

Thank you! I can use my dataset to train the model.

voldemortX commented 2 years ago

Thank you! I can use my dataset to train the model.

Sounds great! Since you resolved the problem, I'll close this issue. But do feel free to reopen.

liangjxiong commented 2 years ago

Hello! I have a new problem. My own dataset does not use 56 rows, but the model's prediction output has 56 rows, so a format error occurs when running the TuSimple test script. How can I modify the code so that the predicted output matches the number of rows I set? Thank you!

liangjxiong commented 2 years ago

Now the model output has the number of rows I want, but it still starts from row 160. I want it to start from row 5 with an interval of 10. How should I modify it? Thank you!

voldemortX commented 2 years ago

@liangjxiong do you mean 5, 160 and 10 by pixels?

voldemortX commented 2 years ago

You can search for things like 160, or if dataset == 'tusimple'. For instance, they are set in these places for tusimple testing:

https://github.com/voldemortX/pytorch-auto-drive/blob/4f6527660ef3e285e9bb92f374f495f33e32216a/utils/runners/lane_det_tester.py#L73

https://github.com/voldemortX/pytorch-auto-drive/blob/4f6527660ef3e285e9bb92f374f495f33e32216a/utils/lane_det_utils.py#L31

voldemortX commented 2 years ago

I'd suggest adding your own customized code (an elif branch) for your customized dataset, although modifying the tusimple code is also fine.
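To make the row settings discussed above concrete, here is a small sketch of how a start row, end row and interval translate into the list of sampled rows. The helper row_anchors is hypothetical; in the repository the values are set at the linked locations.

```python
from typing import List


def row_anchors(start: int, end: int, step: int) -> List[int]:
    """Rows (in pixels) at which lane x-coordinates are sampled, end inclusive."""
    return list(range(start, end + 1, step))


tusimple_rows = row_anchors(160, 710, 10)  # TuSimple-style sampling rows
custom_rows = row_anchors(5, 715, 10)      # the custom dataset from this thread

print(len(tusimple_rows), len(custom_rows))  # 56 72
```

The 56 rows in the default output correspond to rows 160 through 710 at a 10-pixel interval, while rows 5 through 715 give 72, so both the start value and the row count need to change for a custom dataset.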

liangjxiong commented 2 years ago

Thank you! The above problems have been solved. How can I test the inference speed of the model?

voldemortX commented 2 years ago

@liangjxiong You can refer to BENCHMARK.md for speed testing.

liangjxiong commented 2 years ago

I ran: python tools/profiling.py --mode=simple --config=configs/lane_detection/scnn/resnet18_tusimple.py --times=3 --height=720 --width=1280

The following error occurred:

Traceback (most recent call last):
  File "tools/profiling.py", line 39, in <module>
    cfg = read_config(args.config)
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/args.py", line 57, in read_config
    module = SourceFileLoader(module_name, config_path).load_module()
  File "", line 399, in _check_name_wrapper
  File "", line 823, in load_module
  File "", line 682, in load_module
  File "", line 265, in _load_module_shim
  File "", line 684, in _load
  File "", line 665, in _load_unlocked
  File "", line 678, in exec_module
  File "", line 219, in _call_with_frames_removed
  File "configs/lane_detection/scnn/resnet18_tusimple.py", line 2, in <module>
    from configs.lane_detection.common.datasets.tusimple_seg import dataset
ModuleNotFoundError: No module named 'configs.lane_detection'

voldemortX commented 2 years ago

It seems you are the second person to have this issue. I will look into it again later. In the meantime, try moving tools/profiling.py to profiling.py.

liangjxiong commented 2 years ago

Thank you! I'm looking forward to it.

voldemortX commented 2 years ago

@liangjxiong I still can't figure out why loading Python files fails in certain environments. In fact, I built new envs and they all work fine. However, I do have a recommended solution (other than copying files out of ./tools):

export PYTHONPATH=$PWD:$PYTHONPATH

Execute that when you are in the pytorch-auto-drive folder and everything should be fine.
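One plausible explanation, which is not confirmed anywhere in this thread, is that Python puts the directory of the launched script (here tools/) at the front of sys.path instead of the current directory, so the top-level configs package is not found unless the repository root is added some other way. The sketch below does in Python what the export line does in the shell; ensure_importable is a hypothetical helper, not part of the repository.

```python
import sys
from pathlib import Path


def ensure_importable(root: str) -> str:
    """Put the given directory at the front of sys.path if it is not there yet."""
    resolved = str(Path(root).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Assumes the current working directory is the repository root.
repo_root = ensure_importable(".")
print(repo_root in sys.path)  # True
```

Setting PYTHONPATH in the shell has the advantage of leaving the scripts untouched, which is why it is the less invasive of the two options.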

voldemortX commented 2 years ago

If anyone figures this out, please post here and let everyone know.

voldemortX commented 2 years ago

Thank you! I'm looking forward to it.

This should be fixed by #86 . Feel free to test and reopen if the problem persists.

liangjxiong commented 2 years ago

Hello! After running export PYTHONPATH=$PWD:$PYTHONPATH, a new problem appeared!

(pad) xianjin@xianjin-W580-G20:/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive$ python tools/profiling.py --mode=simple --config=configs/lane_detection/scnn/resnet18_tusimple.py --times=3 --height=720 --width=1280
Loaded torchvision ImageNet pre-trained weights V1.
cuda:0
torch.float32
Traceback (most recent call last):
  File "tools/profiling.py", line 55, in <module>
    fps.append(speed_evaluate_simple(net=net, device=device, dummy=dummy, num=300))
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/profiling_utils.py", line 95, in speed_evaluate_simple
    _ = net(dummy)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/models/segmentation/_utils.py", line 61, in forward
    result['lane'] = self.lane_classifier(x.softmax(dim=1))
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/models/common_models/heads/simple_lane_exist.py", line 22, in forward
    output = self.linear1(output)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/nn/functional.py", line 1674, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0

voldemortX commented 2 years ago

@liangjxiong Your profiling height and width may not match the model in the config. I think they should be h=360, w=640.

liangjxiong commented 2 years ago

It runs now. The TuSimple dataset is h=720, w=1280. Is it downscaled by half before being fed to the model?

voldemortX commented 2 years ago

It runs now. The TuSimple dataset is h=720, w=1280. Is it downscaled by half before being fed to the model?

Yes. This is the default input size that everyone uses.
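The earlier "mat1 dim 1 must match mat2 dim 0" error follows from this: a linear layer applied to a flattened feature map has a fixed input width, so it only accepts the resolution it was configured for. The sketch below shows the arithmetic with purely illustrative numbers; the total stride of 16 and the 7 channels are assumptions made for this example, not values read from the repository.

```python
def flat_features(height: int, width: int, stride: int, channels: int) -> int:
    """Size of a feature map after downsampling by `stride` and flattening."""
    return channels * (height // stride) * (width // stride)


# Illustrative values only (assumed, not taken from the config).
STRIDE = 16
CHANNELS = 7

expected = flat_features(360, 640, STRIDE, CHANNELS)    # size the layer was built for
received = flat_features(720, 1280, STRIDE, CHANNELS)   # size produced by a 720 x 1280 input

print(expected, received)  # 6160 25200
```

Because the two sizes differ, the matrix multiplication inside the linear layer fails, and profiling at the configured resolution (h=360, w=640) avoids the mismatch.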

liangjxiong commented 2 years ago

I understand, thank you!

liangjxiong commented 2 years ago

Hello! I just updated to your latest code and got the following error.

(pad) xianjin@xianjin-W580-G20:/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive$ CUDA_VISIBLE_DEVICES=1 python main_landet.py --train --config=configs/lane_detection/scnn/resnet18_tusimple.py --mixed-precision
Loaded torchvision ImageNet pre-trained weights V1.
Not using distributed mode
cuda
Traceback (most recent call last):
  File "main_landet.py", line 64, in <module>
    runner = Runner(cfg=cfg)
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/runners/lane_det_trainer.py", line 17, in __init__
    super().__init__(cfg)
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/runners/base.py", line 120, in __init__
    self.init_exp_dir(cfg, 'train')
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/runners/base.py", line 93, in init_exp_dir
    f.write(json.dumps(cfg, indent=4))
TypeError: <function import_from at 0x7fc79eb06a60> is not JSON serializable

What is the cause?

voldemortX commented 2 years ago

@liangjxiong It seems I missed a commit. Could you try the current master?

liangjxiong commented 2 years ago

Thank you so much! It runs normally!

liangjxiong commented 2 years ago

Hello! With the same dataset as before, SCNN runs fine, but when I run python main_landet.py --train --config=configs/lane_detection/lstr/resnet18s_tusimple.py, the following error occurs:

Not using distributed mode
cuda
Loading targets into memory...
100%|████████████████████| 227/227 [00:00<00:00, 643.67it/s]
Traceback (most recent call last):
  File "main_landet.py", line 65, in <module>
    runner.run()
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/runners/lane_det_trainer.py", line 52, in run
    self.model)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/losses/hungarian_loss.py", line 124, in forward
    loss, log_dict = self.calc_full_loss(outputs=outputs, targets=targets)
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/losses/hungarian_loss.py", line 136, in calc_full_loss
    indices = self.matcher(outputs=outputs, targets=targets)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/losses/hungarian_loss.py", line 71, in forward
    norm_weights, valid_points = lane_normalize_in_batch(target_keypoints)  # G, G x N
  File "/home/xianjin/anaconda3/envs/pad/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/data/disk1/liangjxiong/pycharm_project/pytorch-auto-drive/pytorch-auto-drive/utils/losses/hungarian_loss.py", line 24, in lane_normalize_in_batch
    norm_weights /= norm_weights.max()
RuntimeError: operation does not have an identity.

voldemortX commented 2 years ago

Let me verify tomorrow whether that is a bug.

liangjxiong commented 2 years ago

Thank you! I have solved it! It was a problem with my dataset settings. There is nothing wrong with your code.

voldemortX commented 2 years ago

-_- OK, then I'll close this.

mengxia1994 commented 1 year ago

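Thank you! I have solved it! It was a problem with my dataset settings. There is nothing wrong with your code.

May I ask how you solved it? I ran into the same problem: running LSTR raises the error at this point.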