Unable to reproduce results - GitHub issues

Isaac-xie commented 1 year ago

When I train with the script `python scripts/train.py +experiment=gkt_nuscenes_vehicle_kernel_7x1.yaml data.dataset_dir=<path/to/nuScenes> data.labels_dir=<path/to/labels>`, the metrics after training are low.

When I evaluate with the model released on GitHub, `python scripts/eval.py +experiment=gkt_nuscenes_vehicle_kernel_7x1.yaml data.dataset_dir=<path/to/nuScenes> data.labels_dir=<path/to/labels> experiment.ckptt <path/to/checkpoint>` fails with:

`optimizer_loop.optim_progress.optimizer.step.total.completed = self._loaded_checkpoint["global_step"]`
`KeyError: 'global_step'`
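A `KeyError: 'global_step'` at load time suggests the file lacks the trainer-state entries that PyTorch Lightning looks for when restoring a checkpoint, which can happen when a file stores only model weights. Below is a minimal sketch for checking this, assuming the checkpoint has already been loaded into a plain dict (for example with `torch.load(path, map_location="cpu")`). The helper name `describe_checkpoint` and the list of expected keys are illustrative assumptions, not part of GKT.

```python
from typing import Any, Dict, List

# Top-level entries a full PyTorch Lightning checkpoint normally carries.
# Assumption: this list is illustrative, not exhaustive.
TRAINER_STATE_KEYS = ("epoch", "global_step", "optimizer_states", "lr_schedulers")


def describe_checkpoint(ckpt: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize which pieces a loaded checkpoint dict contains."""
    missing: List[str] = [k for k in TRAINER_STATE_KEYS if k not in ckpt]
    has_weights = "state_dict" in ckpt
    return {
        "has_weights": has_weights,
        "missing_trainer_state": missing,
        # A weights-only file cannot restore trainer progress, but its
        # weights can still be loaded into a model directly.
        "resumable": has_weights and not missing,
    }


if __name__ == "__main__":
    # Stand-in for a weights-only file (hypothetical contents).
    weights_only = {"state_dict": {"backbone.layer.weight": [0.0]}}
    print(describe_checkpoint(weights_only))
```

If `resumable` is false while `has_weights` is true, one option is to load `ckpt["state_dict"]` into the model with `load_state_dict` rather than passing the file as a resume checkpoint. Whether that matches how the repo's `eval.py` is meant to be used is not confirmed in this thread.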

BigQ0710 commented 1 year ago

Hello, may I ask whether you used the full dataset or only the mini split?

BigQ0710 commented 1 year ago

I am using only the mini split, and I keep getting errors.

BigQ0710 commented 1 year ago

(Image attachment failed to upload.)

Isaac-xie commented 1 year ago

@BigQ0710 The full dataset.

XiaoqiangWu12138 commented 1 year ago

First, thank you for your reply. I ran into the error below. I am using the keyframe data, yet it still fails. How should I handle this? Thanks.

(GKT) wxq@wxq:~/GKT/segmentation$ python scripts/train.py +experiment=gkt_nuscenes_vehicle_kernel_7x1.yaml data.dataset_dir=/home/wxq/GKT/media/datasets/nuscenes data.labels_dir=/home/wxq/GKT/media/datasets/cvt_labels_nuscenes
Global seed set to 2022
Loaded pretrained weights for efficientnet-b4
[2022-12-15 09:28:58,700][torch.distributed.nn.jit.instantiator][INFO] - Created a temporary directory at /tmp/tmpvgtnko5d
[2022-12-15 09:28:58,701][torch.distributed.nn.jit.instantiator][INFO] - Writing /tmp/tmpvgtnko5d/_remote_module_non_sriptable.py
[2022-12-15 09:28:59,270][__main__][INFO] - Searching /home/wxq/GKT/segmentation/logs.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
Global seed set to 2022
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[2022-12-15 09:28:59,542][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2022-12-15 09:28:59,543][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type                 | Params
0 | backbone  | CrossViewTransformer | 1.2 M
1 | loss_func | MultipleLoss         | 0
2 | metrics   | MetricCollection     | 0

1.2 M     Trainable params
0         Non-trainable params
1.2 M     Total params
4.701     Total estimated model params size (MB)
/home/wxq/GKT/segmentation/cross_view_transformer/tabular_logger.py:36: UserWarning: Experiment logs directory /home/wxq/GKT/segmentation/logs/lightning_logs/version_7 exists and is not empty. Previous log files in this directory will be deleted when the new ones are saved!
  rank_zero_warn(
/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:486: PossibleUserWarning: Your val_dataloader's sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test/predict dataloaders.
  rank_zero_warn(
[2022-12-15 09:29:34,821][cross_view_transformer.tabular_logger][INFO] - lr-AdamW:0.000400, step:0
/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:2719: UserWarning: Using trainer.logger when Trainer is configured to use multiple loggers. This behavior will change in v1.8 when LoggerCollection is removed, and trainer.logger will return the first logger in trainer.loggers
  rank_zero_warn(
/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:44: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_warn has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
  new_rank_zero_deprecation(
/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:49: UserWarning: Invalid logger <pytorch_lightning.loggers.base.LoggerCollection object at 0x7fa5a29786a0>
  return new_rank_zero_warn(*args, **kwargs)
[2022-12-15 09:29:53,785][root][INFO] - Reducer buckets have been rebuilt in this iteration.
Error executing job with overrides: ['+experiment=gkt_nuscenes_vehicle_kernel_7x1.yaml', 'data.dataset_dir=/home/wxq/GKT/media/datasets/nuscenes', 'data.labels_dir=/home/wxq/GKT/media/datasets/cvt_labels_nuscenes']
Traceback (most recent call last):
  File "scripts/train.py", line 70, in main
    trainer.fit(model_module, datamodule=data_module, ckpt_path=ckpt_path)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._call_and_handle_interrupt(
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
    results = self._run_stage()
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
    return self._run_train()
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
    self.fit_loop.run()
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 171, in advance
    batch = next(data_fetcher)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
    batch = next(iterator)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 553, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 565, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "/home/wxq/.local/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/wxq/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/wxq/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1204, in _next_data
    return self._process_data(data)
  File "/home/wxq/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/wxq/.local/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/home/wxq/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/wxq/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/wxq/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/wxq/.local/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 416, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/home/wxq/GKT/segmentation/cross_view_transformer/data/nuscenes_dataset_generated.py", line 52, in __getitem__
    data = self.transform(data)
  File "/home/wxq/GKT/segmentation/cross_view_transformer/data/transforms.py", line 192, in __call__
    result.update(self.get_cameras(batch, **self.image_config))
  File "/home/wxq/GKT/segmentation/cross_view_transformer/data/transforms.py", line 130, in get_cameras
    image = Image.open(self.dataset_dir / image_path)
  File "/home/wxq/.local/lib/python3.8/site-packages/PIL/Image.py", line 3131, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/home/wxq/GKT/media/datasets/nuscenes/samples/CAM_FRONT_LEFT/n008-2018-05-21-11-06-59-0400__CAM_FRONT_LEFT__1526915779654917.jpg'
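
The final `FileNotFoundError` means a sample refers to an image that is not on disk under `data.dataset_dir`. Below is a minimal sketch for checking this before training, assuming you can collect the relative image paths that the labels reference. The helper `find_missing_images` and the sample paths are hypothetical, not part of the GKT code.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_missing_images(dataset_dir: str, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not exist as files under dataset_dir."""
    root = Path(dataset_dir)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


if __name__ == "__main__":
    # Build a throwaway directory tree that mimics samples/<camera>/<file>.
    with tempfile.TemporaryDirectory() as tmp:
        cam_dir = Path(tmp) / "samples" / "CAM_FRONT_LEFT"
        cam_dir.mkdir(parents=True)
        (cam_dir / "present.jpg").touch()

        wanted = [
            "samples/CAM_FRONT_LEFT/present.jpg",
            "samples/CAM_FRONT_LEFT/absent.jpg",
        ]
        # Only the file that was never created is reported.
        print(find_missing_images(tmp, wanted))
```

A long list of missing files may indicate that the labels were generated for a different split than the images present on disk, for example labels for the full dataset paired with only the mini images.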

GoroYeh56 commented 1 year ago

@Isaac-xie Did you use the pretrained model provided by the author? This is my training result:

JacksonVation commented 1 year ago

@GoroYeh56 Thank you for sharing. How did you use the pretrained model? I assumed it would be loaded automatically once I downloaded it, but my training results are very low, so I suspect the pretrained model was not loaded during my training.
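
One way to test the suspicion that pretrained weights were not applied is to compare the parameter names stored in the checkpoint with the names the model expects. Below is a minimal sketch that works on plain lists of names; in practice these would come from `model.state_dict().keys()` and from the checkpoint's `state_dict`. The helper `key_overlap` and the sample names are hypothetical.

```python
from typing import Dict, Iterable, List


def key_overlap(model_keys: Iterable[str], ckpt_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Split parameter names into matched, missing, and unexpected groups."""
    model_set, ckpt_set = set(model_keys), set(ckpt_keys)
    return {
        "matched": sorted(model_set & ckpt_set),
        "missing_in_ckpt": sorted(model_set - ckpt_set),
        "unexpected_in_ckpt": sorted(ckpt_set - model_set),
    }


if __name__ == "__main__":
    # Hypothetical names: the checkpoint lacks the "backbone." prefix.
    model_names = ["backbone.encoder.weight", "backbone.decoder.weight"]
    ckpt_names = ["encoder.weight", "decoder.weight"]
    report = key_overlap(model_names, ckpt_names)
    print(len(report["matched"]), "matched;", len(report["missing_in_ckpt"]), "missing")
```

An empty `matched` list typically points to a naming or prefix mismatch. In that situation a non-strict `load_state_dict` call would skip every entry and leave the model at its initial weights.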

linlion0311 commented 1 year ago

@Isaac-xie @GoroYeh56 @JacksonVation Hello, why are my training metrics different from yours? Some of them are missing, e.g., "iou_with_occlusions". I haven't modified the validation code.


hustvl / GKT

Unable to reproduce results #7
