nomewang / M3DM

MIT License

No 'detect_fuser' while training the UFF model #1

Closed: alex-costanzino closed this issue 1 year ago

alex-costanzino commented 1 year ago

Hi, first of all many thanks for publishing your very interesting work.

I have a problem with the first phase of training and testing ("Train and test the double lib version and save the feature for UFF training"), in particular with the training of Decision Layer Fusion with DINO+Point_MAE.

The stack trace I receive is the following:

Training Dicision Layer Fusion for DINO+Point_MAE on class bagel...
Traceback (most recent call last):
  File "main.py", line 124, in <module>
    run_3d_ads(args)
  File "main.py", line 20, in run_3d_ads
    model.fit(cls)
  File "/media/data/alex/3d_anomaly_detection/M3DM/m3dm_runner.py", line 72, in fit
    method.run_late_fusion()
  File "/media/data/alex/3d_anomaly_detection/M3DM/feature_extractors/features.py", line 176, in run_late_fusion
    self.detect_fuser.fit(self.s_lib)
  File "/home/alex/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DoubleRGBPointFeatures' object has no attribute 'detect_fuser'

Do you have any idea on what could be the problem?

nomewang commented 1 year ago

If you are trying to use the UFF module, add --use_uff to your python command. Besides, I noticed there was a bug in the previous version; using the new feature_extractors/features.py may help you.

alex-costanzino commented 1 year ago

Many thanks for the quick reply.

Actually, I wasn't trying to use the UFF module; I was following the steps that you listed in the README. As far as I understand, the training of the UFF happens in the second step (Train the UFF).

By the way, with the new features.py I managed to execute the first step, but now I have a problem with the second one (Train the UFF). The stack trace is the following:

/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
Traceback (most recent call last):
  File "fusion_pretrain.py", line 24, in <module>
    import data.dataset
ModuleNotFoundError: No module named 'data.dataset'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 100591) of binary: /home/alex/torch/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
fusion_pretrain.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-27_10:21:18
  host      : hades
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 100591)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I think it could be a version conflict, since the module is actually installed.
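
Separately from the import failure, the FutureWarning at the top of this log recommends reading the local rank from `os.environ['LOCAL_RANK']` rather than from a `--local_rank` argument. A minimal sketch of a helper that accepts both is given here; the function name `resolve_local_rank` and the fallback to rank 0 are assumptions for illustration, not code from this repository.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from --local_rank if given, else from LOCAL_RANK.

    Hypothetical helper: the name and the default of 0 are assumptions.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", type=int, default=None)
    # parse_known_args ignores the script's other flags instead of failing.
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank is not None:
        return args.local_rank
    # torchrun exports LOCAL_RANK for every worker process.
    return int(environ.get("LOCAL_RANK", 0))
```

With a helper like this, the same script can be started through the deprecated `torch.distributed.launch` or through `torchrun` without changing how the rank is looked up.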

nomewang commented 1 year ago

This is caused by a problem in our code. Try changing data.dataset to dataset, or use the new fusion_pretrain.py.
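
For context, `import data.dataset` only succeeds if Python can find a package named `data` that contains a submodule `dataset`, so a module installed under a similar name elsewhere does not satisfy it. A quick way to check what resolves in a given environment is sketched below; the helper name `can_import` is hypothetical.

```python
import importlib.util


def can_import(name):
    """Report whether a dotted module name resolves, without running the script.

    Hypothetical helper for illustration.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `data`) cannot be found at all.
        return False


if __name__ == "__main__":
    # Run from the repository root to compare the two spellings.
    for candidate in ("data.dataset", "dataset"):
        print(candidate, "->", can_import(candidate))
```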

alex-costanzino commented 1 year ago

Unfortunately, even after changing fusion_pretrain.py I get a similar error:

/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
| distributed init (rank 0): env://, gpu 0
[11:08:58.402076] job dir: /media/data/alex/3d_anomaly_detection/M3DM
[11:08:58.402148] Namespace(accum_iter=16,
batch_size=16,
blr=0.002,
data_path='datasets/patch_lib',
device='cuda',
dist_backend='nccl',
dist_on_itp=False,
dist_url='env://',
distributed=True,
epochs=10,
gpu=0,
input_size=224,
local_rank=0,
log_dir='./output_dir',
lr=0.003,
min_lr=0.0,
num_workers=10,
output_dir='checkpoints',
pin_mem=True,
rank=0,
resume='',
seed=0,
start_epoch=0,
warmup_epochs=1,
weight_decay=1.5e-06,
world_size=1)
Traceback (most recent call last):
  File "fusion_pretrain.py", line 201, in <module>
    main(args)
  File "fusion_pretrain.py", line 110, in main
    dataset_train = data.dataset.PreTrainTensorDataset(args.data_path)
NameError: name 'data' is not defined
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 106060) of binary: /home/alex/torch/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
fusion_pretrain.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-27_11:09:01
  host      : hades
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 106060)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

nomewang commented 1 year ago

Changing data.dataset.PreTrainTensorDataset(args.data_path) to dataset.PreTrainTensorDataset(args.data_path) may fix it.

PS: This repository is still in the test phase, so there may be some unexpected bugs. We will check the full code as soon as we can.
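
The NameError in the log above follows from how Python binds imported names: `import dataset` binds only the name `dataset`, so a call site that still says `data.dataset...` refers to a name that was never defined. The sketch below reproduces this with a stand-in module; the placeholder class body and the two `build_*` functions are assumptions for illustration and do not reflect the real implementation.

```python
import sys
import types

# Hypothetical stand-in for the repository's top-level dataset.py,
# registered so that `import dataset` succeeds without the real file.
_fake = types.ModuleType("dataset")


class PreTrainTensorDataset:
    """Placeholder class; it only records the path it was given."""

    def __init__(self, data_path):
        self.data_path = data_path


_fake.PreTrainTensorDataset = PreTrainTensorDataset
sys.modules["dataset"] = _fake

import dataset  # binds only the name `dataset`, not `data`


def build_old_style(path):
    # Old call site: `data` was never bound by any import statement.
    return data.dataset.PreTrainTensorDataset(path)


def build_new_style(path):
    # Updated call site, matching the updated import.
    return dataset.PreTrainTensorDataset(path)


try:
    build_old_style("datasets/patch_lib")
    old_error = None
except NameError as exc:
    old_error = str(exc)

train_set = build_new_style("datasets/patch_lib")
```

The import line and every call site need to change together; updating only one of them moves the failure from import time to the first call.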

alex-costanzino commented 1 year ago

Now it seems to be working, many thanks again. I'll let you know if I find other bugs, in case this helps the test phase.

I have another question: to use the Eyecandies dataset, is it sufficient to pass a different parameter, or is another dataloader needed? I see that the folder structure is quite different from that of MVTec 3D-AD.

nomewang commented 1 year ago

Thanks for trying our code, we are happy to receive your feedback!

The code for Eyecandies is already released in this repository; we will add instructions later. In the meantime, you can refer to utils/preprocess_eyecandies.py to pre-process the data and pass --dataset_type eyecandies during training with your own hyper-parameters.

alex-costanzino commented 1 year ago

I think there's a problem with multiple_features.py with Eyecandies, because it seems that the patches that are passed to the coreset are always empty. Even after preprocessing the dataset I get the following error:

Running coreset for DINO+Point_MAE on class CandyCane...
Traceback (most recent call last):
  File "main.py", line 124, in <module>
    run_3d_ads(args)
  File "main.py", line 20, in run_3d_ads
    model.fit(cls)
  File "/media/data/alex/3d_anomaly_detection/M3DM/m3dm_runner.py", line 57, in fit
    method.run_coreset()
  File "/media/data/alex/3d_anomaly_detection/M3DM/feature_extractors/multiple_features.py", line 485, in run_coreset
    self.patch_xyz_lib = torch.cat(self.patch_xyz_lib, 0)

nomewang commented 1 year ago

This error is usually caused by a dataset problem. Modifying the utils/preprocess_eyecandies.py file may help, since the path used in this file may not match your dataset path. The processed Eyecandies dataset has a structure similar to that of the MVTec 3D-AD dataset.

alex-costanzino commented 1 year ago

No, the dataset path in the *.py was fine. The problem was that I also needed to specify, via the --dataset_path argument, where Eyecandies is located (it seems that there is no default path for this case).
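
One way to surface this kind of misconfiguration earlier is to reject the missing path at argument-parsing time. The sketch below assumes a launcher with the two flags mentioned in this thread; the `mvtec3d` default, the list of choices, and the function name `parse_args` are assumptions and may not match the repository's main.py.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical launcher sketch")
    # Flag names come from the thread; the choices and defaults are assumptions.
    parser.add_argument("--dataset_type", default="mvtec3d",
                        choices=["mvtec3d", "eyecandies"])
    parser.add_argument("--dataset_path", default=None)
    args = parser.parse_args(argv)

    # Fail at startup rather than later, when no patches have been collected.
    if args.dataset_type == "eyecandies" and args.dataset_path is None:
        parser.error("--dataset_path is required when --dataset_type is eyecandies")
    return args
```

With such a check, running with `--dataset_type eyecandies` alone would exit immediately with a usage message instead of failing later inside run_coreset.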