If you are trying to use the UFF module, add `--use_uff` to your Python command.
Besides, I noticed there was a bug in the previous version; using the new `feature_extractors/features.py` may help you.
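For example, something roughly like this (a sketch only: `main.py` and any other flags stand for whatever command you already run for this step, the point is just to append `--use_uff`):

```bash
# Rough sketch: append --use_uff to the command you already use for this step.
# main.py is assumed here; keep your other flags as they are.
python main.py --use_uff
```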
Many thanks for the quick reply.
Actually I wasn't trying to use the UFF module, I was trying to follow the steps listed in the README. As far as I understand, training the UFF is the second step (Train the UFF).
By the way, with the new `features.py` I managed to execute the first step, but now I have a problem with the second one (Train the UFF). The stack trace is the following:
```
/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
Traceback (most recent call last):
File "fusion_pretrain.py", line 24, in <module>
import data.dataset
ModuleNotFoundError: No module named 'data.dataset'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 100591) of binary: /home/alex/torch/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
fusion_pretrain.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-27_10:21:18
host : hades
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 100591)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
I think it could be a version conflict, since the module is actually installed.
This is caused by a problem in our code. Try changing `data.dataset` to `dataset`, or use the new `fusion_pretrain.py`.
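Concretely, the import at the top of `fusion_pretrain.py` would change roughly like this (assuming a top-level `dataset.py` sitting next to `fusion_pretrain.py`):

```python
# Old import, which fails with ModuleNotFoundError when data/ is not an importable package:
# import data.dataset

# Replacement, assuming a top-level dataset.py module:
import dataset
```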
Unfortunately, even after changing `fusion_pretrain.py` I get a similar error:
```
/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
| distributed init (rank 0): env://, gpu 0
[11:08:58.402076] job dir: /media/data/alex/3d_anomaly_detection/M3DM
[11:08:58.402148] Namespace(accum_iter=16,
batch_size=16,
blr=0.002,
data_path='datasets/patch_lib',
device='cuda',
dist_backend='nccl',
dist_on_itp=False,
dist_url='env://',
distributed=True,
epochs=10,
gpu=0,
input_size=224,
local_rank=0,
log_dir='./output_dir',
lr=0.003,
min_lr=0.0,
num_workers=10,
output_dir='checkpoints',
pin_mem=True,
rank=0,
resume='',
seed=0,
start_epoch=0,
warmup_epochs=1,
weight_decay=1.5e-06,
world_size=1)
Traceback (most recent call last):
File "fusion_pretrain.py", line 201, in <module>
main(args)
File "fusion_pretrain.py", line 110, in main
dataset_train = data.dataset.PreTrainTensorDataset(args.data_path)
NameError: name 'data' is not defined
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 106060) of binary: /home/alex/torch/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/alex/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
fusion_pretrain.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-27_11:09:01
host : hades
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 106060)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
Maybe changing `data.dataset.PreTrainTensorDataset(args.data_path)` to `dataset.PreTrainTensorDataset(args.data_path)` will work.
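So, matching the new import, the call site would simply become:

```python
# uses the top-level dataset module instead of the old data.dataset package path
dataset_train = dataset.PreTrainTensorDataset(args.data_path)
```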
PS: This repo is still in the test phase, and there may be some unexpected bugs. We will check the full code as soon as we can.
Now it seems to be working, many thanks again. I'll let you know if I find other bugs, in case that helps with the test phase.
I have another question: to use the Eyecandies dataset, is it sufficient to pass a different parameter, or is another dataloader needed? I see that the folder structure is really different from that of MVTec 3D-AD.
Thanks for trying our code, we are happy to receive your feedback!
The code for Eyecandies is already released in this repo; we will provide the instructions later. Alternatively, you can refer to `utils/preprocess_eyecandies.py` to pre-process the data and use the `--dataset_type eyecandies` option during training with your own hyper-parameters.
I think there is a problem with `multiple_features.py` on Eyecandies, because it seems that the patches passed to the coreset are always empty. Even after preprocessing the dataset I get the following error:
```
Running coreset for DINO+Point_MAE on class CandyCane...
Traceback (most recent call last):
File "main.py", line 124, in <module>
run_3d_ads(args)
File "main.py", line 20, in run_3d_ads
model.fit(cls)
File "/media/data/alex/3d_anomaly_detection/M3DM/m3dm_runner.py", line 57, in fit
method.run_coreset()
File "/media/data/alex/3d_anomaly_detection/M3DM/feature_extractors/multiple_features.py", line 485, in run_coreset
self.patch_xyz_lib = torch.cat(self.patch_xyz_lib, 0)
```
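The traceback is cut off here, but for context, `torch.cat` raises immediately when handed an empty list, which matches the patch library never being filled:

```python
import torch

# Minimal illustration of the failure mode seen above: an empty patch library
# makes torch.cat raise ("expected a non-empty list of Tensors").
patch_lib = []           # nothing was ever appended during feature extraction
torch.cat(patch_lib, 0)  # RuntimeError
```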
This error is usually caused by a dataset problem. Maybe modifying the `utils/preprocess_eyecandies.py` file can help you; the path used in this file may not match your dataset path. The processed Eyecandies dataset has a structure similar to the MVTec 3D-AD dataset.
No, the dataset path in the `.py` file was fine. The problem was that I also needed to specify where Eyecandies is located via the `--dataset_path` argument (it seems there is no default path for this case).
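For reference, the invocation that ended up working for me looked roughly like this (the path is just a placeholder for wherever the pre-processed Eyecandies data lives, I am assuming the same `main.py` entry point as above, and the usual training flags are omitted):

```bash
# --dataset_type selects the Eyecandies loader, --dataset_path points to the pre-processed data.
# /path/to/eyecandies is a placeholder; add your other training flags as needed.
python main.py --dataset_type eyecandies --dataset_path /path/to/eyecandies
```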
Hi, first of all, many thanks for publishing your very interesting work.
I have a problem with the first phase of training and testing (Train and test the double lib version and save the features for UFF training), in particular with the training of Decision Layers Fusion with DINO+Point_MAE.
The stack trace I receive is the following:
Do you have any idea of what the problem could be?