tudelft3d / PSSNet

PSSNet: Planarity-sensible Semantic Segmentation of Large-scale Urban Meshes
GNU General Public License v3.0
37 stars 6 forks

Error during training and prediction #8

Closed wudiliyao closed 6 months ago

wudiliyao commented 6 months ago

I downloaded the pretrained model and tried to run the test, but it reports that the torch sizes do not match. I read the paper and found that the ecc size is 64. Why does the loaded pre-trained model expect 704?

```
Traceback (most recent call last):
  File "learning/pssnet_main.py", line 518, in <module>
    main()
  File "learning/pssnet_main.py", line 182, in main
    model, optimizer, stats = resume(args, dbinfo)
  File "learning/pssnet_main.py", line 451, in resume
    model.load_state_dict({k: checkpoint['state_dict'][k] for k in checkpoint['state_dict'] if
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1672, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Module:
	size mismatch for ecc.1.weight: copying a param with shape torch.Size([6, 704]) from checkpoint, the shape in current model is torch.Size([6, 64]).
```
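For future readers: before `load_state_dict` fails, it can help to list exactly which parameters disagree. The sketch below uses a hypothetical helper, `find_shape_mismatches`, with plain shape tuples standing in for `checkpoint['state_dict']` and `model.state_dict()` (real code would compare `tensor.shape`):

```python
# Sketch: report parameters whose checkpoint shape differs from the model's.
# find_shape_mismatches is a hypothetical helper; the dicts below stand in for
# checkpoint['state_dict'] and model.state_dict(), reduced to shape tuples.

def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {param_name: (checkpoint_shape, model_shape)} for mismatches."""
    return {
        name: (ckpt, model_shapes[name])
        for name, ckpt in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != ckpt
    }

# Mirrors the reported error: ecc.1.weight is (6, 704) in the checkpoint
# but (6, 64) in the model built from the current --model_config.
checkpoint_shapes = {'ecc.1.weight': (6, 704), 'ecc.1.bias': (6,)}
model_shapes = {'ecc.1.weight': (6, 64), 'ecc.1.bias': (6,)}
print(find_shape_mismatches(checkpoint_shapes, model_shapes))
# → {'ecc.1.weight': ((6, 704), (6, 64))}
```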

I also tried to start training again, using all the data in the SUM dataset, but an error occurred when loading the h5 file of the graph:

```
Traceback (most recent call last):
  File "learning/pssnet_main.py", line 518, in <module>
    main()
  File "learning/pssnet_main.py", line 368, in main
    acc, loss, oacc, avg_iou = train()
  File "learning/pssnet_main.py", line 210, in train
    for bidx, (targets, GIs, clouds_data) in enumerate(loader):
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torchnet/dataset/listdataset.py", line 54, in __getitem__
    return self.load(self.list[idx])
  File "/home/ricky/PSSNet/step-2/learning/pssnet_spg.py", line 218, in loader
    cloud, add_feas = load_superpoint(args, db_path + '/parsed/' + fname + '.h5', G.vs[s]['v'], train, test_seed_offset, s_label)
  File "/home/ricky/PSSNet/step-2/learning/pssnet_spg.py", line 270, in load_superpoint
    P = hf['{:d}'.format(id)]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/h5py/_hl/group.py", line 357, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '1580' doesn't exist)"
```

I haven't changed the run command or the data, which is really frustrating. I look forward to your help, thank you very much.

WeixiaoGao commented 6 months ago

Hi, Thank you for your interest in our project. Regarding your question, I am currently unable to determine the specific cause from the error message you provided. As I cannot access the server where I tested Pssnet at the moment, I'm unable to replicate the error you encountered. However, I will soon find another computer to reconfigure the necessary environment and retest the code and pre-trained model. In the meantime, I suggest you check and ensure that all output results from step1 are correct, and place the output data into the corresponding folder for step2. Theoretically, if the input data for step2 is correct, you should get the correct prediction results.

wudiliyao commented 6 months ago

Thank you very much for your kind reply. I did some checks in the code to make sure the files were being read, but I couldn't verify that they were generated correctly. My file structure is as follows:

```
datasetd/custom_set/data/train/_pcl_gcn.ply
datasetd/custom_set/pssnet_graphs/train/_graph.ply
datasetd/custom_set/features/train/_pcl_gcn.h5
datasetd/custom_set/superpoints_graphs/train/_graph.h5
datasetd/custom_set/parsed/train/_graph.ply; _pcl_gcn.ply
datasetd/custom_set/resulted/model.pth.tar
```

WeixiaoGao commented 6 months ago

Hi,

I've conducted small tests on a new machine (though not yet with the full dataset) and found no significant errors in the code of step-2. It seems likely that the issues stem from configuration errors or incorrect outputs from step 1.

For the pretrained model, please update your --model_config in pssnet_main.py to 'gru_10_1_1_1_1,f_6'. Additionally, setting --nworkers to zero could help avoid certain potential errors.
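Assuming `pssnet_main.py` is invoked as in the tracebacks above, the suggested settings could be passed on the command line like this (an illustrative sketch; any other flags and dataset paths depend on your setup):

```shell
# Illustrative invocation with the suggested settings; other flags
# and dataset paths are left to your own configuration.
python learning/pssnet_main.py --model_config 'gru_10_1_1_1_1,f_6' --nworkers 0
```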

To inspect the output from step 1: For pssnet_graphs, Meshlab can be used for visual inspection. For input point clouds and features in data, consider using Mapple for visualization. For examining 'h5' files, HDFView is a suitable tool to check the values.

If issues persist, please let me know and then I will try to upload all intermediate data, although it's quite large, possibly several gigabytes.

wudiliyao commented 6 months ago

According to your suggestion, I updated the test configuration (--model_config), but the issue still exists:

```
  0%|          | 0/12 [00:00<?, ?it/s]
loadsuperpoint test/Tile+1984_+2693_graph512
load_superpoint /home/ricky/PSSNet/step-2/partition/datasets/customset/parsed/test/Tile+1984_+2693_graph.h50
  0%|          | 0/12 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "learning/pssnet_main.py", line 518, in <module>
    main()
  File "learning/pssnet_main.py", line 425, in main
    acc_test, oacc_test, avg_iou_test, per_class_iou_test, predictions_test, avg_acc_test, confusion_matrix = eval_final()
  File "learning/pssnet_main.py", line 315, in eval_final
    for bidx, (targets, GIs, clouds_data) in enumerate(loader):
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torchnet/dataset/listdataset.py", line 54, in __getitem__
    return self.load(self.list[idx])
  File "/home/ricky/PSSNet/step-2/learning/pssnet_spg.py", line 218, in loader
    cloud, add_feas = load_superpoint(args, db_path + '/parsed/' + fname + '.h5', G.vs[s]['v'], train, test_seed_offset, s_label)
  File "/home/ricky/PSSNet/step-2/learning/pssnet_spg.py", line 270, in load_superpoint
    P = hf['{:d}'.format(id)]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/h5py/_hl/group.py", line 357, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '0' doesn't exist)"
```

I also tried to use visualization tools to view the input data, but I don't know if there is some problem with it. (Screenshots attached: the graph h5 file, and the graph_edge ply file.)

wudiliyao commented 6 months ago

And here is the pcl file (screenshot attached). I would like to know how to fix this problem; it has been troubling me deeply. Thank you very much.

WeixiaoGao commented 6 months ago

Hi, I noticed that the data loader is having issues with loading the .h5 files. This could be due to a problem with the .h5 files themselves or the path to your data. Typically, we place the datasets folder within the step-2 folder. If you've moved this folder, ensure you've also updated the relative path in the configuration file to reflect this change. Could you please verify these details? By the way, how did you generate the data in the parsed folder?

wudiliyao commented 6 months ago

> Hi, I noticed that the data loader is having issues with loading the .h5 files. This could be due to a problem with the .h5 files themselves or the path to your data. Typically, we place the datasets folder within the step-2 folder. If you've moved this folder, ensure you've also updated the relative path in the configuration file to reflect this change. Could you please verify these details? By the way, how did you generate the data in the parsed folder?

I have copied all the h5 files to the "parsed" folder, as the "pssnet_spg.py" file needs to load them. Could you please tell me more clearly about the contents of the "parsed" folder and how it should be generated?

WeixiaoGao commented 6 months ago

I believe that could be the underlying issue. To properly organize the point clouds into superpoints, you'll need to execute the script by running python learning/pssnet_custom_dataset.py. This process generates 'parsed' data, reorganizing point clouds into superpoints. Simply copying them from other folders will not suffice, as this specific organization is required.
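As a sanity check after regenerating the parsed data, one can verify that every superpoint id referenced by the graph exists as a key in the matching .h5 file, since the loader indexes it as `hf['{:d}'.format(id)]`. A minimal sketch; `missing_superpoints` is a hypothetical helper, and in practice `h5_keys` would come from `list(h5py.File(path, 'r').keys())`:

```python
# Sketch: find graph superpoint ids with no matching dataset in the parsed
# .h5 file. missing_superpoints is a hypothetical helper; h5_keys stands in
# for list(h5py.File(path, 'r').keys()).

def missing_superpoints(graph_ids, h5_keys):
    """Return the ids whose string form is absent from the .h5 keys."""
    keys = set(h5_keys)
    return [i for i in graph_ids if '{:d}'.format(i) not in keys]

# Mirrors the reported KeyError: the graph refers to superpoint 1580,
# but the file only contains keys '0' and '1'.
print(missing_superpoints([0, 1, 1580], ['0', '1']))  # → [1580]
```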

wudiliyao commented 6 months ago

With your help, I believe I have generated the h5 files correctly, and the superpoints are now loaded successfully. However, a new error occurred when I ran the test:

```
Traceback (most recent call last):
  File "learning/pssnet_main.py", line 518, in <module>
    main()
  File "learning/pssnet_main.py", line 425, in main
    acc_test, oacc_test, avg_iou_test, per_class_iou_test, predictions_test, avg_acc_test, confusion_matrix = eval_final()
  File "learning/pssnet_main.py", line 325, in eval_final
    outputs = model.ecc(embeddings)
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ricky/PSSNet/step-2/learning/../learning/graphnet.py", line 100, in forward
    input = module(input)
  File "/home/ricky/.pyenv/versions/3.7.8/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ricky/PSSNet/step-2/learning/../learning/modules.py", line 174, in forward
    self._edge_mem_limit)
  File "/home/ricky/PSSNet/step-2/learning/../learning/ecc/GraphConvModule.py", line 78, in forward
    ctx._degs_gpu.narrow(0, startd, numd))
  File "/home/ricky/PSSNet/step-2/learning/../learning/ecc/cuda_kernels.py", line 125, in conv_aggregate_fw
    function, stream = get_kernel_func('conv_aggregate_fw_kernel_v2', conv_aggregate_fw_kernel_v2(), get_dtype(src))
  File "/home/ricky/PSSNet/step-2/learning/../learning/ecc/cuda_kernels.py", line 38, in get_kernel_func
    prog = Program(ksrc, kname+dtype+'.cu')
NameError: name 'Program' is not defined
```

Could you help me solve this issue? Thank you very much.

WeixiaoGao commented 6 months ago

Please check line 38 in /learning/ecc/cuda_kernels.py, there are instructions on how to fix this issue.

wudiliyao commented 6 months ago

> Please check line 38 in /learning/ecc/cuda_kernels.py, there are instructions on how to fix this issue.

I have already tried it, but the problem still exists.

WeixiaoGao commented 6 months ago

The issue may be with 'cupy'. Please try to reinstall it for your CUDA version with pip install cupy-cudaXXX, replacing 'XXX' with your CUDA version. If problems continue, add the code below at the function's start to output diagnostic info:

```python
try:
    import cupy.cuda
    print('cupy loaded')
    from pynvrtc.compiler import Program
    print('pynvrtc loaded')
except ImportError:
    print('loading failed')
```
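For reference, a reinstall for a specific CUDA toolkit might look like the following sketch ('XXX' is a placeholder for your CUDA version, e.g. 102 for CUDA 10.2):

```shell
# Illustrative; replace 'XXX' with the CUDA version reported by nvcc.
nvcc --version                                    # check the installed CUDA release
pip uninstall -y cupy                             # remove any mismatched build first
pip install cupy-cudaXXX                          # wheel matching your CUDA version
python -c "import cupy; print(cupy.__version__)"  # confirm it imports
```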
wudiliyao commented 6 months ago

> The issue may be with 'cupy'. Please try to reinstall it for your CUDA version with `pip install cupy-cudaXXX`, replacing 'XXX' with your CUDA version. If problems continue, add the code below at the function's start to output diagnostic info:
>
> ```python
> try:
>     import cupy.cuda
>     print('cupy loaded')
>     from pynvrtc.compiler import Program
>     print('pynvrtc loaded')
> except ImportError:
>     print('loading failed')
> ```

After reinstalling 'cupy', it worked! Thanks for your step-by-step instructions.