rpautrat / SuperPoint

Efficient neural feature detector and descriptor
MIT License

Problem in training…… #238

Closed. qdtdl closed this issue 2 years ago.

qdtdl commented 2 years ago

When running the command line python3 experiment.py train configs/magic-point_shapes.yaml magic-point_synth, I get frequent errors such as Segmentation fault (core dumped) and Check failed: h != kInvalidChunkHandle. Sometimes, when executed under the same conditions, the error reports are different. What is my problem?

rpautrat commented 2 years ago

Hi, this is probably an issue with one of your libraries. Can you post the full error message, please?

qdtdl commented 2 years ago

tensorflow/core/common_runtime/bfc_allocator.cc:458] Check failed: c->in_use() && (c->bin_num == kInvalidBinNum) Aborted (core dumped)

tensorflow/core/common_runtime/bfc_allocator.cc:380] Check failed: h != kInvalidChunkHandle Aborted (core dumped)

When there is a Segmentation fault (core dumped), no other information is shown at all.

Although the messages differ, I found that all the errors appear during validation. I am using a V100. While the program runs, I observe very high memory usage. Maybe 32 GB of memory is not enough?

rpautrat commented 2 years ago

32G should be enough, at least if you keep a small batch size (e.g. 2).
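
For reference, the batch size is set in the training config. Here is a sketch of the relevant part, assuming magic-point_shapes.yaml follows the same structure as the other configs in this repo (check the exact key names in your file):

```yaml
# Excerpt of configs/magic-point_shapes.yaml (assumed structure).
# Lowering these values reduces GPU/host memory pressure.
model:
    name: 'magic_point'
    batch_size: 2        # training batch size, reduced to save memory
    eval_batch_size: 2   # validation batch size (your crashes appear during validation)
```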

What is your Tensorflow version?

qdtdl commented 2 years ago

As written in requirements.txt, it is 1.12.0. This is my pip list (Package / Version):


absl-py 0.13.0 argon2-cffi 21.1.0 astor 0.8.1 async-generator 1.10 attrs 21.2.0 backcall 0.2.0 bleach 4.1.0 certifi 2021.5.30 cffi 1.15.0 dataclasses 0.8 decorator 5.1.0 defusedxml 0.7.1 entrypoints 0.3 flake8 4.0.1 gast 0.5.2 grpcio 1.14.1 h5py 2.10.0 importlib-metadata 4.2.0 ipykernel 5.5.6 ipython 7.16.1 ipython-genutils 0.2.0 ipywidgets 7.6.5 jedi 0.18.0 Jinja2 3.0.2 jsonschema 3.2.0 jupyter 1.0.0 jupyter-client 7.0.6 jupyter-console 6.4.0 jupyter-core 4.9.0 jupyterlab-pygments 0.1.2 jupyterlab-widgets 1.0.2 Keras-Applications 1.0.8 Keras-Preprocessing 1.1.2 Markdown 3.3.4 MarkupSafe 2.0.1 mccabe 0.6.1 mistune 0.8.4 mkl-fft 1.3.0 mkl-random 1.1.1 mkl-service 2.3.0 nbclient 0.5.4 nbconvert 6.0.7 nbformat 5.1.3 nest-asyncio 1.5.1 notebook 6.4.5 numpy 1.19.5 opencv-contrib-python 3.4.2.16 opencv-python 4.5.4.58 packaging 21.0 pandocfilters 1.5.0 parso 0.8.2 pexpect 4.8.0 pickleshare 0.7.5 pip 21.2.2 prometheus-client 0.11.0 prompt-toolkit 3.0.21 protobuf 3.17.2 ptyprocess 0.7.0 pycodestyle 2.8.0 pycparser 2.20 pyflakes 2.4.0 Pygments 2.10.0 pyparsing 3.0.1 pyrsistent 0.18.0 python-dateutil 2.8.2 PyYAML 6.0 pyzmq 22.3.0 qtconsole 5.1.1 QtPy 1.11.2 scipy 1.5.2 Send2Trash 1.8.0 setuptools 58.0.4 six 1.16.0 superpoint 0.0
tensorboard 1.12.2 tensorflow 1.12.0 tensorflow-gpu 1.12.0 termcolor 1.1.0 terminado 0.12.1 testpath 0.5.0 tornado 6.1 tqdm 4.62.3 traitlets 4.3.3 typing-extensions 3.10.0.2 wcwidth 0.2.5 webencodings 0.5.1 Werkzeug 2.0.1 wheel 0.37.0 widgetsnbextension 3.5.1 zipp 3.6.0

rpautrat commented 2 years ago

I think this is an issue with TensorFlow. You can try to reinstall it, or follow the setup described in https://github.com/rpautrat/SuperPoint/issues/173#issue-730896838, which has been shown to work well for many people.
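
If you go the reinstall route, here is a minimal sketch of a clean environment (this is not the exact recipe from issue #173; adapt the CUDA/cuDNN setup to your machine, since TensorFlow 1.12 expects CUDA 9.0 and cuDNN 7):

```bash
# Sketch: fresh environment with the TensorFlow version from requirements.txt.
conda create -n superpoint python=3.6 -y
conda activate superpoint
pip install tensorflow-gpu==1.12.0
# From the root of the SuperPoint repository:
pip install -r requirements.txt
```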

qdtdl commented 2 years ago

I finished the training according to your suggestion. Thank you very much.

qdtdl commented 2 years ago

One more question: I have trained SuperPoint and got my model as a checkpoint (ckpt), but the code in match_features_demo.py loads sp_v6 as a SavedModel. I tried to convert my model to a SavedModel, but it doesn't work. What should I do?

rpautrat commented 2 years ago

Hi, what did you use to convert your model? You can use the script superpoint/export_model.py. It takes two input parameters: the config file, which allows you to load your trained model, and the name of the export that you want to create.
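
For example (the config path and export name below are placeholders; use the config you trained with):

```bash
# Sketch: export a trained checkpoint to a SavedModel that
# match_features_demo.py can load. First argument: training config,
# second argument: name of the export to create.
python3 superpoint/export_model.py configs/superpoint_coco.yaml my_sp_export
```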

qdtdl commented 2 years ago

With your help, I finally got the model I need. Thank you very much for your reply.