mercedes-benz / selfsupervised_flow

Code for Paper "Self-Supervised LiDAR Scene Flow and Motion Segmentation"
MIT License

Steps to Reproduce Results from Paper #4

Closed nchodosh closed 2 years ago

nchodosh commented 2 years ago

Hi, thanks for releasing this code! Very cool project. I'm trying to reproduce your results from the upcoming CVPR paper, but the experiments as given in the repo don't seem to work. I successfully installed all the requirements (this took some modification, as the specified TF version is now very old), built the user ops, and ran the create.py scripts to generate the datasets, but then ran into some issues when trying to run the main training.

  1. The default experiment cli.py --prod errors out immediately when the assertion 'dynamic' in eval flow types fails.
  2. I then tried adding the sota_us and sota_net arguments, which lets it actually train, but the loss is extremely unstable and the run errors out after 1-6k iterations with this:
    Invalid argument:  assertion failed: [5] [0.0162323527 0.0989187136 0.164023072...] [inf inf inf...]
         [[{{node our_pillar_model/unsupervised_loss/knn_loss_components_3/get_flow_matches_loss/bw/Assert/AssertGuard/else/_4943/our_pillar_model/unsupervised_loss/knn_loss_components_3/get_flow_matches_loss/bw/Assert/AssertGuard/Assert}}]]
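A generic first debugging step for errors like this is to scan intermediate values in order and report the first tensor containing a non-finite entry. A minimal NumPy sketch of such a helper (first_nonfinite is hypothetical, not a function from this repo):

```python
import numpy as np

def first_nonfinite(named_arrays):
    """Scan (name, array) pairs in order; report the first array containing
    inf/nan and how many bad entries it has. Crude but effective for
    localizing where a forward pass first blows up."""
    for name, arr in named_arrays:
        if not np.all(np.isfinite(arr)):
            bad_count = int((~np.isfinite(arr)).sum())
            return name, bad_count
    return None

# Toy "forward pass" with a division that can produce inf:
x = np.array([1.0, 2.0, 0.0])
y = np.array([1.0, 1.0, 1.0])
with np.errstate(divide="ignore"):
    z = y / x                                        # -> [1.0, 0.5, inf]

print(first_nonfinite([("x", x), ("y", y), ("z", z)]))  # ('z', 1)
```

In a TF graph the same idea can be applied by instrumenting suspect tensors with numeric checks, which narrows the search before diving into individual ops.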

Here's a picture of the loss curve:

[image: loss curve]

Could you please give some instructions on how to correctly set up the experiments? I would really like to include your method in my next submission.

Thanks

demmerichs commented 2 years ago

Hi @nchodosh,

thanks for your issue.

  1. Our formulation in the README might have been a bit misleading: cli.py --prod is indeed the correct entry point, but we never used it stand-alone. There are many assertions that check that all set config values work together; when they do not, the failing assertion is meant to tell you which config values conflict and why. We acknowledge that these messages might not be very helpful, as they are quite low-level, and understanding them requires knowing what the config values do and why some combinations are not allowed because they, in one way or another, do not make sense.

  2. We had to refactor a lot of code to comply with our company policy on code publication. This is also why it took us so long (over half a year after the paper was published) to release it on GitHub. Sadly, at this point we cannot guarantee that nothing broke during the refactoring, as time constraints only allowed very superficial testing, and changing the TensorFlow (and possibly other) package versions can introduce small behavioral changes that cannot be foreseen. Since we do not have the time to reproduce your training ourselves, we cannot offer much insight into such a generic symptom as occurring infinities.

That said, during all our trainings that included segmentation prediction we had many stability issues, and the trainings were never really stable to begin with (we even have a dedicated source file of helper functions, numerical_stability.py). So the curve you show does not seem way off, although we do not remember the exact scales and shapes of our loss curves. To be clear, by "stability" we mean the loss-curve fluctuations that were still present in the final versions. During development we also had problems with infinities and NaNs, but in the final runs for the paper those no longer occurred, thanks to safeguards against specific situations, e.g. all/no points masked, gradient explosion for special values, weight sums of 0, etc. If you can pinpoint the source of the first infinity in the training (forward or backward pass), feel free to open a new issue. We are confident discussing specific code segments with you and helping you get the experiment running.

Common sources of instability for us were the inference and gradient pass through the Kabsch operator (called weighted_pc_alignment in this repo) and the correct masking of points, as well as the normalization and loss computation of segmentation predictions based on the self-supervised signal. Training only the self-supervised flow with the point-adapted RAFT architecture and the KNN signal, without any Kabsch and motion segmentation, produced much more stable trainings.
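For reference, the weighted Kabsch step can be sketched in a few lines of NumPy. This is an illustrative re-derivation of the standard algorithm, not the repo's custom weighted_pc_alignment op; the eps term guards the zero-weight-sum case, which is exactly the kind of safeguard mentioned above:

```python
import numpy as np

def weighted_pc_alignment(src, tgt, weights, eps=1e-8):
    """Weighted Kabsch: rigid (R, t) minimizing sum_i w_i * ||R @ src_i + t - tgt_i||^2.

    Minimal NumPy sketch for illustration (the repo implements this as a
    custom TF op). eps guards the all-weights-zero case, one of the
    failure modes mentioned above.
    """
    w = weights / (weights.sum() + eps)                 # guard: weight sum of 0
    mu_s = (w[:, None] * src).sum(axis=0)               # weighted centroids
    mu_t = (w[:, None] * tgt).sum(axis=0)
    cov = (src - mu_s).T @ (w[:, None] * (tgt - mu_t))  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(Vt.T @ U.T))              # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t
```

With the guard in place, a fully masked input (all weights zero) yields a finite identity-like transform instead of NaNs propagating into the backward pass.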

Thanks for your questions and good luck with your project.

LYFFF666 commented 10 months ago

@nchodosh Hi, I'm very excited to see that you got the experiment running. I passed the same parameters as you, but training exceeds GPU memory; I get the warning [UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape], and the run is eventually killed because GPU memory is full. Did you have any problems with the parameters passed to tf.gather()?
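A NumPy sketch of the mechanism behind that warning (an illustration, not code from this repo): the gradient of a gather only touches the gathered rows, which TensorFlow represents sparsely as IndexedSlices; if a downstream op has no sparse-aware gradient, TF scatters the slices into a dense tensor the size of the full table, and that densification is the memory blow-up. The shapes below are hypothetical:

```python
import numpy as np

def gather_grad_sparse(indices, upstream):
    """Sparse gather gradient: just (row indices, per-row grads).
    This is the NumPy analogue of tf.IndexedSlices."""
    return indices, upstream

def densify(table_shape, indices, values):
    """What 'Converting sparse IndexedSlices to a dense Tensor' does:
    scatter-add the slices into a full table-sized dense tensor."""
    dense = np.zeros(table_shape, dtype=values.dtype)
    np.add.at(dense, indices, values)        # duplicate indices accumulate
    return dense

table_shape = (1_000_000, 64)                # e.g. a large point-feature table
idx = np.array([3, 17, 3])                   # only 3 rows were gathered
grads = np.ones((3, 64), dtype=np.float32)

_, sparse_vals = gather_grad_sparse(idx, grads)
print(sparse_vals.nbytes)                    # 768 bytes
print(densify(table_shape, idx, grads).nbytes)  # 256000000 bytes (~256 MB)
```

Common mitigations (to be verified against your setup) are reducing the batch size or point counts, or restructuring the offending op so the gradient stays sparse all the way to the variable update.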