Closed nchodosh closed 2 years ago
Hi @nchodosh,
thanks for your issue.
Our formulation in the README might have been a bit misleading: `cli.py --prod` is indeed the correct entry point, but we never used it stand-alone. We have a lot of assertions that check that all set config values work together; if they do not, an assertion tries to tell you which config values are conflicting and why. We acknowledge that those might not be super helpful, as they are quite low-level, and you need a good understanding of the config values, what they do, and why some configurations are not allowed because they, in one way or another, do not make sense.
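For reference, here is a minimal sketch of what such cross-checks can look like. The config keys and rules below are invented for illustration and are not the repo's actual names:

```python
# Hypothetical config cross-checks, illustrating the assertion style
# described above. Keys and rules are invented for this sketch.
def check_config(cfg: dict) -> None:
    if cfg.get("predict_segmentation"):
        # Example rule: segmentation derived from rigid-alignment
        # residuals cannot be enabled without the Kabsch step.
        assert cfg.get("use_kabsch"), (
            "predict_segmentation=True requires use_kabsch=True"
        )
    if cfg.get("supervised_loss") and cfg.get("knn_self_supervision"):
        raise AssertionError(
            "supervised_loss and knn_self_supervision conflict: "
            "enable only one training signal"
        )
```

The assertion message names the conflicting keys, which is the kind of low-level hint the maintainers describe above.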
We had to refactor a lot of code to comply with our company policy on code publication. This is also why it took us so long (over half a year after the paper was published) to bring it to GitHub at all. Sadly, we cannot guarantee that nothing broke during the refactoring, as time constraints only allowed very superficial testing. And of course, changing the TensorFlow and perhaps other package versions might introduce small changes in behavior that cannot be foreseen.

As we ourselves do not have the time to try to reproduce your training on our side, we cannot give insights from this high level into such a generic bug as occurring infinities. We admit that during all our trainings that included segmentation prediction, we had many stability issues, and the trainings were never really stable to begin with (we even have a dedicated source file of helper functions, numerical_stability.py), so the curve you are showing does not seem to be way off, although we cannot remember the exact scales and shapes of our loss curves. To be clear, by stability we mean the loss-curve fluctuations, which were still present in the final versions. During the development phase we also had problems with infinities and NaNs, but in the final paper-publication runs those no longer occurred, thanks to our efforts to safeguard against them in specific situations, e.g. all/no points masked, gradient explosion for special values, weight sums of 0, etc.

If you can pinpoint the source of the first infinity in the training (forward or backward pass), feel free to open a new issue. We would be glad to discuss specific code segments with you and help you get the experiment running. Common sources for us were the inference and gradient pass through the Kabsch operator (called `weighted_pc_alignment` in this repo) and the correct masking of points, as well as the normalization and loss computation of segmentation predictions based on the self-supervised signal. Just training the self-supervised flow with the point-adapted RAFT architecture and the KNN signal, without any Kabsch and motion segmentation, produced much more stable trainings.
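To illustrate the kind of safeguarding meant here, below is a minimal NumPy sketch of a weighted Kabsch (rigid point-cloud alignment) step with a guard against a zero weight sum. The function name mirrors the repo's `weighted_pc_alignment`, but the body is an assumption written for this example, not the repo's actual code:

```python
import numpy as np

def weighted_pc_alignment(src, tgt, weights, eps=1e-8):
    """Weighted Kabsch: find rigid (R, t) with tgt ~ R @ src + t.

    Hypothetical re-implementation for illustration only; the repo's
    function of the same name may differ.
    """
    w = np.asarray(weights, dtype=np.float64)
    w_sum = w.sum()
    # Safeguard: a weight sum of 0 (e.g. all points masked out)
    # would otherwise divide by zero -> fall back to identity.
    if w_sum < eps:
        return np.eye(3), np.zeros(3)
    w = w / w_sum
    mu_s = (w[:, None] * src).sum(axis=0)   # weighted centroids
    mu_t = (w[:, None] * tgt).sum(axis=0)
    # Weighted cross-covariance of the centered point sets.
    H = (src - mu_s).T @ (w[:, None] * (tgt - mu_t))
    U, _, Vt = np.linalg.svd(H)
    # Reflection fix: force det(R) = +1 so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_t - R @ mu_s
    return R, t
```

The zero-weight branch is exactly the kind of special case (all points masked) mentioned above; without it, the normalization `w / w_sum` produces NaNs that then propagate through the SVD and its gradient.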
Thanks for your questions and good luck with your project.
@nchodosh hi, I'm very excited to see you successfully run the experiment. I passed in the same parameters as you, but training exceeds GPU memory and prints a warning: `UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape`. Eventually it is terminated because GPU memory is full. Did you have a problem with the parameters passed to `tf.gather()`?
Hi, thanks for releasing this code! Very cool project. I'm trying to reproduce your results from the upcoming CVPR paper, but the experiments as given in the repo don't seem to work. I successfully installed all the requirements (this took some modification, as the specified TensorFlow version is now very old), built the user ops, ran the create.py scripts to generate the datasets, and tried to run the main training, but then had some issues.
`cli.py --prod` errors out immediately when the `assert 'dynamic' in eval_flow_types` fails. Using the `sota_us` and `sota_net` arguments lets it actually train, but the loss is extremely unstable and it errors out with an exception after 1-6k iterations. Here's a picture of the loss curve:
Could you please give some instructions on how to correctly set up the experiments? I would really like to include your method in my next submission.
Thanks