The data set used in Figure 2 expands on the data set used in Figure 1, but it seems difficult for the loss to get below -500, and it converges quickly.
The loss in Figure 1 is lower than that in Figure 2; does this indicate that the first trained model performs better?
A lower loss often indicates a better model, but definitely not always. The loss function we have is a number that we can easily calculate, but it is only a proxy for what we really care about, which is something like "does the generated motion look good?" The second bit is just harder to put a number on, so we use a standard loss function to approximate it, even though that approximation can be pretty inaccurate at times.
When training animation models, there is no substitute for comparing two models by actually looking at their output, and picking the one that looks better. If you choose your model only based on having a low loss (whether on the training or, especially, the validation data), you might end up selecting a worse model.
As a side note, this problem of training on one number even though it isn't the thing we actually care about is not unique to motion or synthesis, but is common in machine learning in general. Think about classification, for example: Why do we train classifiers to minimise the "cross-entropy loss", when we actually care about the accuracy? Same thing.
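To make this concrete, here is a toy example (the numbers are made up, not from any real model) showing that the classifier with the lower cross-entropy loss can still have the worse accuracy:

```python
# Toy illustration (made-up numbers): a model with a *lower* cross-entropy
# loss can still have a *worse* accuracy, because the loss is only a proxy.
import numpy as np

y = np.array([1, 1, 1, 1])                 # true labels (binary)
p_a = np.array([0.9, 0.9, 0.9, 0.001])     # model A: mostly confident and right, one terrible miss
p_b = np.array([0.45, 0.45, 0.45, 0.45])   # model B: always hedging just below 0.5

def cross_entropy(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(y, p):
    return np.mean((p > 0.5) == y)

print(cross_entropy(y, p_a), accuracy(y, p_a))  # ~1.81 loss, 75% accuracy
print(cross_entropy(y, p_b), accuracy(y, p_b))  # ~0.80 loss,  0% accuracy
```

Model B "wins" on the loss because it never makes a confident mistake, yet it gets every single prediction wrong.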
The data set used in Figure 2 expands on the data set used in Figure 1, but it seems difficult for the loss to get below -500, and it converges quickly.
By mixing the two datasets, you have created a trickier machine-learning problem than you had before: now there are more different things in the training data for the model to learn. This is not necessarily a bad thing. It might lead to a better model in the end, for example one that can generate the motions in both the original data and in the added data, even if that model at the same time has a worse loss than before, because learning to do everything is harder. So, once again, a higher loss does not have to imply a worse model.
That said, the degradation in loss between the two training runs (the two datasets) is larger than I might have anticipated in this case. You should check if your approach is working by visualising what your models have learnt. What does the motion output look like for the two different models? Does adding data give a better or worse visual result?
There can be many subtle differences between datasets, e.g., different skeletons, different rest pose, and so on. Adapting "raw" 3D motion-capture data for machine learning furthermore involves a lot of processing steps. If you mix two datasets that were not set up and preprocessed in the same way, many deep generative models (including MoGlow/StyleGestures) might give a much worse result than if you trained on either dataset separately.
Even things like the data normalisation and the rest pose used (which is essentially invisible when visualising the motion) can have a big impact on the final motion quality. In our experience, mixing two datasets frequently leads to worse results, unless you are really good at processing 3D motion-capture data for animation so that the two datasets are compatible.
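As one concrete (hypothetical) illustration of such a mismatch: if each dataset is standardised with its own mean and standard deviation, the "same" pose ends up in different places in feature space. Computing the normalisation statistics over the combined data avoids that particular inconsistency, although it does not fix differences in skeleton or rest pose, which need proper retargeting. The file names below are placeholders:

```python
# Hypothetical preprocessing sketch: normalise both motion datasets with
# *shared* statistics before mixing them, instead of per-dataset statistics.
import numpy as np

data_a = np.load("dataset_a_features.npy")  # assumed shape: (frames, features)
data_b = np.load("dataset_b_features.npy")  # assumed shape: (frames, features)

combined = np.concatenate([data_a, data_b], axis=0)
mean = combined.mean(axis=0)
std = combined.std(axis=0) + 1e-8           # avoid division by zero

# Apply the same mean/std to both datasets so they share one feature space.
data_a_norm = (data_a - mean) / std
data_b_norm = (data_b - mean) / std
```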
Thank you for your great work!
Thank you for your kind words. :)
That said, could you perhaps change the title of this issue to something that is more descriptive about your question/problem? Other people also use the GitHub issues to find answers to their questions, and if one looks at the list of issues for this repository, the current issue title is not really informative regarding what the issue actually is about.
Thank you for your patient reply. I have another question I would like to ask about Robust Model Training and Generalisation with Studentising Flows. In this article, you mention that it is misleading when the NLL on the held-out data starts to rise sharply under the Gaussian base distribution early in training. Will the model before convergence show better results? Does this need to be verified step by step?
Robust Model Training and Generalisation with Studentising Flows
Thank you for reading our Studentising flows paper. :)
You mention that it is misleading when the NLL on the held-out data starts to rise sharply under the Gaussian base distribution early in training.
Yes indeed. It is misleading in the sense that the NLL on the validation data gets worse but, visually, the model output is still improving. When training models for synthesis problems, the NLL on validation data is not a good performance metric, and should not be relied upon for, e.g., early stopping. I think this is true both with and without Studentising flows, although Studentising flows reduce the magnitude of the issue.
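For intuition, here is a minimal PyTorch sketch of the idea (not the code from the paper or from this repository; the latent dimensionality and degrees of freedom are arbitrary example values): Studentising flows swap the Gaussian base distribution of the flow for a heavier-tailed Student's t, so that held-out frames far out in the tails are penalised much less harshly in the NLL.

```python
# Minimal sketch (not the official code): compare how a Gaussian base
# distribution and a heavier-tailed Student's t base distribution score a
# point far out in the tails.
import torch
from torch.distributions import Normal, StudentT, Independent

latent_dim = 63          # assumed latent dimensionality (example value)
nu = 5.0                 # degrees of freedom; smaller nu = heavier tails

gaussian_base = Independent(Normal(torch.zeros(latent_dim),
                                   torch.ones(latent_dim)), 1)
student_base = Independent(StudentT(nu * torch.ones(latent_dim),
                                    torch.zeros(latent_dim),
                                    torch.ones(latent_dim)), 1)

z = 6.0 * torch.ones(latent_dim)  # an "outlier" far out in the tails
print(gaussian_base.log_prob(z))  # extremely low log-probability: a huge NLL penalty
print(student_base.log_prob(z))   # far less extreme: outliers dominate the NLL less
```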
Will the model before convergence show better results?
My apologies, but what do you mean by "convergence" in this case?
Most often when we train our models, the training-set NLL keeps improving the longer we train, similar to the curves with 500k+ training updates in the first figure you posted here. Those models have not truly converged. We usually don't have time to wait until true convergence, which would be when the training curves flatten out completely.
If you by "convergence" mean the local minimum in validation set NLL, then our experience is that the output looks better if you keep training even as the validation set NLL starts getting worse again; see the point about early stopping above.
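In practice, this means it can be safer to simply save checkpoints at fixed intervals and choose between them by visualising their output afterwards, rather than early-stopping on the validation NLL. A rough training-loop sketch follows (not the repository's actual code; the log_prob method and the update counts are assumptions for illustration):

```python
# Rough sketch: save checkpoints periodically and pick the best model later by
# looking at generated motion, instead of early-stopping on validation NLL.
import torch

def train(model, optimiser, train_loader, num_updates=600_000,
          checkpoint_every=20_000):
    step = 0
    while step < num_updates:
        for batch in train_loader:
            optimiser.zero_grad()
            nll = -model.log_prob(batch).mean()  # assumes the model exposes log_prob()
            nll.backward()
            optimiser.step()
            step += 1

            if step % checkpoint_every == 0:
                # Keep every checkpoint; judge them later by visualising samples,
                # not by the validation NLL alone.
                torch.save(model.state_dict(), f"checkpoint_{step:07d}.pt")
            if step >= num_updates:
                break
```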
Does this need to be verified step by step?
I am not completely sure what this question means. If this is about assessing how the quality of the output changes during training, then yes, the best way to know whether the model is getting better or worse is to look at the output periodically during training.
It can be difficult to tell if you are making progress just by looking at motion videos (after all, the output is random) but if you do see a difference, then you have clear evidence of what is best. Contrast this with the NLL: it is easy to measure how much the NLL is going up or down during training, but you have no clear idea of what those changes in NLL value actually mean for the visual quality of the model.
Thank you so much for answering my question perfectly.
These are the curves when I use two different data sets for training, with the same parameters. It can be roughly seen that the loss in Figure 1 is lower than that in Figure 2; does this indicate that the first trained model performs better?