rtqichen / torchdiffeq

Differentiable ODE solvers with full GPU support and O(1)-memory backpropagation.
MIT License

Wondering why you use two resblocks in the supervised learning test. #74

Closed: zlannnn closed this issue 4 years ago

zlannnn commented 5 years ago

Hi Chen, your work and code are really great.

In the supervised learning test, you compared 2 resblocks + ODE against 2 resblocks (or 2 convs) + 6 resblocks. As you have already shown, the ODE block can replace those resblocks, but you did not replace all of them: two resblocks are kept in front of the ODE. Building on your paper, Dupont proposed the Augmented ODE and pointed out that NODEs have limitations when dealing with problems that are not linearly separable. But as an Euler discretization, the residual structure can somehow avoid this issue. With/without the two resblocks, the results and training time change a lot. So what I was wondering is: are the two blocks designed to avoid the shortcoming Dupont pointed out?
Dupont proposed the Augmented ODE and did not want to use any discrete layers, but in my opinion, small structures with very few parameters and low FLOPs are acceptable in exchange for faster training.
What do you think?

Regards!

rtqichen commented 5 years ago

We wanted to learn a hidden space where modeling an ODE would be more sensible.

For instance, any high-order ODE can be written as a first-order ODE with a larger state. Any time-dependent ODE can be modeled with an autonomous ODE with one extra state. A PDE can be discretized into a very high dimensional ODE.
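
To make one of these reductions concrete, here is a minimal sketch (added for illustration, not code from this thread) of the time-dependence trick using torchdiffeq's `odeint`: append `t` itself as one extra state dimension, so the dynamics function the solver sees is autonomous. The `AutonomousWrapper` name is hypothetical.

```python
import torch
from torchdiffeq import odeint

class AutonomousWrapper(torch.nn.Module):
    """Wraps f(t, y) so the solver sees an autonomous system over (y, s)."""
    def __init__(self, f):
        super().__init__()
        self.f = f

    def forward(self, t, state):
        y, s = state[..., :-1], state[..., -1:]  # s is the stored "clock"
        dy = self.f(s, y)                        # evaluate f at the stored time
        ds = torch.ones_like(s)                  # ds/dt = 1
        return torch.cat([dy, ds], dim=-1)

# Example: dy/dt = -t * y, an explicitly time-dependent ODE.
f = lambda t, y: -t * y
state0 = torch.tensor([[1.0, 0.0]])             # initial y = 1, clock s = 0
t = torch.linspace(0.0, 2.0, 10)
ys = odeint(AutonomousWrapper(f), state0, t)    # f never reads the solver's t
```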

Since many differential equations can be rewritten in terms of / approximated by a first-order ODE, with the only difference being the size of the state, it made sense to learn the space itself that the ODE must traverse. If applied directly to the input, we would need to explicitly model time (which we had to do for continuous normalizing flows), but if we can make use of a hidden space then we can simultaneously fit an ODE and modify the hidden space so that the ODE is "simpler" or in some sense more natural.

The augmentation in the ANODE paper is really a special case of a learned hidden state. In general, we can also design a space that imposes or encourages certain types of paths.

(Btw, ODEs can't replace resblocks in input-space because there exist equations in the form of Euler "discretizations" that do not correspond to any ODE. However, the argument is more nuanced when the space itself can be learned/designed.)
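
For reference, a hedged sketch of the augmentation from the ANODE paper (Dupont et al., "Augmented Neural ODEs") mentioned above: zero-pad the input so the flow lives in a higher-dimensional space. The names `AugmentedODEFunc` and `augmented_flow` and all layer sizes are illustrative, not from either paper's code.

```python
import torch
from torchdiffeq import odeint

class AugmentedODEFunc(torch.nn.Module):
    """Dynamics over the augmented space R^(dim + aug_dim)."""
    def __init__(self, dim, aug_dim, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + aug_dim, hidden),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden, dim + aug_dim),
        )

    def forward(self, t, z):
        return self.net(z)

def augmented_flow(func, x, aug_dim):
    # Zero-pad the input: the extra dimensions give trajectories room to
    # pass around each other, which flows in the raw input space cannot do.
    z0 = torch.cat([x, torch.zeros(x.shape[0], aug_dim)], dim=-1)
    t = torch.tensor([0.0, 1.0])
    return odeint(func, z0, t)[-1]

x = torch.randn(8, 2)
out = augmented_flow(AugmentedODEFunc(dim=2, aug_dim=3), x, aug_dim=3)
```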

zlannnn commented 5 years ago

Thanks for your explanation!

For "make use of hidden space" you mentioned. Is that means we should use some tiny structures like 2 Resblocks you did in input-space to provide an easier surface for ODE to learn. Which would make the whole ODENet better, faster and avoid the unnecessary cost of any resources?

zlannnn commented 5 years ago

I recently replaced the two resblocks with structures that have fewer parameters and FLOPs for supervised learning, and they run faster with the same performance. But I was not sure why you kept the two resblocks, which is why I opened this issue.

rtqichen commented 5 years ago

I simply chose it for a first experiment, and since it's the architecture we reported in the paper, I used the same architecture for the mnist example. There's no real reason for the choice.
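
For readers landing on this issue, here is a minimal sketch of the pattern being discussed, assuming plain strided convolutions in place of the paper's two resblocks (the actual architecture is in the repo's mnist example; every layer size here is illustrative): a few discrete layers map the input into a hidden space, and the ODE block flows within that space.

```python
import torch
from torchdiffeq import odeint

class ODEFunc(torch.nn.Module):
    """Dynamics dh/dt = f(h) inside the hidden space."""
    def __init__(self, dim):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(dim, dim, 3, padding=1)
        self.conv2 = torch.nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, t, h):
        return self.conv2(torch.relu(self.conv1(torch.relu(h))))

class ODEBlock(torch.nn.Module):
    def __init__(self, func):
        super().__init__()
        self.func = func
        self.register_buffer("t", torch.tensor([0.0, 1.0]))

    def forward(self, h):
        return odeint(self.func, h, self.t)[-1]  # hidden state at t = 1

model = torch.nn.Sequential(
    # Discrete downsampling layers stand in for the two resblocks:
    # they pick the hidden space in which the ODE is then solved.
    torch.nn.Conv2d(1, 64, 3, stride=2, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, stride=2, padding=1),
    torch.nn.ReLU(),
    ODEBlock(ODEFunc(64)),                       # the continuous part
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(64, 10),
)
logits = model(torch.randn(2, 1, 28, 28))        # MNIST-shaped input
```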

zlannnn commented 5 years ago

Thanks for your patient explanation!