mstksg / backprop

Heterogeneous automatic differentiation ("backpropagation") in Haskell
https://backprop.jle.im
BSD 3-Clause "New" or "Revised" License

backprop as functor #9

Closed: vincent-hui closed this issue 5 years ago

vincent-hui commented 6 years ago

Hi,

Did you read the paper Backprop as Functor: A compositional perspective on supervised learning?

Will you implement the concepts described in this paper in Haskell with backprop?

mstksg commented 6 years ago

Hi! Thanks for the comment.

I do explore something similar in this post. As for the specific paper, the concepts are universal enough that backprop can already be used out-of-the-box to implement the Para p a b -> Learn p a b functor action, with just gradBP and some minor parameter re-arrangement.

type Para p a b = forall z. Reifies z W => BVar z p -> BVar z a -> BVar z b

data Learn p a b = Learn { implement :: p -> a -> b
                         , update    :: p -> a -> b -> p
                         , request   :: p -> a -> b -> a
                         }

-- learnRate and a BVar-based lossFunction are assumed to be in scope, along
-- with whatever numeric structure the gradient-step arithmetic needs on p and a.
paraToLearn
    :: (Backprop p, Backprop a, Backprop b)
    => Para p a b
    -> Learn p a b
paraToLearn f = Learn
    { implement = evalBP2 f
    , update    = \p x y -> p - learnRate * gradBP (\p' -> lossFunction (constVar y) (f p' (constVar x))) p
    , request   = \p x y -> x - learnRate * gradBP (\x' -> lossFunction (constVar y) (f (constVar p) x')) x
    }

And that's pretty much it! Both this library and the paper are built upon pretty universal/fundamental principles, and both are abstract enough that they can easily be translated into each other. So I don't think it's really a matter of "will you implement it?" so much as: it's already trivially implementable.
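As a quick usage sketch (the model here is hypothetical, and it assumes the learnRate and lossFunction mentioned above are in scope):

-- a one-weight linear model
linear :: Para Double Double Double
linear w x = w * x

linearLearner :: Learn Double Double Double
linearLearner = paraToLearn linear

-- implement linearLearner 2.0 3.0 == 6.0
-- update linearLearner 2.0 3.0 7.0 steps the weight so that the prediction for
-- 3.0 moves toward the target 7.0 (with a squared-error lossFunction)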

The above paraToLearn is the functor action that the paper describes, and it should obey the functor laws if we consider composition in each category:

compPara
    :: (Backprop p, Backprop q, Backprop a, Backprop b, Backprop c)
    => Para p b c
    -> Para q a b
    -> Para (p, q) a c
compPara f g (T2 p q) = f p . g q

idPara :: Para p a a
idPara _ x = x

compLearn
    :: (Backprop p, Backprop q, Backprop a, Backprop b, Backprop c)
    => Learn p b c
    -> Learn q a b
    -> Learn (p, q) a c
compLearn f g = Learn
    { implement = \(p, q) -> implement f p . implement g q
    , update    = \(p, q) x y -> ( update f p (implement g q x) y
                                 , update g q x (request f p (implement g q x) y)
                                 )
    , request   = \(p, q) x y -> request g q x (request f p (implement g q x) y)
    }

idLearn :: Learn p a a
idLearn = Learn
    { implement = \_ x -> x
    , update    = \p _ _ -> p
    , request   = \_ x _ -> x
    }

Written so that:

paraToLearn f `compLearn` paraToLearn g
    = paraToLearn (f `compPara` g)

paraToLearn idPara
    = idLearn
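To see why, here's a sketch of the implement component of the first law (the update and request components then follow from the chain rule):

-- implement (paraToLearn f `compLearn` paraToLearn g) (p, q)
--     = evalBP2 f p . evalBP2 g q                      -- unfold compLearn and paraToLearn
--     = evalBP2 (f `compPara` g) (p, q)                -- evaluation commutes with composition
--     = implement (paraToLearn (f `compPara` g)) (p, q)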

The monoidal category laws also follow straight from the paper.

mstksg commented 6 years ago

For the sake of completeness, here's the monoidal category aspect, which is also pretty important in the paper:

multPara :: Para p a b -> Para q c d -> Para (p, q) (a, c) (b, d)
multPara f g (T2 p q) (T2 x y) = T2 (f p x) (g q y)

multLearn :: Learn p a b -> Learn q c d -> Learn (p, q) (a, c) (b, d)
multLearn f g = Learn
    { implement = \(p, q) (x, y) -> (implement f p x, implement g q y)
    , update    = \(p, q) (x, y) (zx, zy) -> (update f p x zx, update g q y zy)
    , request   = \(p, q) (x, y) (zx, zy) -> (request f p x zx, request g q y zy)
    }

However, I don't think this monoidal formulation of "parallel" composition is very convenient or practical, and this is one situation where I think backprop shines. Monoidal categories are useful as a theoretical concept, but compLearn and multLearn are extremely cumbersome to use in practice unless you're working in a concatenative language like Forth.

backprop makes things much simpler by not requiring you to structure things as compositions and products; instead, it essentially lets you build Learns by writing normal Haskell functions, and the Learns are then derived automatically. You aren't required to manually specify implement, update, and request, or to stick to the comp and mult primitives as the paper requires; with backprop, these fields are all generated for you from normal Haskell functions.
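For example (a hedged sketch with a hypothetical two-layer model): rather than assembling the model from compLearn and multLearn pieces, you write the whole thing as one ordinary BVar function and convert it once:

twoLayer :: Para ((Double, Double), (Double, Double)) Double Double
twoLayer (T2 (T2 w1 b1) (T2 w2 b2)) x = w2 * tanh (w1 * x + b1) + b2

twoLayerLearner :: Learn ((Double, Double), (Double, Double)) Double Double
twoLayerLearner = paraToLearn twoLayer

All of the update and request plumbing for the composite comes for free from paraToLearn.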

mstksg commented 6 years ago

After a little more thought, I think that backprop makes the constructs in the paper somewhat obsolete for practical purposes. The paper is a good theoretical/mathematical demonstration. However, it presents Learn as a tuple:

data Learn p a b = Learn
    { implement :: p -> a -> b
    , update    :: p -> a -> b -> p
    , request   :: p -> a -> b -> a
    }

The paper requires all parts of this tuple to do its job. However, the tuple is over-specified; it's essentially a partially applied version of backprop2. In reality, we can represent the entire tuple in backprop as the single function:

type Learn p a b = forall z. Reifies z W => BVar z p -> BVar z a -> BVar z b

Which is the type that this post uses.

That single function completely represents a Learn tuple, and it has the advantage that it is simply a normal Haskell function, not a data type. So we don't need special operators, combinators, or functions to manipulate it; we can manipulate it as a normal function, with normal function composition, etc.

The paper introduces the Learn tuple because it takes the tuple to be necessary for representing a trainable model; however, all we really need to get the same power, and more, is a BVar function. In a way, backprop makes the paper obsolete for practical purposes.
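To make the "partially applied backprop2" remark concrete, here's a minimal sketch (squaredError and lossGrads are hypothetical names, and the output is fixed to a scalar loss): a single call to backprop2 produces the loss together with the gradients with respect to both the parameters and the input, which is exactly the information that update and request consume.

squaredError :: Num a => a -> a -> a
squaredError y y' = (y' - y) ^ (2 :: Int)

lossGrads
    :: (Backprop p, Backprop a)
    => (forall z. Reifies z W => BVar z p -> BVar z a -> BVar z Double)  -- model
    -> p -> a -> Double                                                  -- params, input, target
    -> (Double, (p, a))                                                  -- (loss, (dloss/dp, dloss/dx))
lossGrads f p x y = backprop2 (\p' x' -> squaredError (constVar y) (f p' x')) p x

From there, update and request are just the two components of the gradient pair with a learning-rate step applied.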

vincent-hui commented 6 years ago

Thanks a lot

zhujinxuan commented 5 years ago

Hi, sorry to trouble you with a closed issue.

I feel like there is something interestingly similar between BVar and Learn. In Fong's later paper (https://arxiv.org/abs/1903.03671), we can think of Learn as a lens between (p, a) and b, as

type Learn p a b = (p, a) -> (b, b -> (p, a))

And composition can be thought of via

convert ::  (Lens (p,q) p) -> Learn p a b -> Learn (p,q) a b

And in backprop, we can also have something similar

(forall z. Reifies z W => BVar z (p, a) -> BVar z b)
    ≅  (p, a) -> (b, b -> (p, a))
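A sketch of that correspondence, assuming I'm remembering backpropWith's current type correctly (that it returns the result along with a continuation from the output gradient back to the input gradient; this is an assumption on my part):

-- hypothetical name; toLens f x = (result, gradient of result -> gradient of input)
toLens
    :: Backprop (p, a)
    => (forall z. Reifies z W => BVar z (p, a) -> BVar z b)
    -> ((p, a) -> (b, b -> (p, a)))
toLens = backpropWith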

Perhaps there is some more interesting structure behind this analogy?

meygerjos commented 3 years ago

Also sorry to resurrect this thread, but @mstksg: backprop2 only suffices for the structure in the paper if (1) we are only looking at things in the image of the functor L_ε,e : Para -> Learn, and (2) e is the quadratic error function. Without the first assumption, update will not be expressible in terms of the gradient of the error function (which error function? an error function is only mentioned in the definition of this functor; it is not intrinsic to the category Learn, which is much more general). And if we have just the first assumption but not the second, request :: p -> a -> b -> a will have a more complicated formula, shown in the paper; it only reduces to \p x y -> x - learnRate * gradBP (\x' -> lossFunction (constVar y) (f (constVar p) x')) x in the case of a quadratic error function.