mikeizbicki / HLearn

Homomorphic machine learning

Adding neural networks to HLearn #64

Open abailly opened 9 years ago

abailly commented 9 years ago

Hi, I have been working (on and off) on a Haskell port of Google's word2vec, at first for fun and lately out of professional interest. My code is rather rudimentary and is here: https://github.com/abailly/hs-word2vec

It kinda works, at least in the sense that it outputs something (a model, PCA data, SVG graphics), but I am running into my lack of real knowledge of neural networks in particular, and of machine learning in general. I would like to cooperate with other people in order to:

Is this something that might be of interest to this project? Is my code an interesting starting point, or should I just erase it and restart from scratch using other tools/techniques?

Thanks for your help,

vapniks commented 9 years ago

I might be interested... I have a Ph.D. in machine learning and a fairly good knowledge of category theory and functional programming, and I have read through the "Bananas" etc. paper. I have a few other projects that I need to clear out of the way first, but I want to get into Haskell machine learning ASAP.

mikeizbicki commented 9 years ago

This looks like an awesome project! I've been wanting to do some more word2vec style work lately, so I'd love to help you port this to HLearn :)

This would be a two-step process:

  1. You'd first port your code to using the subhask library for the math. This should make your code a bit faster and more readable, and let you get rid of those mutable IO hacks. (Soon it will automatically improve your code's numeric stability as well, once I can work around a GHC bug :) IIRC, word2vec multiplies sparse vectors by dense matrices. This functionality isn't implemented in subhask yet, but it would be pretty easy for me to add, and I'd be happy to do that for you. I've been planning on doing it for a while, I just haven't had an excuse :) If you can point me directly to the lines of code in your project where all the math stuff is, I can look through it, make sure that all the functionality you need really is in subhask, and give you some guidance on what the first steps would look like. (There's a sketch of the sparse-times-dense idea right after this list.)
  2. Next, you'd port your code to the History monad in the HLearn library. I haven't written too much about how this monad works yet, so this is likely to require me to be a bit more directly involved. The basic idea is that we want to divide machine learning algorithms into two separate pieces of code: one for the optimization algorithm and one that is problem specific. This promotes code reuse (because the optimization code can usually be applied to other learning problems as well) and lets us use some fancy debugging features to visualize what exactly is happening within the optimization. (A toy version of this split is sketched after the next paragraph.)
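
To make the sparse-times-dense point in step 1 concrete, here is a minimal sketch in plain Haskell (using Data.Vector rather than subhask, whose actual API may differ; all names here are illustrative). Only the rows at the non-zero indices contribute, and for word2vec's one-hot inputs the product degenerates to a single row lookup:

```haskell
import qualified Data.Vector as V
import Data.List (foldl')

-- Dense matrix as a vector of rows; sparse vector as (index, value) pairs.
type Matrix = V.Vector (V.Vector Double)
type Sparse = [(Int, Double)]

-- Sparse vector times dense matrix: O(nnz * cols) instead of O(rows * cols).
sparseMulDense :: Sparse -> Matrix -> V.Vector Double
sparseMulDense sv m = foldl' (V.zipWith (+)) zeros scaledRows
  where
    zeros      = V.replicate (V.length (V.head m)) 0
    scaledRows = [ V.map (* x) (m V.! i) | (i, x) <- sv ]

-- For a one-hot input, this is just a row lookup:
--   sparseMulDense [(i, 1)] m == m V.! i
```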

The first step should be relatively straightforward and just involve Haskell knowledge. The second step is going to require thinking a bit more deeply about the underlying machine learning and so will probably be more difficult.
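
As a taste of what the step-2 split buys us, here is a toy version in plain Haskell. This is only an illustration of the design idea, not HLearn's actual History monad API: the optimizer knows nothing about the learning problem, and the problem supplies just a gradient.

```haskell
import qualified Data.Vector as V

type Vec = V.Vector Double

-- Problem-specific piece: just the gradient of the objective.
newtype Problem = Problem { grad :: Vec -> Vec }

-- Reusable piece: plain gradient descent, ignorant of the problem.
gradientDescent :: Double -> Int -> Problem -> Vec -> Vec
gradientDescent eta steps p = go steps
  where
    go 0 w = w
    go n w = go (n - 1) (V.zipWith (\wi gi -> wi - eta * gi) w (grad p w))

-- Example problem: minimize ||w||^2, whose gradient is 2w.
quadratic :: Problem
quadratic = Problem { grad = V.map (2 *) }
```

For example, `gradientDescent 0.1 100 quadratic (V.fromList [1,2,3])` walks toward the zero vector; swapping in word2vec's gradient would reuse the same optimizer unchanged, and the real History monad additionally lets you inspect what happens at each iteration.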

abailly commented 9 years ago

Hi Mike, thanks for the encouraging words! The code doing the neural network training is around here: https://github.com/abailly/hs-word2vec/blob/master/Model.hs#L171 Not pretty, I am afraid, and probably not even correct...

I had a look at SubHask; it looks impressive and a bit daunting, but I guess providing an efficient and easy-to-use math interface in Haskell comes at a price :-) I think that even completing point 1 and reaching a state where the computations are (1) more efficient and (2) easier to understand would already be a great step forward. Currently my NN code is totally ad hoc and heavily inspired by the original C code (which itself wasn't extremely elegant...). I had a look at several NN packages out there, but given my limited knowledge of the field it was hard to decide whether to base my work on one or another, so I resorted to doing the computations directly.

Thanks for your help.

mikeizbicki commented 9 years ago

I'm impressed by how well documented and easy to follow the code is :) After looking through it, it seems like it really should be a straightforward conversion. Am I correct that you're not actually using sparse vectors anywhere?

Here are some notes I took while reading through your code:

If you have any more questions, please don't hesitate to ask!

abailly commented 9 years ago

Hi Mike, thanks. Given that the domain was pretty new to me, I tried to document the code for my future self. Note that the overall structure should really be improved (module names, source directory layout, no tests...).

Actually yes: I resorted to using an IntMap after having no luck with direct matrix operations over the whole layers, which were extremely slow no matter what I tried. So I went for a less elegant but at least bearable mutable implementation...
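
Roughly, the representation looks like this (a minimal sketch; the names are made up and the real code in Model.hs differs). The point is that a training pair only touches a couple of rows, so each update is O(dim) work instead of a whole-matrix operation:

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.Vector.Unboxed as VU

-- A layer as a map from word index to its embedding row.
type Layer = IM.IntMap (VU.Vector Double)

-- Apply a gradient step to the single row a training pair touches.
updateRow :: Double -> Int -> VU.Vector Double -> Layer -> Layer
updateRow eta i g = IM.adjust (\row -> VU.zipWith (\r x -> r - eta * x) row g) i
```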

I will start with the linalg tutorial and see if I can convert my code. Given SubHask's overall ambitious goal of "replacing the Prelude", I assume I can still mix and match "Good Old Prelude" and SubHask, concentrating on the mathematical part?

Thanks again for your help; I hope I will be able to get somewhere.

FWIW, I am taking a 4-day intensive ML training course in November and the coding is supposed to be in Python. I plan to actually do the course in Haskell, hopefully using HLearn :-)

mikeizbicki commented 9 years ago

I assume I can still mix and match "Good Old Prelude" and SubHask, concentrating on the mathematical part?

The easiest way to do this is to import the Prelude qualified (import qualified Prelude as P) and then put P. everywhere GHC complains. Then you can slowly go through and update one piece at a time to SubHask.
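
Concretely, a module header for that migration looks something like this (a sketch: the module name and body are illustrative, I'm assuming subhask's umbrella SubHask module, and extensions such as RebindableSyntax may also be needed):

```haskell
{-# LANGUAGE NoImplicitPrelude #-}

module Model where

import SubHask
import qualified Prelude as P

-- Math code uses SubHask's names unqualified; anything GHC
-- complains about gets a P. prefix until it is migrated.
report :: P.IO ()
report = P.putStrLn "porting one definition at a time"
```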