keras-team / keras

Deep Learning for humans
http://keras.io/

NTM: Mentioned in readme, not found in repo. #1520

Closed by farizrahman4u 6 years ago

farizrahman4u commented 8 years ago

The Neural Turing Machine by @EderSantana is mentioned twice in the readme, but it was removed from the repo long ago.

fchollet commented 8 years ago

@EderSantana is there any way to bring it back?

farizrahman4u commented 8 years ago

Well, theoretically, all models in Theano can be ported to TensorFlow: convert op by op to the Keras backend. Wouldn't that be possible? Whatever it takes, it would be really great to have it back, as it shows off the power and capabilities of Keras.

EderSantana commented 8 years ago

I could only get it to work with the old mask/sample weight behavior (pre-backend) though.

But starting next week or so I should give it another try, since lots of changes were made to the RNN code.

I even think @wxs was trying to redo a few of his old pre-backend contributions. Has his PR been accepted already?

farizrahman4u commented 8 years ago

@EderSantana #1310 is merged. The whole mask is now passed to the rnn function instead of the masking boolean. This is very similar to how things were in the pre-TensorFlow era.

EderSantana commented 8 years ago

Thanks @farizrahman4u, I'll check it out!!!

farizrahman4u commented 8 years ago

@EderSantana Any updates?

EderSantana commented 8 years ago

@farizrahman4u let me read this carefully. I used to be able to pass the mask by hand using sample_weight; can I still do that? I've resumed working on this...

wxs commented 8 years ago

@EderSantana you can't pass a per-timestep mask any more (see #1537). I'm not sure I understand the justification for removing this, but I haven't had a chance to investigate much further.

We need per-timestep masking for our use case (as would anyone doing sequence-to-sequence learning), so I will be looking into solutions in the next few days.

EderSantana commented 8 years ago

@wxs for my problem, I have all zeros both as masking and as regular inputs (timesteps where there is "no input" but I still want the RNN to keep running). I'll try -1 in that case, but yeah, being able to pass a per-timestep mask would be awesome. Let me know when you open the PR and I'll give it a review. Thanks in advance for the work!!!

fchollet commented 8 years ago

There seems to be significant demand for per-timestep weights in addition to sample weights. I will look into re-introducing them, maybe through a different interface from sample_weights, or maybe the same one as before, but with fewer bugs ;-).


wxs commented 8 years ago

Glad to hear it :) I'm not sure there are any applications for per-timestep weights other than masking, so if, for clarity, you designed it as a separate parameter to fit that took a binary mask array instead of overloading sample_weights, I suspect that would still cover every use case.

fchollet commented 8 years ago

There is a simple, very general fix we could roll out (changing 3 lines in weighted_objective). It would make input validation more brittle, however. I will think about it for a little bit.

The nature of the fix: currently we reduce the loss array from nD to 1D (by taking the mean over every axis except the sample axis), then apply the weights, which should thus be 1D as well, i.e. there should be a 1:1 mapping between samples and weights, since the weights are sample weights. We could instead reduce the loss array to its first p axes, where p is the ndim of the weight array, by averaging only over the trailing axes. We would thus assume a 1:1 mapping between units in the first p axes of the network output and units in the weight array. This generalizes to timestep weights (p = 2), but also to more exotic possibilities with p > 2 (though I can't think of an application).

Thoughts?
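To make the proposed change concrete, here is a minimal NumPy sketch of the generalized reduction (illustrative names only, not the actual weighted_objective code):

```python
import numpy as np

def weighted_loss(loss, weights):
    """Reduce an nD per-unit loss down to the ndim of `weights`, then apply them.

    loss:    array of shape (samples, d1, ..., dk), one loss value per output unit
    weights: array matching the first p axes of `loss`
             (p=1 -> per-sample weights, p=2 -> per-timestep weights, ...)
    """
    p = weights.ndim
    # Average the loss over every axis beyond the first p, so that
    # loss and weights end up with the same shape.
    trailing_axes = tuple(range(p, loss.ndim))
    if trailing_axes:
        loss = loss.mean(axis=trailing_axes)
    # 1:1 mapping between units in the first p axes of the output
    # and units in the weight array.
    return (loss * weights).mean()

# p = 1: loss (batch, timesteps), weights (batch,)           -> classic sample weights
# p = 2: loss (batch, timesteps), weights (batch, timesteps) -> per-timestep weights
```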


fchollet commented 8 years ago

I have thought about the fix for temporal sample weights detailed above, and I have decided it is the best solution. It is now in master. Fixes your issue @wxs @EderSantana

wxs commented 8 years ago

Great! Look forward to trying this. Thanks @fchollet .

cmyr commented 8 years ago

:clap: very happy to hear

EderSantana commented 8 years ago

@fchollet but we are still assuming the sample_weights are 1D https://github.com/fchollet/keras/blob/master/keras/models.py#L443

fchollet commented 8 years ago

@EderSantana I think the only way to tackle that would be through a pretty strong assumption: weight ndim should be 2 if and only if output ndim is 3; otherwise weight ndim is 1.

Other ideas?

fchollet commented 8 years ago

An alternative would be to introduce a sample_weight_mode argument to compile that could take the value temporal. This would increase API complexity, though. I think making the sample weight ndim fully configurable (as an integer) would be more general, but also more difficult to understand for most users.
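To make the trade-off concrete, here is a hypothetical sketch of how the two options could map a compile-time setting to the expected weight ndim (names and behavior are illustrative, not an actual Keras API):

```python
def expected_sample_weight_ndim(sample_weight_mode=None):
    """Hypothetical mapping from a compile-time flag to the ndim
    expected of arrays passed as sample_weight in fit()."""
    if sample_weight_mode is None:
        return 1                    # classic per-sample weights: shape (samples,)
    if sample_weight_mode == 'temporal':
        return 2                    # per-timestep weights: shape (samples, timesteps)
    if isinstance(sample_weight_mode, int):
        return sample_weight_mode   # fully configurable: more general, harder to grasp
    raise ValueError('Unknown sample_weight_mode: %r' % sample_weight_mode)
```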

EderSantana commented 8 years ago

Maybe allow the ndim input, but use len(self.output_shape) or len(self.output_shape)-1 as the default. sample_weight multiplies the final output, so the user should know (I think?) that it has to have dimensions matching the final output, or broadcastable dimensions.

Anyway, sample_weight calls for an advanced API for complicated use cases. Make a choice here and let us simplify later as our experience and user feedback grow.

cmyr commented 8 years ago

I like @EderSantana's idea of having a default but allowing it to be passed as an argument. Coming up with a good API will depend on having a more exhaustive sense of what the use cases are, which will come through experience.

Thinking about this a bit more, maybe this is an opportunity to clarify the functional distinction between class weights and sample weights? That part of the API has always felt a bit awkward to me.

fchollet commented 8 years ago

I have a fix with extensive tests. Only one issue: it doesn't work with Theano. It works fine with TensorFlow. This is very strange, because: 1) the added code doesn't do anything exotic or complicated, it's actually very basic, and 2) Theano doesn't crash, it hangs, with no error message that could be used to diagnose the problem. This happens during the compilation step.

Check it out: https://github.com/fchollet/keras/tree/temporal_sample_weighting

fchollet commented 8 years ago

An update: it actually works in general, but the model I built for the tests does not compile with Theano. This appears to be a Theano optimizer bug, possibly linked to dimshuffle.

cmyr commented 8 years ago

okay, just checking out your branch now. Will report back.

fchollet commented 8 years ago

I have fixed the tests and will merge the branch in a few minutes (waiting for Travis). The bug isn't related to the changes in the PR.

cmyr commented 8 years ago

Yeah, this is working for me as expected in Theano.

edit: well, mostly? Loss is lower than expected; going to dig into that a little bit.

fchollet commented 8 years ago

Merged. We now support both sample-wise and timestep-wise loss weighting, with arbitrary output data shapes.
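A minimal usage sketch of timestep-wise weighting, written against a later Keras API (layer and argument names changed over time, so treat the exact spelling as an assumption rather than the 2016-era code):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

# Toy model with a 3D output: (batch, timesteps, features)
model = Sequential()
model.add(LSTM(16, return_sequences=True, input_shape=(10, 8)))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))

# 'temporal' tells compile() to expect 2D sample weights of shape (batch, timesteps)
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              sample_weight_mode='temporal')

x = np.random.random((32, 10, 8))
y = np.random.randint(0, 2, size=(32, 10, 1))
# Per-timestep weights: a weight of 0 drops that timestep from the loss entirely
w = np.random.randint(0, 2, size=(32, 10)).astype('float32')

model.fit(x, y, sample_weight=w, epochs=1, batch_size=8)
```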

EderSantana commented 8 years ago

Progress! It is almost working again. [screenshot attached: 2016-01-28, 8:00:46 PM]

farizrahman4u commented 8 years ago

@fchollet @EderSantana Good job!

wxs commented 8 years ago

@EderSantana you might want to take a look at #1593 if you're ever using categorical_crossentropy, as I think it was broken in the backend-switch for time-sequence cases.

EderSantana commented 8 years ago

@wxs I feel like something is different, at least comparing the error magnitudes with what I used to get... well, this is how the memory looks in the end. [screenshot attached: 2016-01-28, 10:36:53 PM]

In other words, it doesn't start writing early enough to generalize well.

cmyr commented 8 years ago

@EderSantana is that on wxs's branch (#1593), or before it?

I'm also having some issues with categorical_crossentropy, but @wxs's branch doesn't run for me.

wxs commented 8 years ago

@EderSantana I definitely expect differences in error magnitudes. Before, if you were using categorical cross-entropy, the error was always being averaged across all timesteps prior to timestep masking.

EderSantana commented 8 years ago

NTM uses binary_crossentropy, though, but I still see differences compared to previous implementations. It looks like the LSTMs are more powerful and overfitting, or something. I don't know yet; I'll keep experimenting with this.

EderSantana commented 8 years ago

I think we did it again, boys. But NTMs are really sensitive to hyperparameters, which Graves et al. didn't mention. I have to write a few notes about it: [screenshot attached: 2016-01-30, 10:49:39 AM]

farizrahman4u commented 8 years ago

@EderSantana Great job :+1: Note that you can PR your notes to the blog repo. I remember you had some NTM wiki that you posted as an issue and later as a wiki page. It was pretty good; brushing it up should do the job. www.github.com/fchollet/keras-blog

fchollet commented 8 years ago

@EderSantana very cool! Any plans to make a new PR for the NTM example?

I agree with @farizrahman4u, this would make a great post on the blog, if you want to release a write-up.

fferroni commented 7 years ago

@EderSantana what is the status of the NTM implementation in Keras?

EderSantana commented 7 years ago

I haven't worked on that in a long time. There is an implementation at https://github.com/EderSantana/seya. But NTMs turned out to be less useful than I expected, so I'm not working on them anymore...

fferroni commented 7 years ago

Thanks @EderSantana. If it's not so useful, any tips on what other network architectures are useful for memory (i.e. remembering things from a long time back)?

carlthome commented 7 years ago

The successor to the NTM is called the differentiable neural computer (DNC); see the paper and the corresponding blog post. My gut feeling is that for real-world applications with RNNs today, one should stick with a regular GRU and spend most of the time cleaning the training data and thinking carefully about the input and output representation instead.

Joshuaalbert commented 7 years ago

@carlthome Yes, but in the paper they leave out all the mathematical details needed to code the DNC. They report some astoundingly complex abilities of the machine. Is it silly to say the authors don't think the world is ready for the DNC? Or else why not release the mathematical bits?

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.