Open rizar opened 9 years ago
Up! @sebastien-j @orhanf @kyunghyuncho
This was assigned as a CCW ticket to @orhanf starting this week.
Nice. At some point https://github.com/bartvm/blocks/issues/586 should also be fixed to get rid of the Bidirectional copy used here. Another related ticket is https://github.com/bartvm/blocks/issues/502, where we converged on a design everybody liked; it just has to be implemented.
From the experience with the Groundhog implementation, outside people don't care about code quality as much as they want to have clear instructions on how to train the model on their data.
I will contradict my previous post, which said that code quality is not the top priority, but can we get rid of the term 'state'? This is not a 'state', this is a 'configuration'; config would be a reasonable abbreviation, I guess.
@rizar you are right, using the term state is confusing since it is already being used by the model itself. I am changing it to config as you suggested.
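To illustrate the rename, here is a minimal sketch of what a config could look like once it is separated from the model's recurrent state. The function name `get_config` and every key and value in it are hypothetical placeholders, not taken from the repository.

```python
def get_config():
    """Return the experiment configuration as a plain dictionary.

    Every key and value below is a hypothetical placeholder used only
    to show the naming; none of them comes from the actual code.
    """
    config = {}
    config['data_dir'] = './data'       # where the preprocessed files live
    config['src_vocab_size'] = 30000    # source dictionary size
    config['trg_vocab_size'] = 30000    # target dictionary size
    config['hidden_size'] = 1000        # size of the recurrent layers
    config['batch_size'] = 80           # sentence pairs per minibatch
    return config


if __name__ == '__main__':
    config = get_config()
    print(sorted(config))
```

The point is only that these values are fixed before training starts, whereas 'state' in the model refers to hidden activations that change at every time step.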
@orhanf @sebastien-j @ejls
What do you guys think of making this repo public soon? Is it mature enough? Can we reproduce at least one result from our earlier papers?
@kyunghyuncho as you know, the cost and generated samples match, but we still have some minor issues such as #22 and #23. We also have not replicated any results with Blocks yet.
Also, today we talked with @bartvm, and he will soon create a separate public repo for us to work on going public.
Is anybody working on a public reference Blocks implementation for MT?
It was assigned as a CCW ticket to @orhanf and he's put work in already (https://github.com/orhanf/blocks-examples/tree/nmt), but I guess that with all the deadlines it got stalled a bit. Should pick up again soon.
We have exchanged a couple of mails with @bartvm about the ticket, but because of the deadlines the ticket was postponed. I will start working on it this week and try to finish it up.
Here is the current status in https://github.com/orhanf/blocks-examples/tree/nmt from the last mail:
"I selected the Czech->English translation from WMT15 since it has the least messy preprocessing pipeline: only tokenization and shuffling (we did much more for WMT15, but I don't know if it is relevant/necessary for the example). I referred to the GroundHog preprocessing pipeline for tokenization and dictionary generation.
Currently there is no Fuel support/class for this dataset, so we expect everything to be put under a directory that is indicated in the config file. I don't know how to proceed on this one, but if we are going to implement it in Fuel, that might be another issue we should discuss beforehand.
And the things for TODO (also indicated in the files):
__init__.py:
- Most of the classes are extended from blocks, which can be implemented in blocks with minor changes: MainLoopDumpManagerWMT15, LoadFromDumpyWMT15, LookupFeedbackWMT15, BidirectionalWMT15, GRUInitialState.
- RemoveNotFinite should be added to CompositeRule
- Documentation is lacking for almost all classes.
main.py:
- Getting the configuration is ugly, because of some pickling issues, I could not find any other workarounds, suggestions would be great. https://github.com/orhanf/blocks-examples/blob/nmt/machine_translation/main.py#L37
- Documentation should be enhanced
sampling.py:
- BeamSearchWMT is extending BeamSearch due to the issue here: , which should be resolved.
- Documentation is lacking for almost all classes.
stream.py:
- Parametrization of ifilter should be changed accordingly, with the suggestions here: (I could not find time to work on this tho)
- MappingWithArgs is extending Mapping which should be fixed
- FilterWithArgs is a duplicate of Filter which should be fixed
- Getting configuration is also a problem here
Another VERY important thing we are dealing with is that we are still not able to get the same (or even close) BLEU scores as with the GroundHog model. Etienne trained Finnish-English models and the highest score was around 9, whereas with GroundHog we are able to reach around 12. This is also a serious issue that should be discussed."
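On the stream.py items in the TODO list above (parametrizing the filter and the pickling trouble), one common pattern is to pass a callable object that carries its own arguments instead of a lambda or closure. Instances of a module-level class can be pickled with the standard `pickle` module, while lambdas cannot. The sketch below is plain Python and does not use the Fuel or Blocks APIs; the class name `MaxLengthFilter` and the `max_len` parameter are hypothetical.

```python
import pickle


class MaxLengthFilter(object):
    """Predicate that keeps sentence pairs no longer than max_len tokens.

    Hypothetical helper for illustration only. Because it is a
    module-level class holding plain attributes, its instances can be
    pickled, unlike a lambda that closes over max_len.
    """

    def __init__(self, max_len):
        self.max_len = max_len

    def __call__(self, pair):
        source, target = pair
        return len(source) <= self.max_len and len(target) <= self.max_len


# Toy (source, target) pairs of token ids; only the first one is short
# enough on both sides.
pairs = [
    ([1, 2, 3], [4, 5]),
    ([1] * 60, [2, 3]),
    ([7, 8], [9] * 51),
]

keep = MaxLengthFilter(max_len=50)
kept = [pair for pair in pairs if keep(pair)]

if __name__ == '__main__':
    # Round-trip the predicate through pickle to show it survives.
    restored = pickle.loads(pickle.dumps(keep))
    print(len(kept), restored.max_len)
```

Whether this fits the way the existing Filter and Mapping transformers take their arguments would need to be checked against the tickets linked in this thread.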
Guys, when you have a gap in your sequence of deadlines, I think it would be nice to document this code and make it available to the general public as a successor to the Groundhog implementation. I would be happy to help you with it. Please let me know when would be an appropriate time for that.