Closed jli05 closed 8 years ago
@jli05 would you mind stating the pros and cons of this in the context of nmt and dl4mt repo here, thanks a lot.
Basically I hope this and future repos achieve more than merely as a tool for academic research.
Cons:
Pros:
Theano is one of the rare academic repos that's constantly maintained and streamlined, which can rival corporate offerings in a certain aspect. If we google for BlinkDB, the last commit was two years ago -- it remained a good concept that earned the author a paper. I believe software has wider and deeper impacts than publications today. I know this repo was initially put together as a tutorial; however I'd wish more for it and all future sequels: to be a point of reference for NMT, which showcases excellence in algorithm and brick-and-mortar code.
@jli05 Thans for sharing your view! I think this definitely makes sense.
@orhanf How about we don't fully validate the changes (as in training a full model and verifying that the same BLEU scores for all the models and language pairs we've tried), but simply make it runnable with Python3?
@jli05 Would you kindly help us move toward Python3 by making a PR for, say, one of the sessions?
Sure I'll do it in the coming 1-2 weeks. Thanks @kyunghyuncho @orhanf !
Could we confirm what's the exact workflow in data/
?
My understanding is that for a simplest setup, we could run setup_local_env.sh
. Is that sufficient? We made no call to preprocess.sh
therein?
All the sessions take in the tokenised wiki dump and its associated dictionary from wiki.tok.txt.gz
and wiki.tok.txt.gz.pkl
. Currently we don't have scripts for generating them?
@orhanf can you answer this?
@jli05 , a detailed description is provided with #53 but let me clarify this further here.
setup_local_env.sh
was the initial script provided in the repo, intended to download an example data and preprocess it (only tokenization). Later on, in order to use subword-units (bpe), another script was added (preprocess.sh
), which pre-processes existing data (tokenize, learn bpe, apply bpe and shuffle)
We now merged both functionalities into setup_local_env.sh
and made it to call preprocess.sh
optionally (with -b
flag) when you want to use bpe.
I find this repo is a reference in this domain and highly important. I just hope we could migrate to Python 3.4+ for any future repos, so that we're future-oriented and moving forward.