nyu-dl / dl4mt-tutorial

BSD 3-Clause "New" or "Revised" License
618 stars · 249 forks

Would it be possible to release code in Python 3.4+ for any future repos? #50

Closed jli05 closed 8 years ago

jli05 commented 8 years ago

I find that this repo is a reference in this domain and highly important. I just hope we can migrate to Python 3.4+ for any future repos, so that we're future-oriented and moving forward.

orhanf commented 8 years ago

@jli05 would you mind stating the pros and cons of this in the context of NMT and the dl4mt repo here? Thanks a lot.

jli05 commented 8 years ago

Basically I hope this and future repos achieve more than merely serving as a tool for academic research.

Cons:

  1. Time spent refactoring the current software/data systems to work with Python 3.4+. The amount of effort is proportional to the size of the current systems.
  2. A small learning curve, half a day at most.

Pros:

  1. Evolving with time. Python 2.7 is in its end-of-life cycle, with support ending in 2020 (PEP 373); Ubuntu is phasing out installing Python 2.7 by default. It's not hard to imagine that newer releases of RHEL will adopt similar policies.
  2. Allows contributions from other parties who develop with a modern choice of language.
  3. Allows constant maintenance of the repos, so that maintaining them is a thriving experience for whoever does it.

Theano is one of the rare academic repos that is constantly maintained and streamlined, and it can rival corporate offerings in certain respects. If we google BlinkDB, the last commit was two years ago -- it remains a good concept that earned the authors a paper. I believe software today has wider and deeper impact than publications. I know this repo was initially put together as a tutorial; however, I'd wish for more from it and all future sequels: to be a point of reference for NMT that showcases excellence both in algorithms and in brick-and-mortar code.

kyunghyuncho commented 8 years ago

@jli05 Thanks for sharing your view! I think this definitely makes sense.

@orhanf How about we don't fully validate the changes (as in training a full model and verifying that we get the same BLEU scores for all the models and language pairs we've tried), but simply make the code runnable with Python 3?

@jli05 Would you kindly help us move toward Python 3 by making a PR for, say, one of the sessions?
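For most of the session scripts, "runnable with Python 3" should amount to a handful of mechanical changes. A minimal sketch of the usual patterns that keep code working under both interpreters (the file names and call sites in the actual sessions may differ; this is illustrative, not a diff against the repo):

```python
# Common Python 2 -> 3 compatibility patterns (illustrative sketch).

# __future__ imports are no-ops on Python 3 but fix the two most
# common breakages on Python 2: print-as-statement and integer division.
from __future__ import print_function, division

# cPickle was removed in Python 3 (pickle now uses the C implementation
# automatically), so fall back gracefully.
try:
    import cPickle as pickle
except ImportError:
    import pickle


def iteritems(d):
    """dict.iteritems() is gone in Python 3; items() works on both."""
    return d.items()


# Example: iterating a vocabulary dictionary in a version-neutral way.
vocab = {'the': 2, 'cat': 3}
for word, idx in iteritems(vocab):
    print(word, idx)
```

Other things to watch for in the sessions are `xrange` (use `range`), `unicode` handling when reading corpora (open text files with an explicit `encoding`), and opening pickle files in binary mode (`'rb'`/`'wb'`).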

jli05 commented 8 years ago

Sure, I'll do it in the coming 1-2 weeks. Thanks @kyunghyuncho @orhanf !

jli05 commented 8 years ago

Could we confirm the exact workflow in data/?

My understanding is that for the simplest setup, we can just run setup_local_env.sh. Is that sufficient? Does it make no call to preprocess.sh?

All the sessions take in the tokenised wiki dump and its associated dictionary from wiki.tok.txt.gz and wiki.tok.txt.gz.pkl. Do we currently have no scripts for generating them?
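For the port itself, the part of the sessions that touches these two files is the place most likely to break under Python 3. A minimal sketch of version-safe loading (file names are the ones mentioned above; the repo's actual loading code may be structured differently):

```python
# Illustrative sketch: reading the tokenised dump and its dictionary
# in a way that works under Python 3.
import gzip
import pickle


def load_dictionary(path='wiki.tok.txt.gz.pkl'):
    # Pickles written under Python 2 contain 8-bit str objects; on
    # Python 3, pickle.load needs encoding='latin-1' to decode them.
    with open(path, 'rb') as f:
        return pickle.load(f, encoding='latin-1')


def iter_sentences(path='wiki.tok.txt.gz'):
    # Opening in text mode ('rt') with an explicit encoding yields
    # str lines on Python 3 instead of bytes.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            yield line.split()
```

The `encoding='latin-1'` argument to `pickle.load` only exists on Python 3, so a 2/3-compatible version would pass it conditionally.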

kyunghyuncho commented 8 years ago

@orhanf can you answer this?

orhanf commented 8 years ago

@jli05 , a detailed description is provided in #53, but let me clarify this further here.

setup_local_env.sh was the initial script provided in the repo, intended to download example data and preprocess it (tokenization only). Later on, in order to use subword units (BPE), another script was added (preprocess.sh), which preprocesses existing data (tokenize, learn BPE, apply BPE, and shuffle).

We have now merged both functionalities into setup_local_env.sh and made it call preprocess.sh optionally (with the -b flag) when you want to use BPE.