piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Benchmark ML frameworks on different hardware platforms #1418

Open menshikh-iv opened 7 years ago

menshikh-iv commented 7 years ago

We are very interested in a robust large-scale benchmark of the ML landscape, especially with regard to hardware, costs and implementation quality.

Short description: compare a neural network algorithm (perhaps w2v / d2v) implementation across popular frameworks, on different cloud platforms, different hardware setups (CPU/GPU), measuring various metrics such as training quality, speed, memory footprint, ease of use and relative $$$ costs.

Questions we want to answer:

Plan:

  1. Choose a model from gensim that we will compare (probably w2v, as this is the most popular model)
  2. Take the same model from other popular frameworks: TensorFlow, DeepLearning4J, the original C implementation, Spark (single node and cluster)
  3. Take a big enough corpus (e.g. Wikipedia or another publicly available corpus)
  4. Take a popular hardware provider: IBM Softlayer, AWS, SkyScale, Hetzner
  5. Choose an execution model: CPU (incl. multicore), GPU
  6. Fit a model, measure and report several metrics (see the sketch after this list):
    • Time to train
    • Peak memory usage
    • Model quality, e.g. on the standard "word analogies task"
    • Total cost of training and the complexity of setup/usage for the given hardware provider
    • Complexity of setup/usage for the given ML framework -- how difficult is it to install, run and debug
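
A minimal sketch of what that per-framework measurement could look like for the gensim leg (current gensim 4.x API; 'corpus.txt' and the training parameters are placeholders, not the final benchmark settings):

```python
import time
import resource

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.test.utils import datapath

# one whitespace-tokenized sentence per line
corpus = LineSentence('corpus.txt')

start = time.time()
model = Word2Vec(corpus, vector_size=300, window=5, min_count=5, workers=4, epochs=5)
train_time = time.time() - start

# peak resident set size of this process; ru_maxrss is reported in KB on Linux
peak_mem_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# the standard "word analogies task", bundled with gensim's test data
analogy_score, _ = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))

print(f'time to train : {train_time:.1f} s')
print(f'peak memory   : {peak_mem_mb:.0f} MB')
print(f'analogy score : {analogy_score:.3f}')
```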

The benchmark must be fully reproducible -- all scripts, data and settings must be recorded and versioned. It is also necessary to explicitly describe and set all relevant parameters, random seeds, etc. It is very important to write fully self-contained scripts for repeatable deployment. For example, you can use Docker/Ansible. Run the experiments multiple times, measuring the spread/variance of each metric.
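
As a rough sketch of the repeat-and-measure idea, assuming each framework is wrapped in a self-contained training command (the command lines below are placeholders, not scripts from this issue); the same pattern extends to memory and quality metrics:

```python
import statistics
import subprocess
import time

# hypothetical per-framework entry points; the real ones would live in the benchmark repo
COMMANDS = {
    'gensim':     ['python', 'train_gensim.py', '--seed', '42'],
    'original-c': ['./word2vec', '-train', 'corpus.txt', '-output', 'vectors.bin'],
}

def timed_run(cmd):
    """Run one training command to completion, return wall-clock time in seconds."""
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

for name, cmd in COMMANDS.items():
    times = [timed_run(cmd) for _ in range(3)]
    print(f'{name}: mean={statistics.mean(times):.1f}s  stdev={statistics.stdev(times):.1f}s')
```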

Results:

  1. Answers to the questions above, in the form of hard measurements and clear tables with summaries.
  2. A dedicated GitHub repo with all the scripts, configs and data links, so anyone can repeat the benchmarks themselves.
  3. A blog post on the RaRe site describing the setup, methodology, results and final recommendations.

manneshiva commented 7 years ago

@menshikh-iv Sounds really useful. I am interested in working on this issue.

jayantj commented 7 years ago

Other potentially useful evaluations of word embeddings (along with code) can be found here - https://github.com/mfaruqui/eval-word-vectors
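
For reference, gensim ships similar intrinsic evaluations out of the box; a small sketch with the current KeyedVectors API, where 'vectors.txt' stands in for whichever embeddings are being compared and the two test sets come bundled with gensim's test data:

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

wv = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)

# word similarity (WordSim-353) and word analogy accuracy
pearson, spearman, oov_ratio = wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
analogy_score, _ = wv.evaluate_word_analogies(datapath('questions-words.txt'))

print(f'WordSim-353 Spearman: {spearman.correlation:.3f} (OOV {oov_ratio:.1f}%)')
print(f'analogy accuracy    : {analogy_score:.3f}')
```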

souravsingh commented 7 years ago

@menshikh-iv I have finished training gensim's Word2Vec on a Google Cloud n1-highcpu instance (4-core Xeon E5, 3.6 GB RAM); it takes around 7.5 hours to train a model on the Wikipedia corpus. I will look into TensorFlow and the Word2Vec C code.
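
For context, a rough sketch of that kind of Wikipedia run with gensim's current API, assuming a downloaded enwiki dump; all parameters here are illustrative, not the settings behind the 7.5-hour figure:

```python
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# 1) one-off extraction of plain text from the compressed dump
#    (dictionary={} skips building a Dictionary we don't need here)
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})
with open('wiki.txt', 'w', encoding='utf-8') as out:
    for tokens in wiki.get_texts():
        out.write(' '.join(tokens) + '\n')

# 2) train; workers should match the number of physical cores on the instance
model = Word2Vec(LineSentence('wiki.txt'), vector_size=300, window=5,
                 min_count=5, workers=4, epochs=5)
model.save('wiki.w2v')
```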

menshikh-iv commented 7 years ago

@souravsingh @manneshiva maybe the two of you could work together?

manneshiva commented 7 years ago

It is very important to write fully self-contained scripts for repeatable deployment.

Before we even start running the benchmarks, we should focus on the setup to make everything (tests, scripts, etc.) reproducible. Using Docker seems to be the easiest way to achieve this. I have built a Docker image which will allow us to run the word2vec implementations of all the popular frameworks. I have also tested (run) it with the original C, TensorFlow-CPU, gensim and DL4J code on a small test corpus (text8). I will push the code to a repo as soon as I refactor it and write a few scripts.
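
A hedged sketch of how such a container could be driven from Python with the docker SDK (pip install docker); the image tag, entry point and mount path below are placeholders, not the actual image described here:

```python
import docker

client = docker.from_env()
logs = client.containers.run(
    image='benchmark-word2vec:latest',                      # hypothetical image tag
    command='python train_gensim.py --corpus /data/text8',  # hypothetical entry point
    volumes={'/local/data': {'bind': '/data', 'mode': 'ro'}},
    remove=True,
)
print(logs.decode('utf-8'))
```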

menshikh-iv commented 7 years ago

@manneshiva you are right, keep it up!

manneshiva commented 7 years ago

@menshikh-iv I have created a repo to address this issue. Here is the link: https://github.com/manneshiva/benchmark-word2vec-frameworks It still needs quite a few things to be finished. I'm working on it and will complete it soon.

piskvorky commented 6 years ago

@manneshiva A similar post, for inspiration: http://minimaxir.com/2017/07/cpu-or-gpu/