ropensci / textworkshop18


Performance comparisons #2

Open kbenoit opened 6 years ago

kbenoit commented 6 years ago

The idea here is to set a challenge text analysis task, distributed as a framework .Rmd or .ipynb file with an associated dataset. (I propose the Large Movie Review Dataset from http://ai.stanford.edu/~amaas/data/sentiment/.)

The purpose is to gauge how complex it is to execute the task with a given tool, as well as how well that tool performs.

The files would be named

`taskname_tool_submittername.Rmd` (or `.ipynb`)

and would include timings at each stage as output, plus a table of timings for the whole task written to a common file.
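As a sketch of what writing to a common timings file might look like (the filename, column names, and stage labels here are my assumptions, not a fixed spec), each submission could append one row per stage to a shared CSV:

```python
import csv
import os
import time

TIMINGS_FILE = "timings.csv"  # hypothetical common results file


def record_timing(task, tool, submitter, stage, seconds, path=TIMINGS_FILE):
    """Append one timing row to the shared CSV, writing a header if the file is new."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["task", "tool", "submitter", "stage", "seconds"])
        writer.writerow([task, tool, submitter, stage, f"{seconds:.3f}"])


# Time one stage of the task and record it (stage shown is a trivial stand-in).
start = time.perf_counter()
tokens = "a simple stand-in for tokenization".split()
record_timing("sentiment", "mytool", "someone", "tokenize",
              time.perf_counter() - start)
```

Because every submission appends to the same file with the same columns, the final comparison table falls out of a single read of `timings.csv`.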

Each participant could submit any number of trials using any number of different tools, and would knit each .Rmd into a .md file or, in the Jupyter case, simply execute all the code in the notebook.

The week before the contest, someone would re-run all of the files on a single machine and push the results back to the repo, so that the performance comparisons take place on the same hardware.

We could consider options for parallelization - say, single-thread versus all available threads - and perhaps run each trial multiple times and average the timings.
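A minimal sketch of the repeated-trials idea (the function name and default run count are illustrative, not proposed conventions): run the same task several times and report the mean and spread of the wall-clock timings.

```python
import statistics
import time


def time_trial(fn, n_runs=3):
    """Run fn n_runs times; return (mean, stdev) of wall-clock seconds.

    Averaging over repeated runs smooths out caching effects and
    scheduler noise between individual trials.
    """
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    spread = statistics.stdev(times) if n_runs > 1 else 0.0
    return statistics.mean(times), spread


# Example: time a trivial stand-in for a text analysis task.
mean_s, sd_s = time_trial(lambda: sum(len(w) for w in "a b c".split() * 10000))
```

The same harness could be run once with the tool restricted to a single thread and once with all available threads, giving the two parallelization conditions a comparable measurement.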