nytimes / ingredient-phrase-tagger

Extract structured data from ingredient phrases using conditional random fields
http://open.blogs.nytimes.com/2016/04/27/structured-ingredients-data-tagging/

Updates for speed and python 3 compatibility #7

Open dexteradeus opened 8 years ago

dexteradeus commented 8 years ago

As I started to use this system, I started making changes which I think could be useful to others.

  1. Updates to some scripts to improve Python 2/3 compatibility
  2. Fixed a formatting bug in the training file output to support running crf_learn with multiple threads
  3. Refactored CRF file generation to support multithreading
  4. Updated roundtrip.sh to support providing counts as command line options and to use all system cores when generating data files as well as running crf_learn
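
As a rough illustration of item 3, parallel generation of a CRF++-style data file could look like the sketch below. It is not the code from this pull request: `format_phrase` and `generate_crf_data` are hypothetical names, each phrase is assumed to arrive as a list of pre-formatted rows, and worker processes are used rather than threads because the formatting work is CPU-bound. The only format detail relied on is that CRF++ expects one token per line and an empty line between sequences.

```python
from multiprocessing import Pool


def format_phrase(rows):
    """Hypothetical worker: one row per line, then a blank line to end the sequence."""
    return "\n".join(rows) + "\n\n"


def generate_crf_data(phrases, processes=None):
    """Format every phrase in a worker pool and concatenate the results.

    Pool.map returns results in input order, so the output is the same
    regardless of how many worker processes are used.
    """
    with Pool(processes=processes) as pool:
        return "".join(pool.map(format_phrase, phrases))


if __name__ == "__main__":
    # Illustrative input only; real rows would also carry feature and label columns.
    sample = [["1", "cup", "flour"], ["2", "eggs"]]
    print(generate_crf_data(sample, processes=2), end="")
```

Passing `processes=None` lets the pool default to the number of cores reported by the operating system.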

On my system with 8 cores, roundtrip.sh ran about 7.5x faster with the provided dataset.

walkerdb commented 7 years ago

I can confirm this works when set up correctly. On Macs, the code that gets the processor count (line 4 of roundtrip.sh) fails, but it is easy to hardcode a number instead.
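
One way to avoid hardcoding the number is to ask Python for the core count, which works on both Linux and macOS. This is only a sketch and not what roundtrip.sh currently does; `core_count` is a hypothetical helper, and the shell script would need to call it (for example through `python -c`) and capture the printed value.

```python
import os


def core_count(default=1):
    """Return the number of logical cores, falling back to `default`.

    os.cpu_count() returns None when the count cannot be determined,
    so the fallback keeps callers from having to handle that case.
    """
    return os.cpu_count() or default


if __name__ == "__main__":
    print(core_count())
```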

maugch commented 7 years ago

There is an error in roundtrip.sh on line 42: it should be input_file instead of iput_file.

Also, it does not seem to generate test data: _generate_data_worker is never called. Tested on Ubuntu 16.04.