lvapeab / m4loc

Automatically exported from code.google.com/p/m4loc
GNU Lesser General Public License v3.0

wrap_tokenizer is not working with multithreaded tokenizer.perl #40

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
perl wrap_tokenizer.pm -t "perl /home/larix/moses/scripts/tokenizer/tokenizer.perl" -p "-l cs -threads 2" < try

does not work, but

perl wrap_tokenizer.pm -t "perl /home/larix/moses/scripts/tokenizer/tokenizer.perl" -p "-l cs -threads 1" < try

gives correct output.
The problem is the -threads parameter: with more than one thread it does not work (only with the default -threads 1).

Original issue reported on code.google.com by xhu...@gmail.com on 29 Apr 2013 at 3:41

GoogleCodeExporter commented 9 years ago
This issue was closed by revision c7655591f003.

Original comment by Achi...@gmail.com on 6 Sep 2013 at 4:56

GoogleCodeExporter commented 9 years ago
Accidentally closed with check-in.

Original comment by Achi...@gmail.com on 6 Sep 2013 at 4:58

GoogleCodeExporter commented 9 years ago

Original comment by Achi...@gmail.com on 6 Sep 2013 at 5:02

GoogleCodeExporter commented 9 years ago

Original comment by Achi...@gmail.com on 25 Sep 2013 at 7:38

GoogleCodeExporter commented 9 years ago
Well, the multithreaded version of tokenizer.perl works in batches: it tries to read a number of lines into an array and then processes them in separate threads. Our wrap_tokenizer.pm/tokenize_str function sends just a single line and then waits for the response (see lines 376-377). Since the multithreaded version needs more lines before it returns the first result, the two processes block each other.

Fixing this would therefore require a major modification to the logic in wrap_tokenizer.pm.
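The blocking interaction described above can be sketched with a toy model (this is illustrative Python, not the actual m4loc or Moses code; the class name and batch_size parameter are invented for the example):

```python
class BatchingTokenizer:
    """Toy model of a filter that buffers its input and only produces
    output once a full batch has arrived, the way a multithreaded
    tokenizer that reads N lines before dispatching them would behave."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []

    def feed(self, line):
        """Feed one input line; return tokenized output lines, if any."""
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            out = [l.split() for l in self.buffer]  # stand-in for real tokenization
            self.buffer = []
            return out
        # Still waiting for more input: a caller that sent one line and
        # now blocks on the response will wait forever -> deadlock.
        return []


single = BatchingTokenizer(batch_size=1)
print(single.feed("Hello world"))   # output comes back immediately

multi = BatchingTokenizer(batch_size=2)
print(multi.feed("Hello world"))    # empty: the filter wants more lines first
```

With batch_size=1 each line is answered immediately, which is why -threads 1 works; with a larger batch the single-line request/response loop in tokenize_str never receives an answer.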

Original comment by toma...@moravia.com on 14 Mar 2014 at 2:18

GoogleCodeExporter commented 9 years ago
If considering major modifications to the logic of wrap_tokenizer.pm (and also wrap_detokenizer.pm), please keep possible support for Windows in mind.
Possible ideas:
* tokenizer pre-processing and post-processing (e.g. markup placeholders)
* fixing up markup after the tokenizer mangles it (similar to fix_markup_ws.pm)
* forking the tokenizer/detokenizer, or adding an option to the main branch (see the new ignore option in the tokenizer)

Original comment by Achi...@gmail.com on 16 Mar 2014 at 4:24