ufal / treex

Treex NLP framework

treex -p & Write to=file #64

Closed ptakopysk closed 7 years ago

ptakopysk commented 7 years ago

When you use treex --parallel, you cannot use a Write block that writes everything into one file, because treex -p does not merge the outputs for you. You have to write the output of each job into its own file and then cat the files together once the whole treex run has finished.
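The "cat them together" step can be sketched in plain shell. This is a minimal illustration, not treex functionality. The temporary directory and the doc001.txt-style file names are hypothetical stand-ins for whatever per-document files the jobs actually wrote. The sketch assumes the file names are zero-padded so that sorted glob order matches document order.

```shell
#!/bin/sh
# Sketch: merge per-document outputs after a parallel run has finished.
# The directory and file names are hypothetical and created only for this demo.
set -eu

workdir=$(mktemp -d)

# Stand-ins for the files that the individual jobs would have written.
printf 'first\n'  > "$workdir/doc001.txt"
printf 'second\n' > "$workdir/doc002.txt"
printf 'third\n'  > "$workdir/doc010.txt"

# The shell expands the glob in sorted order, so zero-padded names
# keep the documents in their original sequence.
cat "$workdir"/doc*.txt > "$workdir/merged.txt"

cat "$workdir/merged.txt"
```

Without zero-padding, doc10.txt would sort before doc2.txt, so the names matter more than the cat command itself.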

This is probably fine, but it should at least be stated somewhere in the documentation or, even better, checked automatically and reported to the user immediately by a log_fatal explaining that to=file is not compatible with treex -p. Currently, treex starts the scenario without complaint and runs fine at first; only after some time do the jobs die with an uninformative error: Treex::Core::Parallel::Node::ANON("Can't use an undefined value as a symbol reference...

Please also note that when running treex as a single process, it is perfectly fine to read in inputs from a list of files and then write the outputs "merged" into one file.

martinpopel commented 7 years ago

It is supported even with treex -p, but the file must be stdout:

seq 100 | treex -p --jobs 10 Read::Sentences lines_per_doc=1 Write::Sentences > out
diff <(seq 100) out
less 001-cluster-run-*/processing_info.log

(The last line checks that all 10 machines were involved. If they were not, you can add Util::Eval doc='sleep 10' to the scenario.)

That said, I would not recommend merging the outputs of parallel jobs into one file for big parallel runs in practice. Because treex -p makes sure the order of documents is preserved, a single job that dies or freezes means that all the rest of the output is lost.

I agree that the error message could be more informative and that treex should fail as soon as possible.