joergen7 / cuneiform

Cuneiform distributed programming language
https://cuneiform-lang.org/
Apache License 2.0

INLINE tasks -- feature suggestion #43

Closed: iguberman closed this issue 4 years ago

iguberman commented 8 years ago

Allow a task to be declared inline, meaning it executes immediately on the same machine without being scheduled to a remote worker. This is useful for little utility bash scripts, or for joins of large files that aren't worth transferring to a worker somewhere just to join. Also, if there are a lot of tiny tasks, they might overwhelm the scheduler unnecessarily. (In Condor this is easy to implement: just use the local universe for inline tasks; see the sketch after the syntax examples below.)

For example:

deftask gen-xx-sequence( <items(String)> : last ) in inline bash *{

OR

deftask inline gen-xx-sequence( <items(String)> : last ) in bash *{

OR maybe?

definline gen-xx-sequence( <items(String)> : last ) in bash *{
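
For reference, this is roughly what the Condor side looks like: an HTCondor submit description that runs a job in the local universe, i.e. directly on the submit host instead of being matched to a remote execute node. This is only a sketch; the file names are placeholders.

# minimal HTCondor submit description (sketch; file names are placeholders)
universe   = local            # run directly on the submit host, no remote matchmaking
executable = join_files.sh
arguments  = part1.txt part2.txt
output     = join.out
error      = join.err
log        = join.log
queue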

Complete script example with two very different use cases:

#
# This task is way too tiny to be worthy of scheduling to a worker 
#
deftask gen-xx-sequence( <items(String)> : last ) in inline bash *{
items=`for item in \`seq 0 $last\`; do printf '%02d\n' $item; done`
}*

#
# This task is the one worth distributing on the workers
#
deftask heavy-crunch( out(File) : column1 column2 column3) in bash *{
out=processout.txt
heavy-crunch.exe $column1 $column2 $column3 > $out
}*

#
# This task, join-files, is not so small, but I want it to execute locally to avoid
# transferring what may be a bunch of huge files to a worker somewhere just to join them.
#
# This was actually a real problem in our prod environment, which I fixed by defaulting to
# the `local` universe if the total file_transfer_size is > n GB -- a very kludgy workaround,
# currently in the Java version of cf.
#
# Though I would be careful with this particular use case, as it might block the scheduler
# completely, that doesn't invalidate this feature suggestion :)
#
deftask join-files( out( File ) : <files(File)> ) in inline bash *{
out=joinout.txt
cat ${files[@]} > $out
}*

files = heavy-crunch( column1 : gen-xx-sequence(last : 4 ), column2 : gen-xx-sequence( last : 3 ), column3 : gen-xx-sequence( last : 2));
result = join-files( files : files );
result;
joergen7 commented 8 years ago

How about an additional declaration statement:

declare inline : join-files gen-xx-sequence;

This way, task definitions and the details of how to execute them are kept separate, and one can see at a glance what is inlined and what is not. Also, you could easily come up with more annotation types and still leave the workflow script uncluttered.
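
As a sketch of how that could read with the tasks from the example above (the declare statement itself is hypothetical syntax, not an existing feature):

deftask gen-xx-sequence( <items(String)> : last ) in bash *{
items=`for item in \`seq 0 $last\`; do printf '%02d\n' $item; done`
}*

deftask join-files( out( File ) : <files(File)> ) in bash *{
out=joinout.txt
cat ${files[@]} > $out
}*

# execution annotations collected in one place, separate from the task definitions
declare inline : gen-xx-sequence join-files;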

iguberman commented 8 years ago

Sounds great! In particular, I agree that more things might come up, so it's better to keep them separate. I could still declare them right next to the task if I wanted to, like below, right? Though that's a minor detail of how to organize the code; the important thing is that the functionality is actually there.

deftask gen-xx-sequence( <items(String)> : last ) in inline bash *{
items=`for item in \`seq 0 $last\`; do printf '%02d\n' $item; done`
}*
declare inline : gen-xx-sequence;
joergen7 commented 8 years ago

Placement and even order wouldn't matter, as always. And having multiple declare-inlines should also be no problem. As for the actual presence of the feature, I'll have to bring up basic Condor support in the Erlang version first.

joergen7 commented 4 years ago

This feature would break some of the assumptions a Cuneiform function makes.

If the script really is tiny (and it does not read from or write to the distributed file system), it will still be fast enough with the current Cuneiform scheduler. If you have a lot of these tiny scripts, then I suggest picking a larger granularity for your Cuneiform foreign functions; see the sketch below.
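
As an illustration of coarser granularity, the three tiny sequence-generation calls from the example above could be folded into a single foreign function, so only one small call reaches the scheduler. This is only a sketch in the same (older) syntax used earlier in this thread; in particular, binding multiple outputs at once is assumed to work as written.

#
# One foreign call generates all three columns instead of three separate tiny calls
#
deftask gen-all-sequences( <col1(String)> <col2(String)> <col3(String)> : last1 last2 last3 ) in bash *{
col1=`for i in \`seq 0 $last1\`; do printf '%02d\n' $i; done`
col2=`for i in \`seq 0 $last2\`; do printf '%02d\n' $i; done`
col3=`for i in \`seq 0 $last3\`; do printf '%02d\n' $i; done`
}*

col1 col2 col3 = gen-all-sequences( last1 : 4, last2 : 3, last3 : 2 );
files = heavy-crunch( column1 : col1, column2 : col2, column3 : col3 );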