joergen7 / cuneiform

Cuneiform distributed programming language
https://cuneiform-lang.org/
Apache License 2.0

INLINE tasks -- feature suggestion #43

Closed: iguberman closed this issue 4 years ago

iguberman commented 8 years ago

Allow a task to be declared inline, meaning it executes immediately on the same machine without being scheduled to a remote worker. This is useful for little utility bash scripts, or for joins of large files that aren't worth transferring to a worker somewhere just to join. Also, if there are a lot of tiny tasks, they might overwhelm the scheduler unnecessarily. (In Condor this is easy to implement: just use the local universe for inline tasks; see the sketch after the syntax examples below.)

For example:

deftask gen-xx-sequence( <items(String)> : last ) in inline bash *{

OR

deftask inline gen-xx-sequence( <items(String)> : last ) in bash *{

OR maybe?

definline gen-xx-sequence( <items(String)> : last ) in bash *{
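
For reference, this is roughly what the Condor side looks like: an HTCondor submit description that runs a job in the local universe, i.e. directly on the submit host instead of being matched to a remote execute node. This is only a sketch; the file names are placeholders.

# minimal HTCondor submit description (sketch; file names are placeholders)
universe   = local            # run directly on the submit host, no remote matchmaking
executable = join_files.sh
arguments  = part1.txt part2.txt
output     = join.out
error      = join.err
log        = join.log
queue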

Complete script example with two very different use cases:

#
# This task is way too tiny to be worthy of scheduling to a worker 
#
deftask gen-xx-sequence( <items(String)> : last ) in inline bash *{
items=`for item in \`seq 0 $last\`; do printf '%02d\n' $item; done`
}*

#
# This task is the one worth distributing on the workers
#
deftask heavy-crunch( out(File) : column1 column2 column3) in bash *{
out=processout.txt
heavy-crunch.exe $column1 $column2 $column3 > $out
}*

#
# This task, join-files, is not so small, but I want it to execute locally to avoid
# transferring what may be a bunch of huge files to a worker somewhere just to join them.
#
# This was actually a real problem in our prod environment, which I fixed by defaulting to
# the `local` universe if the total file_transfer_size is > n GB -- a very kludgy workaround,
# currently in the Java version of cf.
#
# Though I would be careful with this particular use case, as it might block the scheduler
# completely, that doesn't invalidate this feature suggestion :)
#
deftask join-files( out( File ) : <files(File)> ) in inline bash *{
out=joinout.txt
cat ${files[@]} > $out
}*

files = heavy-crunch( column1 : gen-xx-sequence(last : 4 ), column2 : gen-xx-sequence( last : 3 ), column3 : gen-xx-sequence( last : 2));
result = join-files( files : files );
result;
joergen7 commented 8 years ago

How about an additional declaration statement:

declare inline : join-files gen-xx-sequence;

This way, task definitions and the details of how to execute them are kept separate, and one can see at a glance what is inlined and what is not. Also, you could easily come up with more annotation types and still leave the workflow script uncluttered.
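
As a sketch of how that could read with the tasks from the example above (the declare statement itself is hypothetical syntax, not an existing feature):

deftask gen-xx-sequence( <items(String)> : last ) in bash *{
items=`for item in \`seq 0 $last\`; do printf '%02d\n' $item; done`
}*

deftask join-files( out( File ) : <files(File)> ) in bash *{
out=joinout.txt
cat ${files[@]} > $out
}*

# execution annotations collected in one place, separate from the task definitions
declare inline : gen-xx-sequence join-files;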

iguberman commented 8 years ago

Sounds great! In particular, I agree that more things might come up, so it's better to keep them separate. I could still declare them right next to the task if I wanted to, like below, right? Though that's a minor detail of how to organize the code; the important thing is that the functionality is actually there.

deftask gen-xx-sequence( <items(String)> : last ) in inline bash *{
items=`for item in \`seq 0 $last\`; do printf '%02d\n' $item; done`
}*
declare inline : gen-xx-sequence;
joergen7 commented 8 years ago

Placement and even order wouldn't matter, as always. And having multiple declare-inlines should also be no problem. As for the actual presence of the feature, I'll have to bring up basic Condor support in the Erlang version first.

joergen7 commented 4 years ago

This feature would break some of the assumptions a Cuneiform function makes.

If the script really is tiny (and it does not read from or write to the distributed file system), it will still be fast enough with the current Cuneiform scheduler. If you have a lot of these tiny scripts, then I suggest picking a larger granularity for your Cuneiform foreign functions; see the sketch below.
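
As an illustration of coarser granularity, the three tiny sequence-generation calls from the example above could be folded into a single foreign function, so only one small call reaches the scheduler. This is only a sketch in the same (older) syntax used earlier in this thread; in particular, binding multiple outputs at once is assumed to work as written.

#
# One foreign call generates all three columns instead of three separate tiny calls
#
deftask gen-all-sequences( <col1(String)> <col2(String)> <col3(String)> : last1 last2 last3 ) in bash *{
col1=`for i in \`seq 0 $last1\`; do printf '%02d\n' $i; done`
col2=`for i in \`seq 0 $last2\`; do printf '%02d\n' $i; done`
col3=`for i in \`seq 0 $last3\`; do printf '%02d\n' $i; done`
}*

col1 col2 col3 = gen-all-sequences( last1 : 4, last2 : 3, last3 : 2 );
files = heavy-crunch( column1 : col1, column2 : col2, column3 : col3 );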