vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License
274 stars 45 forks source link

Retain the structure of variable for `input` grouping #1526

Closed gaow closed 11 months ago

gaow commented 11 months ago

Here is a example script:

[global]

def recover_structure(lst, data):
    iter_lst = iter(lst)
    return [[next(iter_lst) for _ in sublist] for sublist in data]

data = [['1.txt'], ['2.txt', '3.txt'], ['4.txt', '5.txt', '6.txt'], ['7.txt', '8.txt', '9.txt', '10.txt']]

[1]
output: data
bash: expand = True
 touch {_output}

[2]
input: group_by = lambda x: recover_structure(x, data)
print(_index, _input)

Here is the output of running this script:

$ sos run test.sos 

INFO: Running 1: 
INFO: 1 is completed.
INFO: 1 output:   1.txt 2.txt... (10 items)
INFO: Running 2: 
0 1.txt
1 2.txt 3.txt
INFO: 2 (index=0) is completed.
INFO: 2 (index=1) is completed.
3 7.txt 8.txt 9.txt 10.txt
INFO: 2 (index=3) is completed.
2 4.txt 5.txt 6.txt
INFO: 2 (index=2) is completed.
INFO: Workflow default (ID=w3d2c3f1eacdfd48c) is executed successfully with 2 completed steps and 5 completed substeps.

As you can see I basically want to group the input exactly by the same data structure it came in. Currently what SoS does is to first flatten the data object into a plain list to take as input, then I have to use a lambda function to add back the structure to it. The way this is implemented takes 3hrs to process the input of >60K flattened elements , before it can run any jobs. This seems a bit dumb.

Is there a more elegant way to achieve it?

BoPeng commented 11 months ago

This is how group_by is processed. However, the reason why your code is slow is likely because you are returning file objects from recover_structure, for which sos has to look for the index of the objects each time, which is not particularly efficient when the step_input list is long.

Please try to change

iter_lst = iter(lst)

to

iter_lst = iter(range(len(lst)))

and see if this is the case.

BoPeng commented 11 months ago

Also, please test the branch issue1526, which tries to optimize this use case, although the performance is likely still slower than returning indexes directly.

BoPeng commented 11 months ago

Assuming this is fixed.