vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License
269 stars 45 forks source link

Retain the structure of variable for `input` grouping #1526

Closed gaow closed 7 months ago

gaow commented 7 months ago

Here is a example script:

[global]

def recover_structure(lst, data):
    iter_lst = iter(lst)
    return [[next(iter_lst) for _ in sublist] for sublist in data]

data = [['1.txt'], ['2.txt', '3.txt'], ['4.txt', '5.txt', '6.txt'], ['7.txt', '8.txt', '9.txt', '10.txt']]

[1]
output: data
bash: expand = True
 touch {_output}

[2]
input: group_by = lambda x: recover_structure(x, data)
print(_index, _input)

Here is the output of running this script:

$ sos run test.sos 

INFO: Running 1: 
INFO: 1 is completed.
INFO: 1 output:   1.txt 2.txt... (10 items)
INFO: Running 2: 
0 1.txt
1 2.txt 3.txt
INFO: 2 (index=0) is completed.
INFO: 2 (index=1) is completed.
3 7.txt 8.txt 9.txt 10.txt
INFO: 2 (index=3) is completed.
2 4.txt 5.txt 6.txt
INFO: 2 (index=2) is completed.
INFO: Workflow default (ID=w3d2c3f1eacdfd48c) is executed successfully with 2 completed steps and 5 completed substeps.

As you can see I basically want to group the input exactly by the same data structure it came in. Currently what SoS does is to first flatten the data object into a plain list to take as input, then I have to use a lambda function to add back the structure to it. The way this is implemented takes 3hrs to process the input of >60K flattened elements , before it can run any jobs. This seems a bit dumb.

Is there a more elegant way to achieve it?

BoPeng commented 7 months ago

This is how group_by is processed. However, the reason why your code is slow is likely because you are returning file objects from recover_structure, for which sos has to look for the index of the objects each time, which is not particularly efficient when the step_input list is long.

Please try to change

iter_lst = iter(lst)

to

iter_lst = iter(range(len(lst)))

and see if this is the case.

BoPeng commented 7 months ago

Also, please test the branch issue1526, which tries to optimize this use case, although the performance is likely still slower than returning indexes directly.

BoPeng commented 7 months ago

Assuming this is fixed.