Closed by gaow 11 months ago
This is how `group_by` is processed. However, the reason your code is slow is likely that you are returning file objects from `recover_structure`, so SoS has to look up the index of each returned object, which is not particularly efficient when the `step_input` list is long.
Please try to change `iter_lst = iter(lst)` to `iter_lst = iter(range(len(lst)))` and see if this is the case.
Also, please test the branch `issue1526`, which tries to optimize this use case, although the performance is likely still slower than returning indexes directly.
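
For illustration only — the body of `recover_structure` is not shown in this thread, so everything below is an assumed sketch rather than the original code — here is a minimal plain-Python comparison of yielding the file objects themselves versus yielding integer indices into the flattened list:

```python
# Minimal sketch, NOT the original recover_structure: `nested` is an assumed
# nested layout whose sub-lists define the desired groups over the flat list.

def recover_structure_by_object(nested, lst):
    """Yield groups of the objects themselves; SoS then has to search
    step_input for each object's position, which is costly for long inputs."""
    iter_lst = iter(lst)
    for sub in nested:
        yield [next(iter_lst) for _ in sub]

def recover_structure_by_index(nested, lst):
    """Yield groups of integer positions instead, avoiding those lookups."""
    iter_idx = iter(range(len(lst)))
    for sub in nested:
        yield [next(iter_idx) for _ in sub]

if __name__ == "__main__":
    nested = [["a1.txt", "a2.txt"], ["b1.txt"]]
    flat = [f for sub in nested for f in sub]
    print(list(recover_structure_by_object(nested, flat)))  # [['a1.txt', 'a2.txt'], ['b1.txt']]
    print(list(recover_structure_by_index(nested, flat)))   # [[0, 1], [2]]
```

If swapping in the index-based version removes most of the delay, that would confirm the per-object lookups are the bottleneck.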
Assuming this is fixed.
Here is an example script:
Here is the output of running this script:
As you can see, I basically want to group the input by exactly the same data structure it came in. Currently, SoS first flattens the `data` object into a plain list to take as `input`, and I then have to use a lambda function to add the structure back. The way this is implemented takes 3 hours to process an input of >60K flattened elements before it can run any jobs, which seems a bit dumb. Is there a more elegant way to achieve this?
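
Since the original example script is not reproduced above, here is a hypothetical sketch of the regrouping idea in plain Python rather than SoS step syntax (the `data` layout and file names are made up for illustration): flatten the nested structure once while recording which flat positions belong to each original sub-list, so that restoring the structure is a cheap index lookup instead of a search per object.

```python
# Plain-Python sketch with an assumed `data` layout; the original example
# script from this issue is not reproduced here.
data = {
    "sample_1": ["s1.chr1.bam", "s1.chr2.bam"],
    "sample_2": ["s2.chr1.bam"],
}

# Flatten once, remembering which flat positions belong to each sub-list.
flat, index_groups = [], []
for files in data.values():
    start = len(flat)
    flat.extend(files)
    index_groups.append(list(range(start, len(flat))))

print(flat)          # ['s1.chr1.bam', 's1.chr2.bam', 's2.chr1.bam']
print(index_groups)  # [[0, 1], [2]]

# Regrouping is then a direct index lookup per element.
groups = [[flat[i] for i in idx] for idx in index_groups]
print(groups)        # [['s1.chr1.bam', 's1.chr2.bam'], ['s2.chr1.bam']]
```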