Maybe a quantisation process for the sort buckets might make the algorithm more efficient.
Alternatively, adding an additional pass to determine the size of each bucket by counting how many times a given index gets hit by the sort-mapping algorithm will allow us to map sorts more directly into their position without needing a temporary vector of vectors (this approach would still require sub-sorting, however, although we could keep track of whether numbers to be added into it are in-sequence and use this a shortcut to prevent unneeded sub-sorting).
Allocate a vector of same size as sort array
Create a map of int->int which will track how many times each sort position is to be used
Run an additional pass on the sort array after stats-deriving phase to determine how many times each sort position is used, using the map created in the previous step to track them
Modify mapping algorithm to take the sorts map into account when mapping data, so that there are no "gaps" in the data when sorted on a first-pass.
The map is used to keep track of the sort "buckets" and can be used to iterate over them for any further sort passes that are needed
Extension step:
Determine which buckets will and will not be already self-sorted when counting their size, by remembering the last value put in each and comparing each successive value with this one to determine if sorting is preserved across all insertions. This reduces the need to check for buckets sortedness when running second-pass.
Extension step: