Open joapolarbear opened 2 years ago
Because the one-queue method can NOT generate correct topological sorting in some cases. For example, at time point t_0, worker A finishes op A_0, making A_1, A_2 ready to run; at time point t_1 > t_0, B_1 and B_2 at worker B are ready. Suppose A_1 and A_2 correspond to two communication ops C_1 and C_2, respectively, B_1 and B_2 correspond to C_3, C_4 respectively. If there is only one scheduling queue, A_1 and A_2 will be scheduled before B_1 and B_2, since the ready time of A_1 and A_2, t_0, is smaller than that of B_1 and B_2. It results in the communication ops being scheduled in the order of C_1, C_2, C_3, and C_4. However, in reality, worker B may finish B_1 before worker A finishes op A_2, resulting in the order of C_1, C_3, C_2, and C_4. Therefore, the one-queue method may generate incorrect topological sorting, leading to large simulation errors. To solve this problem, dPRO maintains a scheduling queue for each worker. In the case above, after worker A finishes A_1, worker B has a smaller "device time" (end time of the last scheduled op), so worker B will pop one op from its queue, that is B_1, making B_1 is scheduled before A_2.
dPRO's replayer uses a multi-frontier topological sorting algorithm. Why not just use one global queue as in Daydream.