The underlying format of the batch returned from `__call__` should still be `Dict[str, np.ndarray]`. Most likely, the output batch is being interpreted as a different format, which is why the batches look weird.
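For reference, a minimal sketch of a class-based UDF that honors this contract (the `batch_format="numpy"` argument and the column arithmetic are illustrative assumptions, not taken from the original report):

```python
import numpy as np
from typing import Dict

import ray

class Predictor:
    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        # Input and output are both column-name -> ndarray mappings;
        # every output array must have the same length.
        return {"id": batch["id"], "output": batch["id"] * 2}

ds = ray.data.range(10)  # dataset with a single "id" column
ds = ds.map_batches(Predictor, batch_size=5, concurrency=1, batch_format="numpy")
print(ds.take_all())
```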
What happened + What you expected to happen
We are using Ray v2.23 for batch inference on multi-modal data with LMMs.
We have observed that if an exception is raised while processing an item in a batch, the pending items from that batch spill over into the next batch of the succeeding task, causing subsequent tasks to fail due to overflow. Can anyone spot what we might be overlooking?
We are looking for a way to skip the problematic item and continue processing the remaining items in the batch. We have considered skipping the entire batch when an exception occurs, but that does not resolve the overflow issue.
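One way to keep the returned batch well-formed while dropping failed items is to catch the exception per item inside `__call__` and emit only the rows that succeeded. A minimal sketch, assuming a hypothetical `run_model` helper and an `"image"` input column (both illustrative, not from the original report):

```python
import numpy as np
from typing import Dict

def run_model(image: np.ndarray) -> float:
    # Stand-in for the real LMM call (hypothetical); raises on items it cannot handle.
    if image.size == 0:
        raise ValueError("empty item")
    return float(image.mean())

class SkippingPredictor:
    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        ids, outputs = [], []
        for item_id, image in zip(batch["id"], batch["image"]):
            try:
                outputs.append(run_model(image))
                ids.append(item_id)
            except Exception:
                # Skip only the problematic item; the rest of the batch proceeds.
                continue
        # Emit only the successful rows, still as Dict[str, np.ndarray].
        return {"id": np.asarray(ids), "output": np.asarray(outputs)}
```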
Below is an example of the Actor log in which we observe the batch overflow after an exception. Batch size = 1000. The exception occurs while processing task `125/960`, after which the counts start to overflow.
Versions / Dependencies
Ray 2.23
Reproduction script
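The original script was not included; below is a hypothetical minimal sketch of the kind of pipeline described above, in which one item raises during `__call__` (the failing index, column names, and `batch_format` are illustrative assumptions):

```python
import numpy as np
from typing import Dict

import ray

class FailingPredictor:
    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        outputs = []
        for item_id in batch["id"]:
            if item_id == 125:  # illustrative: one item in the batch raises
                raise ValueError(f"failed on item {item_id}")
            outputs.append(item_id * 2)
        return {"id": batch["id"], "output": np.asarray(outputs)}

ds = ray.data.range(1000)
ds = ds.map_batches(FailingPredictor, batch_size=1000, concurrency=1, batch_format="numpy")
ds.materialize()  # the exception surfaces here
```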
Issue Severity
High: It blocks me from completing my task.