Fix streaming input for batch and multiple sources

amitschang commented 1 month ago

This fixes the behavior of multiple source tasks to match that of a branched task (broadcast outputs to children). It also adds support for generated input of length > 1 to be handled as expected included possibly splitting the batch.

Resolves https://github.com/ssec-jhu/dplutils/issues/60 Resolves https://github.com/ssec-jhu/dplutils/issues/76

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (f4a82f2) to head (118a5d6).

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #99 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 11 11 Lines 595 602 +7 ========================================= + Hits 595 602 +7 ``` | [Flag](https://app.codecov.io/gh/ssec-jhu/dplutils/pull/99/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ssec-jhu) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/ssec-jhu/dplutils/pull/99/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ssec-jhu) | `100.00% <100.00%> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ssec-jhu#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

xiangchenjhu commented 3 weeks ago

I have an overall question about this PR and a minor comment about the sentence change.

If my understanding is correct, this PR (#60) changes the task processing mode from processing batches separately to processing them duplicately.

Considerations: Separate Mode: Previously, each task processed different batches independently. Duplicate Mode: Now, all tasks receive and process the same batch of data.

My Questions: Why is this change necessary? Will the duplicate mode cause redundant calculations? Should we keep both modes to handle different use cases? For example, in the bluephos pipeline, we use separate mode? Thank you for the clarification!

amitschang commented 3 weeks ago

@xiangchenjhu, thanks for looking. I will try to clarify: There are two aspects which are so closely related they come in this one PR:

The first is only relevant to pipelines that have multiple sources, for example if the pipeline looks like:
```
A   
\____ C
/
B
```
prior to this change the generator would call next twice (once for A and once for B), but what we desire is that each task A and B get all input data. The change makes it so next is called once and that batch goes to both. It matches the behavior of a pipeline that branches after the first input task.
The second effect is that input batch size will be correctly processed. Before this change, if the source generator made dataframes of len 2, for example, and the first task has batch_size set to 2, it would end up getting a dataframe of len 4. The fix makes it so it would correctly get the len 2 dataframe.

I believe that this change has no effect on the bluephos pipeline as it currently is setup: there is only one source task, and the generated dataframes are len 1, so same behavior as before will be exhibited.

ssec-jhu / dplutils

Fix streaming input for batch and multiple sources #99

Codecov Report