ssec-jhu / dplutils

Distributed(Data) Pipeline Uitilities
BSD 3-Clause "New" or "Revised" License
1 stars 0 forks source link

Fix streaming input for batch and multiple sources #99

Closed amitschang closed 3 weeks ago

amitschang commented 1 month ago

This fixes the behavior of multiple source tasks to match that of a branched task (broadcast outputs to children). It also adds support for generated input of length > 1 to be handled as expected included possibly splitting the batch.

Resolves https://github.com/ssec-jhu/dplutils/issues/60 Resolves https://github.com/ssec-jhu/dplutils/issues/76

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (f4a82f2) to head (118a5d6).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #99 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 11 11 Lines 595 602 +7 ========================================= + Hits 595 602 +7 ``` | [Flag](https://app.codecov.io/gh/ssec-jhu/dplutils/pull/99/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ssec-jhu) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/ssec-jhu/dplutils/pull/99/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ssec-jhu) | `100.00% <100.00%> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ssec-jhu#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

xiangchenjhu commented 3 weeks ago

I have an overall question about this PR and a minor comment about the sentence change.

If my understanding is correct, this PR (#60) changes the task processing mode from processing batches separately to processing them duplicately.

Considerations: Separate Mode: Previously, each task processed different batches independently. Duplicate Mode: Now, all tasks receive and process the same batch of data.

My Questions: Why is this change necessary? Will the duplicate mode cause redundant calculations? Should we keep both modes to handle different use cases? For example, in the bluephos pipeline, we use separate mode? Thank you for the clarification!

amitschang commented 3 weeks ago

@xiangchenjhu, thanks for looking. I will try to clarify: There are two aspects which are so closely related they come in this one PR:

I believe that this change has no effect on the bluephos pipeline as it currently is setup: there is only one source task, and the generated dataframes are len 1, so same behavior as before will be exhibited.