ssec-jhu / dplutils

Distributed(Data) Pipeline Uitilities
BSD 3-Clause "New" or "Revised" License
1 stars 0 forks source link

Tag outputs with task from which they came #66

Closed amitschang closed 4 months ago

amitschang commented 5 months ago

To enable partitioning outputs by the task that emitted them. This comes with a breaking change in the interface to executor, since it now emits OutputBatch instead of dataframe, but this seems low risk at the moment, and any users of writeto will remain back-compatible. The OutputBatch structure (as opposed to tuple, for instance) was chosen to enable future metadata additions.

Resolves https://github.com/ssec-jhu/dplutils/issues/61

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 100.00%. Comparing base (8921afa) to head (583c41f).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #66 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 11 11 Lines 558 574 +16 ========================================= + Hits 558 574 +16 ``` | [Flag](https://app.codecov.io/gh/ssec-jhu/dplutils/pull/66/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ssec-jhu) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/ssec-jhu/dplutils/pull/66/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ssec-jhu) | `100.00% <100.00%> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ssec-jhu#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

xiangchenjhu commented 5 months ago

Is it necessary to modify our task-specific input/output routines to work with the new OutputBatch format instead of data frames?

amitschang commented 5 months ago

@xiangchenjhu, nope - this only affects what gets yielded at the end of the pipeline. So iso

for df in pl.run():
   df.to_csv(...)

for example, one would instead:

for batch in pl.run():
  batch.data.to_csv(...)

Only for run, writeto is back-compatible.

xiangchenjhu commented 5 months ago

Understood, so this change impacts only the end-of-pipeline usage, while the data flow within each task (both input and output) remains as DataFrames.