Closed aslotnick closed 2 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 61.26%. Comparing base (`6d755f9`) to head (`bfcec57`). Report is 15 commits behind head on main.
When writing to sparkey, `allShards` represents every expected shard, even if there is no corresponding data in `shards` for that shard number. `shards.rightOuterJoin(allShards)` (added in https://github.com/spotify/scio/pull/5208) fails when a shard contains large amounts of data, leading to the error described in https://github.com/spotify/scio/issues/5300: `java.lang.OutOfMemoryError: Required array length 2147483639 + 15534 is too large`.

This PR replaces `rightOuterJoin` with `hashFullOuterJoin` (note that there is no `hashRightOuterJoin` implementation). A hash join is a good fit because the right-hand side contains very little data (only the keys of the shards) and it doesn't need to use an array to represent the large left-hand side's values. As a result, some workflows that succeeded in Scio 0.13.* but currently fail will run successfully again.
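
To illustrate why a hash join suits this shape of data, here is a minimal sketch using plain Scala collections. It is not Scio's implementation: `HashJoinSketch`, its `hashFullOuterJoin` helper, and the sample shard data are all hypothetical, and the sketch assumes the right-hand side fits comfortably in memory as a map.

```scala
// Hypothetical, simplified model of a hash full outer join.
// Not Scio's code: it only illustrates the idea with plain collections.
object HashJoinSketch {

  // `right` is assumed to be small (for example, one entry per expected shard),
  // so it is held in memory as a map. `left` is streamed one element at a time
  // and is never collected into a per-key array.
  def hashFullOuterJoin[K, V, W](
      left: Iterator[(K, V)],
      right: Map[K, W]
  ): Iterator[(K, (Option[V], Option[W]))] = {
    // Right-side keys that matched at least one left-side element.
    val seen = scala.collection.mutable.Set.empty[K]

    val streamed: Iterator[(K, (Option[V], Option[W]))] =
      left.map { case (k, v) =>
        val w = right.get(k)
        if (w.isDefined) seen += k
        (k, (Some(v), w))
      }

    // Evaluated lazily, so `seen` is complete once `streamed` is exhausted.
    val rightOnly: Iterator[(K, (Option[V], Option[W]))] =
      right.iterator.collect {
        case (k, w) if !seen(k) => (k, (None, Some(w)))
      }

    streamed ++ rightOnly
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: shard 1 is expected but has no records.
    val shards = Iterator((0, "a"), (0, "b"), (2, "c"))
    val allShards = (0 until 3).map(k => (k, ())).toMap

    hashFullOuterJoin(shards, allShards).foreach(println)
  }
}
```

In this sketch, shard 1 still appears in the output with `None` on the left, which is the property the join is needed for here. If every key in `shards` also appears in `allShards` (an assumption of this sketch), the right-hand value is always defined, so the result carries the same information as a right outer join.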