We are happy to present the new 2.52.0 release of Beam.
This release includes both improvements and new functionality.
See the download page for this release.
Highlights
Previously deprecated Avro-dependent code (Beam Release 2.46.0) has finally been removed from the Java SDK "core" package. Please use beam-sdks-java-extensions-avro instead. This allows you to update the Avro version in user code without potential breaking changes in Beam "core", since the Beam Avro extension already supports the latest Avro versions (#25252).
Publishing Java 21 SDK container images is now supported as part of the Apache Beam release process. (#28120)
The Direct Runner and Dataflow Runner support running pipelines on Java 21 (experimental until tests are fully set up). For other runners (Flink, Spark, Samza, etc.), support status depends on the runner projects.
New Features / Improvements
Added the UseDataStreamForBatch pipeline option to the Flink runner. When set to true, the Flink runner runs batch jobs using the DataStream API. By default the option is set to false, so batch jobs are still executed using the DataSet API.
The upload_graph experiment option for DataflowRunner is no longer required when the graph is larger than 10 MB for the Java SDK (#28621).
The state and side input cache is now enabled with a default size of 100 MB. Use --max_cache_memory_usage_mb=X to set the cache size for the user state API and side inputs (Python) (#28770).
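For Python users, setting the cache size might look like the following minimal sketch; the flag name comes from the note above, and the toy pipeline is purely illustrative:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Allow up to 200 MB for the user state API and side input caches.
options = PipelineOptions(["--max_cache_memory_usage_mb=200"])

with beam.Pipeline(options=options) as pipeline:
    _ = pipeline | beam.Create([1, 2, 3]) | beam.Map(print)
```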
Beam YAML stable release. Beam pipelines can now be written using YAML and leverage the Beam YAML framework, which includes a preliminary set of IOs and turnkey transforms. More information can be found in the YAML root folder and in the README.
In Python, VertexAIModelHandlerJSON now supports passing in inference_args, which are passed through to the Vertex endpoint as parameters; see the sketch below.
Added support for running mypy on user pipelines (#27906).
Python SDK worker start-up logs and crash logs are now captured by a buffer and logged at appropriate levels via the Beam logging API. Dataflow Runner users might observe that most worker-startup log content is now captured by the worker logger. Users who relied on print() statements for logging might notice that some logs don't flush before the pipeline succeeds; we strongly advise using the logging package instead of print() statements, as in the example below. (#28317)
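A hedged sketch of the new inference_args support; the endpoint ID, project, and the "temperature" parameter below are placeholders for illustration, not values from this release:

```python
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.vertex_ai_inference import VertexAIModelHandlerJSON

model_handler = VertexAIModelHandlerJSON(
    endpoint_id="my-endpoint-id",  # hypothetical endpoint
    project="my-gcp-project",      # hypothetical project
    location="us-central1",
)

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create([{"prompt": "hello"}])  # example instance payload
        # inference_args are forwarded to the Vertex endpoint as parameters.
        | RunInference(model_handler, inference_args={"temperature": 0.2})
    )
```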
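And a minimal example of the recommended logging style in pipeline code:

```python
import logging

import apache_beam as beam

class ProcessFn(beam.DoFn):
    def process(self, element):
        # Routed through the Beam logging API; unlike print(), this output
        # is captured by the worker logger at the chosen level.
        logging.info("processing element: %s", element)
        yield element
```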
Breaking Changes
org.apache.beam.sdk.io.CountingSource.CounterMark now uses a custom CounterMarkCoder as its default coder, since all Avro-dependent classes have moved to extensions/avro. If you still need to use AvroCoder for CounterMark, as a workaround you can place a copy of the "old" CountingSource class into your project and use it directly (#25252).
Renamed host to firestoreHost in FirestoreOptions to avoid a potential conflict of command-line arguments (Java) (#29201).
Bugfixes
Fixed "Desired bundle size 0 bytes must be greater than 0" in Java SDK's BigtableIO.BigtableSource when you have more cores than bytes to read (Java) #28793.
MLTransform no longer outputs artifacts such as min, max, and quantiles. Instead, MLTransform will add a feature to output these artifacts in a human-readable format (#29017). For now, to use artifacts such as min and max that were produced by an earlier MLTransform, use read_artifact_location of MLTransform, which reads artifacts produced earlier by a different MLTransform (#29016); see the sketch below.
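A hedged sketch of the read_artifact_location flow; the artifact path and the ScaleTo01 transform are illustrative (ScaleTo01 requires the tensorflow_transform dependency):

```python
import apache_beam as beam
from apache_beam.ml.transforms.base import MLTransform
from apache_beam.ml.transforms.tft import ScaleTo01

artifacts = "gs://my-bucket/ml-artifacts"  # hypothetical location

# First pipeline: compute and write artifacts (e.g. min/max for scaling).
with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{"x": 1.0}, {"x": 5.0}])
        | MLTransform(write_artifact_location=artifacts).with_transform(
            ScaleTo01(columns=["x"]))
    )

# Later pipeline: reuse the previously produced artifacts.
with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{"x": 3.0}])
        | MLTransform(read_artifact_location=artifacts)
    )
```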
Fixed a memory leak that affected some long-running Python pipelines (#28246).