OOM while running with a large number of HDFS files

In our environment, we tried to run with a large number of HDFS dataset files and hit an out-of-memory (OOM) error.

According to https://beam.apache.org/documentation/runners/direct/, the direct runner appears to load the entire dataset into memory, which is probably why we are failing.

Making the Ray Beam runner work in our production environment is a high priority for us.

ray-project / ray_beam_runner