mesosphere / spark-build

Used to build the mesosphere/spark docker image and the DC/OS Spark package

[DCOS-59720] Introduce Spark Job for MWT #556

Closed by alembiewski 5 years ago

alembiewski commented 5 years ago

What changes were proposed in this pull request?

Resolves DCOS-59720 [DS] [Spark Operator] Create a better Spark Job for MWT

This PR introduces two Spark applications:

1) DatasetGenerator: creates a dataset with a specified record count and record size and writes the result to an S3 bucket.
2) DatasetSort: reads data from an S3 location and sorts the resulting DataFrame.
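To illustrate what "record count" and "record size" mean for the two applications, here is a minimal plain-Python sketch. It is not the PR's code: the actual applications are Spark jobs that write to and read from S3, while this stand-in works on an in-memory list. The function names `generate_records` and `sort_records`, the `seed` parameter, and the lowercase-letter record content are all hypothetical.

```python
import random
import string


def generate_records(record_count, record_size, seed=0):
    """Return `record_count` random strings, each `record_size` characters long.

    Hypothetical stand-in for the generator step; the real job writes to S3.
    """
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    alphabet = string.ascii_lowercase
    return [
        "".join(rng.choices(alphabet, k=record_size))
        for _ in range(record_count)
    ]


def sort_records(records):
    """Return the records in ascending order (stand-in for the sort step)."""
    return sorted(records)


if __name__ == "__main__":
    data = generate_records(record_count=5, record_size=8)
    for record in sort_records(data):
        print(record)
```

Total dataset volume scales as record count times record size, so the two parameters together control how much data the sort step has to process.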

How were these changes tested?

Release Notes

n/a

akirillov commented 5 years ago

Thanks, @alembiewski. Sorter and Generator look good and were battle-tested during scale tests. However, CI tests are failing, and the failures appear to be caused by the hadoop-aws version bump, which deprecates s3n URLs. The culprit seems to be here: https://github.com/mesosphere/spark-build/blob/master/spark-testing/spark_s3.py#L9-L16. We need to switch to s3a URLs and modify the tests as needed.
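A minimal sketch of the URL switch, assuming the test helper builds bucket URLs as plain strings. The helper name `to_s3a` and the example bucket and key are hypothetical; the actual contents of `spark_s3.py` are not shown in this thread, so the real change may look different.

```python
from urllib.parse import urlsplit, urlunsplit


def to_s3a(url):
    """Rewrite an s3n:// URL to the s3a:// scheme; leave other URLs unchanged."""
    parts = urlsplit(url)
    if parts.scheme == "s3n":
        # Swap only the scheme; bucket (netloc) and key (path) stay as they were.
        return urlunsplit(parts._replace(scheme="s3a"))
    return url


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only.
    print(to_s3a("s3n://my-bucket/path/data.txt"))
```

Changing the scheme alone may not be enough: any Hadoop configuration properties set under the `fs.s3n.` prefix would likely need `fs.s3a.` equivalents as well.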