Add a way to run DataWriter jobs in an isolated environment
Currently, VPJ launches the compute tasks from its local environment. If an environment only allows running Spark jobs via `spark-submit`, the entire VPJ logic needs to run on the Spark driver, making the driver non-idempotent.
With this change, we can separate the VPJ driver logic from the Spark compute environment and launch only the compute tasks on Spark. The driver on Spark is a very thin wrapper that only parses CLI args and launches the compute jobs. This also makes the Spark compute job idempotent, which improves resiliency.
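For context, a minimal sketch of what such a thin Spark-side entry point might look like is shown below. The class and helper names are hypothetical, and `runJob()` / `getTaskTracker()` are assumed method names on the `DataWriterComputeJob` interface, not necessarily the exact ones introduced by this change.

```java
// Illustrative only: class and helper names are hypothetical, and
// runJob()/getTaskTracker() are assumed names on DataWriterComputeJob.
// Import of DataWriterComputeJob is omitted (package depends on the repo layout).
import java.util.Properties;

public final class IsolatedSparkDriverSketch {

  public static void main(String[] args) throws Exception {
    // The VPJ driver serializes the job properties / configs as CLI args;
    // the only job of this entry point is to parse them back.
    Properties jobProps = parseJobProps(args);

    // Configure and run the actual compute job inside the Spark environment.
    DataWriterComputeJob computeJob = buildComputeJob(jobProps);
    computeJob.runJob();

    // Persist the tracker so the VPJ driver can read it back
    // (see the HDFS hand-off sketch further down).
    persistTaskTracker(computeJob.getTaskTracker(), jobProps);
  }

  private static Properties parseJobProps(String[] args) {
    // Assumes each CLI arg is a "key=value" pair produced by the VPJ driver.
    Properties props = new Properties();
    for (String arg : args) {
      int idx = arg.indexOf('=');
      if (idx > 0) {
        props.setProperty(arg.substring(0, idx), arg.substring(idx + 1));
      }
    }
    return props;
  }

  private static DataWriterComputeJob buildComputeJob(Properties jobProps) {
    // Placeholder: the real change wires the parsed configs into the Spark
    // implementation of DataWriterComputeJob here.
    throw new UnsupportedOperationException("sketch only");
  }

  private static void persistTaskTracker(Object taskTracker, Properties jobProps) {
    // Placeholder: see the DataWriterTaskTracker / HDFS sketch below.
  }
}
```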
Another benefit of this change is an improved debugging experience: the logs can now be viewed in the environment where the user's job was triggered, so they don't have to check the logs on the Spark driver.
This change is implemented as a new implementation of the `DataWriterComputeJob` interface, and all interactions with external systems are contained within this interface. The job properties and job configs are serialized as CLI args and passed to the isolated environment. The main class in the isolated environment parses these CLI args and configures the actual compute job. At the end of the compute job, the driver program in the isolated environment serializes the `DataWriterTaskTracker` to HDFS, and the VPJ driver program reads the same file from HDFS and returns it to VPJ to perform further validation and job polling.
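The HDFS hand-off of the `DataWriterTaskTracker` could look roughly like the sketch below. The result path, the use of plain Java serialization, and the helper names are assumptions for illustration only and may not match the actual serialization format used by this change.

```java
// Hypothetical sketch of the tracker hand-off through HDFS; path, format,
// and class names are illustrative only.
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class TaskTrackerHandOffSketch {
  // Illustrative location shared between the isolated environment and VPJ.
  private static final Path RESULT_PATH = new Path("/tmp/vpj/data-writer-task-tracker");

  /** Runs in the isolated environment after the compute job completes. */
  static void writeTracker(Object taskTracker) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataOutputStream out = fs.create(RESULT_PATH, true);
        ObjectOutputStream oos = new ObjectOutputStream(out)) {
      // Assumes the DataWriterTaskTracker (typed as Object here to keep the
      // sketch self-contained) is serializable.
      oos.writeObject(taskTracker);
    }
  }

  /** Runs in the VPJ driver to resume validation and job status polling. */
  static Object readTracker() throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (FSDataInputStream in = fs.open(RESULT_PATH);
        ObjectInputStream ois = new ObjectInputStream(in)) {
      return ois.readObject();
    }
  }
}
```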
How was this PR tested?
Tested manually and in integration tests. More testing is in progress. Unit tests need to be added.
Does this PR introduce any user-facing changes?
[X] No. You can skip the rest of this section.
[ ] Yes. Make sure to explain your proposed changes and call out the behavior change.