josephmachado / efficient_data_processing_spark

Code for "Efficient Data Processing in Spark" Course
https://josephmachado.podia.com/efficient-data-processing-in-spark
243 stars 54 forks

setup workspace with devcontainer #5

Closed luutuankiet closed 5 months ago

luutuankiet commented 5 months ago

Added devcontainer config and setup scripts to spin up a host devcontainer matching the Spark master's environment in efficient_data_processing_spark/data-processing-spark/1-lab-setup/containers/spark.

The devcontainer can perform imports and run arbitrary code for further exploration if needed.

josephmachado commented 5 months ago

Wow thank you for the great work @luutuankiet

I am wondering if it would be possible to add any tests for this?

luutuankiet commented 5 months ago

@josephmachado good catch, I don't have much experience writing these kinds of tests for containers, but I'll give it a shot!

For now I am leaning toward a workflow that builds a container from the devcontainer.json instructions, then runs the primary make commands from the course, for example make up, make setup, etc. The test should succeed if the make commands run as expected, but I think most of the make commands are docker exec/run calls that don't really return an exit code to assert on... I'll do some digging and let you know if this can be achieved. Otherwise, let's reject and close this PR if it doesn't make much sense test-wise.
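
A minimal sketch of the exit-code check described above, not the workflow that was eventually used. It assumes `make` and Docker are available inside the built container. `docker run` and `docker exec` normally exit with the status of the command they ran, and `make` returns a non-zero status when a recipe line fails unless the error is explicitly ignored, so asserting on the exit code may be workable. The helper name `first_failing_step` is hypothetical.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence


def first_failing_step(
    steps: Sequence[Sequence[str]],
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Optional[Sequence[str]]:
    """Run each command in order and return the first one that exits non-zero.

    Returns None when every command exits with status 0. Later steps are
    skipped after a failure, mirroring how a CI job would stop.
    """
    for step in steps:
        # No shell is involved; each step is an argv-style list.
        # Raises FileNotFoundError if the executable itself is missing.
        result = run(list(step))
        if result.returncode != 0:
            return step
    return None
```

From the repo root, `first_failing_step([["make", "up"], ["make", "setup"]])` would return `None` only if both targets exit cleanly. The two target names come from this thread; any further targets would need to be added by hand.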

For now you can test it yourself by opening the commit in a codespace. It actually helped me debug a network-related issue when I tried to set up the lab.

josephmachado commented 5 months ago

Ah, I should clarify here, @luutuankiet: if you can record a video or add instructions on how to use this, others and I can benefit from it.

With the instructions I'll try to recreate it (locally and on codespaces) and if that works, we should be good to go.

luutuankiet commented 5 months ago

@josephmachado gotcha, please find the instructions below:

  1. Clone the repo locally, then run the VS Code command "Rebuild and Reopen in Container" or "Build and Reopen in Container" to spin up the host container image.

Alternatively, head over to my branch and hit "open in codespace", which will spin up a host container on GitHub Codespaces.

  2. Wait for the container to finish building. The first build will take some time, as I've bundled a couple of VS Code extensions in devcontainer.json and utilities in postCreateCommand.sh. Feel free to comment out features you don't need, as long as the feature
    • is in the customizations.extensions block of devcontainer.json ("customizations.settings" is required for the paths to work)
    • is not related to PYTHON in postCreateCommand.sh

(Other files, such as env_init.sh and source_env.sh, must also be kept as is so that the scripts are sourced and invoked correctly. Also, run the VS Code command "Reload Window" if VS Code shows any extension errors after the containers finish building.)

  3. Once the container is up, run make up and the other make commands to test the setup.

  4. The devcontainer should now show context definitions when you hover over any code in the data-processing-spark folder.
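
As a quick sanity check before rebuilding, the files the instructions say must be kept can be verified with a short script. This is only a sketch: the thread names the files but not where they live, so the `.devcontainer/` paths below are an assumption, and `missing_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed locations: the file names come from the instructions above,
# but the .devcontainer/ directory is a guess and may need adjusting.
REQUIRED_FILES = (
    ".devcontainer/devcontainer.json",
    ".devcontainer/postCreateCommand.sh",
    ".devcontainer/env_init.sh",
    ".devcontainer/source_env.sh",
)


def missing_files(repo_root: str, required: Iterable[str] = REQUIRED_FILES) -> List[str]:
    """Return the required paths (relative to repo_root) that are not regular files."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_file()]
```

Running `missing_files(".")` from the repo root returns an empty list when every expected file is present, and otherwise lists the ones to restore.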

A closing note: I find this setup beneficial for anyone who likes to get their hands dirty by exploring the source code and understanding how to rebuild it all. The make commands are intuitive as they are, but not without limitations: they wrap docker exec/run commands, which abstracts away the code flow.

josephmachado commented 5 months ago

This is great, TY