rjurney / Agile_Data_Code_2

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
http://bit.ly/agile_data_science
MIT License

rewrite the instruction comment to provide better guidance #86

Closed pjhinton closed 5 years ago

pjhinton commented 5 years ago

This commit is a major rewrite of the instruction comment at the top of pyspark_mongodb.py. The original has several issues:

1) An environment variable assignment and the command to launch pyspark are displayed on a single line with continuation characters, which gives the false impression that the environment variable is to be set to the whole thing.

2) The paths to the JAR files are expressed relative to the directory containing the script, which presumes that pyspark is launched from that directory. This contradicts the guidance in the subsection "Running the Code". Indeed, launching from the ch02 directory produces a file-not-found error for the CSV, because the CSV path is relative to the parent directory.

3) The JAR files have versions that may not match the environment set up by the bootstrap.sh script.

4) It is not clear if this script is to be run standalone or executed line-by-line in a pyspark session.
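To make issue 2 concrete, here is a minimal sketch of how the same relative paths resolve differently depending on the launch directory. The repository location, the CSV path, and the JAR path are all hypothetical placeholders, not the actual values in pyspark_mongodb.py.

```python
import posixpath


def resolve(cwd, relative_path):
    """Resolve relative_path against cwd, as a process launched from cwd would."""
    return posixpath.normpath(posixpath.join(cwd, relative_path))


# Hypothetical locations, for illustration only.
REPO_ROOT = "/home/user/Agile_Data_Code_2"
CH02_DIR = posixpath.join(REPO_ROOT, "ch02")

csv_path = "data/example.csv"    # assumed: written relative to the repository root
jar_path = "../lib/example.jar"  # assumed: written relative to the ch02 directory

# Launching from ch02 makes the JAR path work but breaks the CSV path.
jar_from_ch02 = resolve(CH02_DIR, jar_path)
csv_from_ch02 = resolve(CH02_DIR, csv_path)

# Launching from the repository root fixes the CSV path but breaks the JAR path.
csv_from_root = resolve(REPO_ROOT, csv_path)
jar_from_root = resolve(REPO_ROOT, jar_path)

for label, path in [
    ("JAR from ch02", jar_from_ch02),
    ("CSV from ch02", csv_from_ch02),
    ("CSV from root", csv_from_root),
    ("JAR from root", jar_from_root),
]:
    print(f"{label}: {path}")
```

No single launch directory satisfies both kinds of path, so both need to be expressed relative to the same directory.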

The rewrite of the comment provides more background, alerting the user that the script is to be run in a pyspark session. The environment variable assignment is separated from the pyspark launch command. The reader is encouraged to check the versions of the JAR files and then set the versions in environment variables to reduce the chance of typos. The reader is also advised that the script is to be run in the top-level directory of the Git repository.
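The idea of keeping JAR versions in environment variables can be sketched in Python as follows. This is an illustration only, not the text of the rewritten comment: the helper `build_jars_argument`, the artifact names, and the variable names are hypothetical. The one detail taken from Spark itself is that `--jars` accepts a comma-separated list of paths.

```python
import os


def build_jars_argument(lib_dir, artifacts, env=None):
    """Build a comma-separated value suitable for pyspark's --jars option.

    artifacts is a list of (artifact_name, version_variable) pairs. Each
    version is read from the environment so it is typed once, not repeated
    inside every path.
    """
    env = os.environ if env is None else env
    jars = []
    for name, version_var in artifacts:
        version = env.get(version_var)
        if not version:
            raise KeyError(f"set {version_var} before launching pyspark")
        jars.append(f"{lib_dir}/{name}-{version}.jar")
    return ",".join(jars)


# Hypothetical artifact and variable names; check the lib directory for the
# versions actually installed in your environment.
example_env = {"EXAMPLE_DRIVER_VERSION": "1.2.3", "EXAMPLE_CONNECTOR_VERSION": "4.5.6"}
example_artifacts = [
    ("example-driver", "EXAMPLE_DRIVER_VERSION"),
    ("example-connector", "EXAMPLE_CONNECTOR_VERSION"),
]
jars_argument = build_jars_argument("lib", example_artifacts, env=example_env)
print(jars_argument)
```

Because every path is relative to `lib`, this sketch assumes the launch happens from the top-level directory of the repository, as the rewritten comment advises.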

rjurney commented 5 years ago

Thanks so much!