Closed — joelansbro closed this issue 2 years ago
Running PySpark has led to further considerations around the overall project structure. PySpark is designed to run in a distributed fashion, with actions performed across multiple threads.
Reading in several different JSON files kept erroring, as Hadoop needs to be configured for the data to be stored in HDFS. Because I'm running locally, Spark is unable to continue.
I could load the JSON files one by one, but processing them sequentially defeats the purpose of Spark in the first place.
I am continuing the data cleaning with Pandas instead.
Managed to get a script working that ingests multiple JSON files with my needed schema and outputs a CSV table.
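The intake script above isn't shown here, but the approach can be sketched roughly as follows. This is a minimal, hypothetical version (the function name, glob pattern, and column list are assumptions, not the actual script): read each JSON file, keep only the columns from the needed schema, and concatenate everything into one CSV table.

```python
import glob
import json

import pandas as pd


def intake_json_to_csv(input_glob, output_csv, columns):
    """Hypothetical sketch: read multiple JSON files, keep only the
    columns in the expected schema, and write one combined CSV table."""
    frames = []
    for path in glob.glob(input_glob):
        with open(path) as f:
            record = json.load(f)  # assume one JSON object per file
        frames.append(pd.json_normalize(record))
    combined = pd.concat(frames, ignore_index=True)
    # Enforce the expected schema: missing fields become NaN,
    # extra fields are dropped
    combined = combined.reindex(columns=columns)
    combined.to_csv(output_csv, index=False)
    return combined
```

Because Pandas runs in a single process, there is no HDFS requirement, which is what unblocked this locally.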
Evaluation:
Closing this, as there is a refactor intakejobs.py ticket that explains further developments.
intakejobs.py needs to process a chunk of raw JSON data and append it to a larger JSON format for storage and retrieval within the database.
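The chunk-append step could look something like the sketch below. This is an assumption about how intakejobs.py might work, not its actual implementation: each incoming chunk (a list of raw JSON records) is appended to one larger JSON array on disk, which can later be loaded for storage in the database.

```python
import json
import os


def append_chunk(store_path, chunk):
    """Hypothetical sketch: append a chunk (list of raw JSON records)
    to a larger JSON array on disk, creating the store if absent."""
    if os.path.exists(store_path):
        with open(store_path) as f:
            records = json.load(f)
    else:
        records = []
    records.extend(chunk)
    # Write to a temp file, then rename, so a crash mid-write
    # cannot leave the store truncated
    tmp_path = store_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(records, f)
    os.replace(tmp_path, store_path)
    return len(records)
```

One design note: rewriting the whole array keeps the store valid JSON, but grows linearly in cost; if chunks get large, an append-only format such as JSON Lines would avoid re-reading the file on every call.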
The end goal is to have the data cleaned and formatted in a uniform manner in line with the specification required for usage in report generation
There is a separate task currently open for assessing what data processing needs to be accomplished.