joelansbro / pipeline

API Pipeline DB middleware

Create intakejobs.py #4

Closed joelansbro closed 2 years ago

joelansbro commented 2 years ago

intakejobs.py needs to process a chunk of raw JSON data and append it to a larger JSON structure for storage and retrieval within the database.

The end goal is to have the data cleaned and formatted uniformly, in line with the specification required for report generation.

There is currently a separate task for assessing what data processing needs to be done.
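
As a rough illustration of the append step described above, here is a minimal sketch that assumes the larger JSON structure is a single array kept in a file. The function name `append_chunk` and the file-based store are hypothetical; the actual storage layout is not specified in this issue.

```python
import json
from pathlib import Path


def append_chunk(store_path, chunk):
    """Append a list of raw records to a JSON array stored at store_path.

    Hypothetical helper: assumes the store is one JSON array in one file.
    """
    path = Path(store_path)
    # Start from the existing array if the store file already exists.
    if path.exists():
        records = json.loads(path.read_text(encoding="utf-8"))
    else:
        records = []
    records.extend(chunk)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    # Return the new total so callers can sanity-check the append.
    return len(records)
```

Rewriting the whole file on every append is simple but only reasonable for small stores; a real database-backed version would insert rows instead.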

joelansbro commented 2 years ago

Running PySpark has led to further considerations around the overall project structure. PySpark is meant to run in a distributed fashion, with actions performed across multiple threads.

Reading in several different JSON files kept erroring, as Hadoop needs to be configured for the data to be stored in HDFS. Because I'm running locally, Spark is unable to continue.

I could load the JSON files one by one, but that sequential manner defeats the purpose of Spark in the first place.

I am continuing the data cleaning with Pandas instead.

joelansbro commented 2 years ago

Managed to get a script working that takes in multiple JSON files matching my required schema and outputs a CSV table.
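
For readers following along, a script of this kind might look roughly like the sketch below. It is not the actual intakejobs.py: the column names in `COLUMNS`, the function name `json_files_to_csv`, and the one-directory-of-`.json`-files layout are all assumptions made for illustration.

```python
import json
from pathlib import Path

import pandas as pd

# Hypothetical schema; the real column list is not given in this issue.
COLUMNS = ["title", "author", "published", "content"]


def json_files_to_csv(input_dir, output_csv, columns=COLUMNS):
    """Read every *.json file in input_dir and write one CSV table."""
    rows = []
    for path in sorted(Path(input_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Accept either a single object or a list of objects per file.
        rows.extend(data if isinstance(data, list) else [data])
    # reindex keeps only the schema columns, in order, and fills
    # any column missing from the input with NaN.
    frame = pd.DataFrame(rows).reindex(columns=columns)
    frame.to_csv(output_csv, index=False)
    return frame
```

Calling `json_files_to_csv("raw/", "articles.csv")` (both paths hypothetical) would drop any keys outside the schema and leave missing fields empty in the CSV.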

Evaluation:

jagithub2 commented 2 years ago

Closing this, as there is a "refactor intakejobs.py" ticket that explains further developments.