hackforla / lucky-parking

Visualization of parking data to assist in understanding of the effects of parking policies on a neighborhood by neighborhood basis in the City of Los Angeles
https://www.hackforla.org/projects/lucky-parking.html

Create data cleaning pipeline in AWS #149

Open gregpawin opened 4 years ago

gregpawin commented 4 years ago

Overview

We need to create a data cleaning pipeline that takes in raw input data from the Socrata API and updates the AWS database with the correctly formatted geospatial data
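A minimal sketch of what one cleaning step could look like, offered only as an illustration. It assumes each raw record arrives as a dict with string `latitude` and `longitude` fields already in WGS84 degrees; those field names and the function name `to_geojson_feature` are hypothetical, and if the source data uses a projected coordinate system a reprojection step would be needed first.

```python
from typing import Any, Dict, Optional


def to_geojson_feature(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Convert one raw record into a GeoJSON Feature, or None if unusable.

    Assumption: coordinates live in "latitude"/"longitude" fields (hypothetical
    names) and are expressed in WGS84 degrees.
    """
    try:
        lat = float(record["latitude"])
        lon = float(record["longitude"])
    except (KeyError, TypeError, ValueError):
        # Missing or non-numeric coordinates: the record cannot be mapped.
        return None

    # Reject placeholder or out-of-range values (this also rejects NaN).
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None

    # Keep every other field as a property of the feature.
    properties = {
        key: value
        for key, value in record.items()
        if key not in ("latitude", "longitude")
    }
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }
```

Records that return `None` could be counted and logged rather than silently dropped, so that the share of unusable rows stays visible.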

Action items

Resources/Instructions

ExperimentsInHonesty commented 4 years ago

@gregpawin Please provide an update

  1. Progress
  2. Blockers
  3. Availability
  4. ETA

gregpawin commented 4 years ago

  1. Progress: Created preprocess.py. Still needs work.
  2. Blockers: Need to figure out how to implement in AWS Glue. Also, need to finish car/aliases
  3. Availability: Couple hours/week
  4. ETA: 1-2 weeks

gregpawin commented 4 years ago

  1. Progress: Created Lambda function to download whole dataset and created Glue table but stopped before doing ETL
  2. Blockers: Maybe this isn't necessary. Need to discuss next project redesign with PM
  3. Availability: Couple hours/week
  4. ETA: 1-2 weeks

gregpawin commented 3 years ago

Cleaned data can be created via the make data command on the citation analysis branch.

gregpawin commented 3 years ago

Reevaluating how often data needs to be kept up to date.

simzou commented 3 years ago

Was wondering about the status of this. The most recent citations I see in the database are from April 1, 2021. I think that's plenty of data to work with for now but the link to the preprocess.py script above is broken and I was wondering if we could put the existing data processing code somewhere and document its progress/usage.

tmlin1 commented 2 years ago

@gregpawin This issue has not had an update since 8/3/21. If you are no longer working on this issue, please let us know. We would appreciate any closing comments on why work on this issue stopped, as well as any other notes that never got added to the issue. If you are still working on the issue, please provide an update using these guidelines:

  1. Progress: "What is the current status of your project? What have you completed and what is left to do?"
  2. Blockers: "Difficulties or errors encountered."
  3. Availability: "How much time will you have this week to work on this issue?"
  4. ETA: "When do you expect this issue to be completed?"
  5. Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."

gordonruby commented 2 years ago

This issue is a DRAFT for now, but anyone can update the sections based on the format below, especially the Overview section. Once we know what needs to be done and why, we can decide whether to prioritize this issue.

Dependencies

ANY ISSUE NUMBERS THAT ARE BLOCKERS OR OTHER REASONS WHY THIS WOULD LIVE IN THE ICEBOX

Overview

WE NEED TO DO X FOR Y REASON

Action Items

A STEP BY STEP LIST OF ALL THE TASK ITEMS THAT YOU CAN THINK OF NOW EXAMPLES INCLUDE: Research, reporting, etc.

Resources/Instructions

REPLACE THIS TEXT -If there is a website which has documentation that helps with this issue provide the link(s) here.

gregpawin commented 1 year ago

  1. Progress: Finished setting up IAM roles and permissions for the AWS Glue job/role.
  2. Blockers: Taking time to learn how AWS Glue works, i.e. writing custom transforms in Python.
  3. Availability: Will set aside at least 2 hours to work on it.
  4. ETA: I think I can have a beta version up in a week.
  5. Pictures (if necessary): image

gregpawin commented 1 year ago

  1. Progress: Still learning PySpark. Applied custom mapping, using the visual editor to create boilerplate code.
  2. Blockers: Learning PySpark
  3. Availability: Will work on it more over the weekend.
  4. ETA: I hope by next week.

gregpawin commented 1 year ago

  1. Progress: Created a DynamoDB table. Discussing with Glen whether we want to go with DynamoDB or with MongoDB on EC2 instead. It might also be good to have an API built in to interact with the DB.
  2. Blockers: Working on custom transforms and discussing design with dev team
  3. Availability: Will work on it more over the weekend.
  4. ETA: I hope by next week.

gregpawin commented 1 year ago

  1. Progress: Created script to find last updated date from API. Created a Lambda to download the latest CSV and upload to S3 bucket.
  2. Blockers: Working on custom transforms and discussing design with dev team
  3. Availability: Will work on it more over the weekend.
  4. ETA: I hope by next week.
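For anyone picking this up, here is a rough sketch of how a "last updated" check could gate the download. It is not the script mentioned above. It assumes the dataset metadata has already been fetched from the API and parsed into a dict, and that the dict carries a `rowsUpdatedAt` field in Unix epoch seconds; the function names and the S3 key layout are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def last_updated(metadata: Dict[str, Any]) -> datetime:
    """Return the dataset's last row update as a timezone-aware UTC datetime.

    Assumption: the metadata has a "rowsUpdatedAt" value in epoch seconds.
    """
    return datetime.fromtimestamp(int(metadata["rowsUpdatedAt"]), tz=timezone.utc)


def needs_refresh(metadata: Dict[str, Any], last_ingested: Optional[datetime]) -> bool:
    """Decide whether a new download is warranted.

    last_ingested is the update timestamp recorded at the previous ingest,
    or None if nothing has been ingested yet.
    """
    if last_ingested is None:
        return True
    return last_updated(metadata) > last_ingested


def s3_key_for(updated: datetime) -> str:
    """Build a dated object key for the raw CSV (hypothetical layout)."""
    return f"raw/{updated:%Y-%m-%d}/citations.csv"
```

A Lambda handler could call `needs_refresh` first and only download and upload the CSV when it returns `True`, storing the new timestamp afterwards so the next run has something to compare against.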

gregpawin commented 1 year ago

  1. Progress: Met with dev lead to decide on database technology. We will go with MongoDB rather than DynamoDB to take advantage of its geospatial functions.
  2. Blockers: None
  3. Availability: A few hours this week
  4. ETA: I hope by this week.
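To illustrate the kind of geospatial function this refers to, here is a small sketch of a MongoDB proximity query built as a plain Python dict. It assumes each document stores a GeoJSON Point in a field called `location` and that the collection has a 2dsphere index on that field; the field name and the helper `near_query` are hypothetical, not part of the project.

```python
from typing import Any, Dict


def near_query(
    lon: float, lat: float, max_meters: float, field: str = "location"
) -> Dict[str, Any]:
    """Build a $near filter for documents within max_meters of a point.

    Assumption: `field` holds a GeoJSON Point and has a 2dsphere index, which
    with pymongo could be created via
    collection.create_index([(field, "2dsphere")]).
    """
    return {
        field: {
            "$near": {
                # GeoJSON orders coordinates as [longitude, latitude].
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                # With a GeoJSON point, $maxDistance is expressed in meters.
                "$maxDistance": max_meters,
            }
        }
    }
```

The resulting dict could be passed to `collection.find(...)`, for example `near_query(-118.25, 34.05, 500)` for citations within roughly 500 meters of downtown Los Angeles. Results from `$near` come back sorted from nearest to farthest.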