Open · gregpawin opened this issue 4 years ago
@gregpawin Please provide an update
Cleaned data can be created via the `make data` command on the citation analysis branch.
Reevaluating how often data needs to be kept up to date.
Was wondering about the status of this. The most recent citations I see in the database are from April 1, 2021. I think that's plenty of data to work with for now, but the link to the preprocess.py script above is broken. Could we put the existing data processing code somewhere and document its progress/usage?
@gregpawin This issue has not had an update since 8/3/21. If you are no longer working on this issue, please let us know. If you can share any closing comments on why work on this issue stopped, or any notes that never got added to the issue, we would appreciate it. If you are still working on the issue, please provide an update using these guidelines.
This issue is a DRAFT for now, but anyone can update the sections based on the format below, especially the Overview section. Once we know what needs to be done and why, we can prioritize whether to work on this issue.
ANY ISSUE NUMBERS THAT ARE BLOCKERS OR OTHER REASONS WHY THIS WOULD LIVE IN THE ICEBOX
WE NEED TO DO X FOR Y REASON
A STEP BY STEP LIST OF ALL THE TASK ITEMS THAT YOU CAN THINK OF NOW EXAMPLES INCLUDE: Research, reporting, etc.
REPLACE THIS TEXT - If there is a website with documentation that helps with this issue, provide the link(s) here.
Progress: Finished setting up IAM roles and permissions for the AWS Glue job/role.
Blockers: Taking time to learn how AWS Glue works, i.e. writing custom transforms in Python.
Availability: Will set aside at least 2 hours to work on it.
ETA: I think I can have a beta version up in a week.
Pictures (if necessary):
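For reference, the trust-policy half of that IAM setup can be sketched as below. The `glue.amazonaws.com` service principal is standard for Glue; which permissions policy gets attached on top of it (e.g. the `AWSGlueServiceRole` managed policy plus access to the project's S3 bucket) is an assumption about this project's needs, not something confirmed in the issue.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "glue.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```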
Progress: Still learning PySpark. Applied custom mapping, using the visual editor to create boilerplate code.
Blockers: Learning PySpark.
Availability: Will work on it more over the weekend.
ETA: I hope by next week.
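A minimal sketch of the kind of record-level mapping such a custom transform would apply. The field names (`ticket_number`, `violation_code`, `latitude`, `longitude`) are hypothetical stand-ins for the real citation columns, and this plain-Python function is only the per-record logic — in Glue it would run inside the function the visual editor generates (e.g. via `DynamicFrame.map`).

```python
# Hypothetical per-record cleaning step for the citation pipeline.
# Field names are assumptions; adjust to the actual raw schema.

def clean_record(rec):
    """Normalize one raw citation row into the target schema.

    Returns None for rows without usable coordinates so downstream
    stages can drop them.
    """
    cleaned = {
        "ticket_number": rec.get("ticket_number", "").strip(),
        "violation_code": rec.get("violation_code", "").strip().upper(),
    }
    # Coordinates arrive as strings from the CSV; reject rows where
    # they are missing or unparseable.
    try:
        cleaned["latitude"] = float(rec["latitude"])
        cleaned["longitude"] = float(rec["longitude"])
    except (KeyError, TypeError, ValueError):
        return None
    return cleaned
```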
Progress: Created a DynamoDB table; discussing with Glen whether to go with DynamoDB or EC2 with MongoDB instead. It might also be good to build in an API to interact with the DB.
Blockers: Working on custom transforms and discussing design with the dev team.
Availability: Will work on it more over the weekend.
ETA: I hope by next week.
Progress: Created a script to find the last-updated date from the API. Created a Lambda to download the latest CSV and upload it to an S3 bucket.
Blockers: Working on custom transforms and discussing design with the dev team.
Availability: Will work on it more over the weekend.
ETA: I hope by next week.
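The last-updated check can likely lean on Socrata's view-metadata endpoint (`/api/views/<dataset-id>.json`), whose response includes a `rowsUpdatedAt` Unix timestamp. A hedged sketch, assuming that key is present for this dataset (verify against the actual response):

```python
from datetime import datetime, timezone

def last_updated(metadata):
    """Extract the dataset's last-update time from Socrata view metadata.

    `metadata` is the parsed JSON from /api/views/<dataset-id>.json;
    'rowsUpdatedAt' is a Unix epoch timestamp in that response.
    """
    return datetime.fromtimestamp(metadata["rowsUpdatedAt"], tz=timezone.utc)
```

The Lambda could compare this value to the timestamp of the last S3 upload and only download the CSV when the dataset is newer.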
Progress: Met with the dev lead to decide on database technology; will go with MongoDB rather than DynamoDB to take advantage of its geospatial functions.
Blockers: None
Availability: A few hours this week
ETA: I hope by this week.
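To illustrate the geospatial functions that motivated the MongoDB choice: documents would store a GeoJSON Point under a 2dsphere index, and queries use the standard `$near`/`$geometry`/`$maxDistance` operators. The `location` field name is a hypothetical schema choice; only the query construction is shown here, so it stays runnable without a live database.

```python
def near_query(lon, lat, max_meters=500):
    """Build a MongoDB $near filter for citations around a point.

    Assumes documents store a GeoJSON Point in a 'location' field
    (hypothetical field name) covered by a 2dsphere index. GeoJSON
    coordinates are [longitude, latitude], in that order.
    """
    return {
        "location": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_meters,
            }
        }
    }
```

With pymongo this would pair with `collection.create_index([("location", "2dsphere")])` and then `collection.find(near_query(-118.24, 34.05))`.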
Overview
We need to create a data cleaning pipeline that takes in raw input data from the Socrata API and updates the AWS database with correctly formatted geospatial data.
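The pipeline's overall shape can be sketched as a fetch → clean → store loop. The function names and the dependency-injection style below are illustrative assumptions, not the project's actual module layout; the point is that each stage (Socrata download, Glue/PySpark transform, MongoDB upsert) can evolve independently.

```python
# Hedged sketch of the pipeline's overall shape; stage implementations
# are injected so this orchestration stays testable in isolation.

def run_pipeline(fetch, clean, store):
    """Fetch raw rows, clean them, and store the survivors.

    fetch() yields raw records, clean(raw) returns a normalized record
    or None to drop the row, and store(rec) persists one record.
    Returns the number of records stored.
    """
    stored = 0
    for raw in fetch():
        rec = clean(raw)
        if rec is not None:
            store(rec)
            stored += 1
    return stored
```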
Action items
Resources/Instructions