LAANE Data Cleaning and Analysis - Githubissues

hackforla / data-science

The Hack For LA Data Science team is a Community of Practice within the LA brigade seeking to make analytical and machine learning services available to local communities and organizations.


LAANE Data Cleaning and Analysis #36

Closed ryanmswan closed 2 years ago

ryanmswan commented 3 years ago

Data Scientist

Project Name: LA Alliance for a New Economy, Housing Project

Volunteer Opportunity: Assist a local nonprofit by joining several datasets that rely on physical street address as the primary identifier. After joining the datasets, map the density of Airbnb rental properties by approximate region, in relation to properties cited for complaints, to identify potential "party houses" that may not comply with local ordinances.

Duration: 4-6 weeks

Who to communicate your interest to

Slack name of person to contact in the channel: Sophia Alice or Ryan Swan or post in [#data-science]

Primary Stakeholder: LAANE

Currently Staffed: @albertulysses @karinalopez

Resources

Albert's video on using scripts he wrote for project

KarinaLopez19 commented 3 years ago

Just a quick update:

Albert and I are working to combine the Airbnb listing dataset with the permit dataset from the City of LA. Some complications we are handling include:

* The listing and permit datasets share no columns.
* Listings provide only approximate coordinates, while the permit data provides full addresses.

Currently, we are trying to see if there are "obvious" citation breakers or followers that we can filter out immediately. Additionally, we are writing functions and a script to potentially merge the permit dataset with listing dataset using coordinate and address data in a single neighborhood. Once we are able to have a working script, we can apply it to the full datasets.
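As a rough illustration of the coordinate-based matching described above, here is a minimal sketch. It assumes the permit addresses have already been geocoded to latitude/longitude, and it is not the script the team wrote. The function names, the dictionary keys (`lat`, `lon`, `id`), the 150 m radius, and the sample coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def candidate_permits(listing, permits, max_distance_m=150):
    """Return the permits within max_distance_m of a listing, nearest first.

    Listing coordinates are only approximate, so this yields candidates
    for review rather than a definitive match.
    """
    scored = []
    for permit in permits:
        d = haversine_m(listing["lat"], listing["lon"],
                        permit["lat"], permit["lon"])
        if d <= max_distance_m:
            scored.append((d, permit))
    scored.sort(key=lambda pair: pair[0])
    return [permit for _, permit in scored]


# Made-up coordinates, for illustration only.
listing = {"id": "L1", "lat": 34.0, "lon": -118.0}
permits = [
    {"id": "A", "lat": 34.001, "lon": -118.0},
    {"id": "B", "lat": 34.0005, "lon": -118.0},
    {"id": "C", "lat": 34.01, "lon": -118.0},
]
nearby_ids = [p["id"] for p in candidate_permits(listing, permits)]
```

Restricting the comparison to a single neighborhood, as described above, keeps the number of listing-permit pairs small enough for a plain loop like this one.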

AlbertUlysses commented 3 years ago

I added a new folder "airbnblistings" where Karina and I can upload our files related to the project, and I created two functions that are going to help with data cleaning. The plan for now is to create a foundation for the project by cleaning the data and creating a SQLite database.

KarinaLopez19 commented 3 years ago

Cleaned up the City of LA registrant dataset, merging all sheets into a single CSV file and revising mistyped entries. Next steps would include creating geolocations for each address and connecting registrant entries with Airbnb listing host IDs.
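Because the datasets are keyed on street address, differently typed versions of one address need to compare equal before any join or geocoding. The sketch below shows one possible normalization step. It is an assumption for illustration, not the cleaning code used in the project, and the abbreviation tables are hypothetical and far from complete.

```python
import re

# Hypothetical abbreviation tables; a real cleaning pass would need many more entries.
SUFFIXES = {"street": "st", "avenue": "ave", "boulevard": "blvd",
            "drive": "dr", "place": "pl"}
DIRECTIONS = {"north": "n", "south": "s", "east": "e", "west": "w"}


def normalize_address(raw):
    """Lowercase, drop punctuation, and abbreviate common words."""
    # Replace anything that is not a word character, whitespace, or '#'
    # (kept for unit numbers) with a space.
    text = re.sub(r"[^\w\s#]", " ", raw.lower())
    tokens = [DIRECTIONS.get(t, SUFFIXES.get(t, t)) for t in text.split()]
    return " ".join(tokens)


# Two spellings of the same made-up address normalize to one key.
key_a = normalize_address("123 North Main Street.")
key_b = normalize_address("123 N. MAIN ST")
```

A rule this simple will mishandle streets whose name is itself a direction or suffix word, so its output is best treated as a join key to review rather than a corrected address.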

KarinaLopez19 commented 3 years ago

Met with Jon to do final review of columns to keep vs. remove. Jon is onboard with having a SQLite database and is willing to learn SQL. Currently working on designing the SQLite database with Albert.
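To give a sense of what a small SQLite database with a starter query could look like, here is a sketch using Python's built-in `sqlite3` module. The table names, columns, and rows are invented for illustration and do not reflect the schema the team designed.

```python
import sqlite3

# In-memory database for the example; a file path would be used in practice.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE listing (
    listing_id   INTEGER PRIMARY KEY,
    host_id      INTEGER NOT NULL,
    neighborhood TEXT NOT NULL
);
CREATE TABLE complaint (
    complaint_id INTEGER PRIMARY KEY,
    address      TEXT NOT NULL,
    neighborhood TEXT NOT NULL
);
""")

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO listing (host_id, neighborhood) VALUES (?, ?)",
    [(1, "Venice"), (1, "Venice"), (2, "Hollywood")],
)
conn.executemany(
    "INSERT INTO complaint (address, neighborhood) VALUES (?, ?)",
    [("123 n main st", "Venice")],
)

# Starter query: listings and complaints per neighborhood.
query = """
SELECT l.neighborhood,
       COUNT(DISTINCT l.listing_id)   AS listings,
       COUNT(DISTINCT c.complaint_id) AS complaints
FROM listing AS l
LEFT JOIN complaint AS c ON c.neighborhood = l.neighborhood
GROUP BY l.neighborhood
ORDER BY l.neighborhood
"""
rows = conn.execute(query).fetchall()
```

The `LEFT JOIN` keeps neighborhoods that have listings but no complaints, and `COUNT(DISTINCT ...)` avoids double counting when one neighborhood has several rows on both sides of the join.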

AlbertUlysses commented 3 years ago

The first iteration of the design is complete. I will make some adjustments, then start writing the scripts to transform the data.

KarinaLopez19 commented 3 years ago

Jon has been updated on progress and Albert and I will be meeting tomorrow to discuss next steps.

KarinaLopez19 commented 3 years ago

Organized a sheet containing new table and column names following data warehouse schema to facilitate data transformation. Beginning data dictionary to pass along to Jon following completion of data warehouse. I will be taking a short break from 06/26-07/06 and have communicated this to Jon + Albert. Albert and I plan to complete data transformation scripts starting 07/07.

AlbertUlysses commented 3 years ago

Here are some updates/milestones.

For the next week and a half, Karina and I will be taking a break - we'll be meeting again starting on July 7th. I'll still be around if anyone needs anything, so feel free to email me.

Milestones will be as follows:

* By the end of next week (July 2nd), I'll have a finalized version of our data warehouse layout.
* From July 7th until approximately two weeks later (July 21st), we'll be working on the code for our first tangible deliverable - an SQLite DB with some starter scripts. The starter scripts will answer some of the questions Jon sent earlier this year.
* After this, we will be asking for some follow-up meetings to begin creating a software program where Jon can add new data to the database as it comes in. This next phase doesn't have an exact timeline, but we will most likely start meetings in the last week of July (26th-30th). 
* The last week of July will also be our goal to have something written up about the project for Hack for LA and their Medium blog. The blog post will summarize the work we did, the problem(s), and how we solved it/them. If there is anything that anyone would like in it or off, let me know. Our post will also be a collaborative effort, so I'll reach out with rough drafts before publishing anything.

AlbertUlysses commented 3 years ago

Working through the assessor data, we made some ERD modifications and discussed how to handle unique entries. Notes and a rough draft of the blog post have been started, but we are not prioritizing them because we are trying to focus our extra time on the first deliverable.

AlbertUlysses commented 3 years ago

Quick update: reworked the project layout and added assessor transformations along with tests. I'm going to try to work on several more data sources by the end of the week and need to do some refactoring. By then I'll have a better idea of a new time frame for the first deliverable. I will email Jon and include everyone by the end of the week.

AlbertUlysses commented 3 years ago

Hey @ryanmswan @salice , here is a rough idea about what the hackforla blog post will be about. Let me know if this is more or less the idea you had or if there is some other angle I should include:

Blog post idea:
The blog post will mostly focus on what lessons I learned from the project.
Things I learned are:
How to optimize and test Pandas code - I had only used Pandas here and there, since I mostly use PySpark. It was nice to learn about apply versus list comprehension versus str method chaining and the speed each gives. I also learned about the assert_series_equal/assert_frame_equal testing functions.
I gained a lot from discussing the project with my teammate Karina and talking to Jon about what he needed.
If there are any extra ideas, let me know, and I'll see if I can add them in.
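For readers unfamiliar with the three Pandas styles mentioned in the blog post idea, here is a small sketch that applies the same cleaning step each way and checks the results with `pandas.testing.assert_series_equal`. The sample values are made up, and the example makes no claim about which style is fastest, since that depends on the data and the Pandas version.

```python
import pandas as pd
from pandas.testing import assert_series_equal

# Made-up address strings with stray whitespace.
addresses = pd.Series(["123 Main St ", " 45 Oak Ave", "9 Pine Blvd"])

# 1. Series.apply with a Python function called once per element.
via_apply = addresses.apply(lambda s: s.strip().upper())

# 2. A list comprehension, wrapped back into a Series with the same index.
via_comprehension = pd.Series(
    [s.strip().upper() for s in addresses], index=addresses.index
)

# 3. Chained .str accessor methods.
via_str_methods = addresses.str.strip().str.upper()

# All three should hold the same values. check_dtype=False compares values
# only, because default string dtypes differ between Pandas versions.
assert_series_equal(via_apply, via_str_methods, check_dtype=False)
assert_series_equal(via_comprehension, via_str_methods, check_dtype=False)
```

Timing each variant on a realistic sample (for example with `timeit`) is the reliable way to pick one for a given transformation.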

AlbertUlysses commented 3 years ago

@ryanmswan @salice Most of the transformations are complete - there is one data source that Karina was working on, and I'm waiting for that to be uploaded. Next week I'll be writing the code for entering the data into the SQLite database using SQLAlchemy.

AlbertUlysses commented 3 years ago

I added the SQLAlchemy code. I still need to add the Airbnb tables to that file, and it might need some refactoring to reflect the database relationships. After that, I'll work solely on the main program, which will go through all the files and insert them into the database.

AlbertUlysses commented 3 years ago

Airbnb tables are now in the SQLAlchemy file.

AlbertUlysses commented 3 years ago

Quick update. We have data in the database. It's not all of it, but it's a start! I have to do some refactoring because the code I wrote is hideous, but all tests are passing. At this point, I think I'll have the database done by the end of the week - worst-case scenario next week (if I get real busy at work). I'm still looking for some feedback on what I need to write for the blog - we can discuss it this Thursday. @ryanmswan @salice

AlbertUlysses commented 3 years ago

Here's a quick update on where the project is. 8 out of 19 tables have all the data inserted, and I'll be working on the final 11 next week. I have messaged Jon and hope to meet with @salice and @ryanmswan on the week of September 6 to discuss the next steps and misc questions about storing the data.

AlbertUlysses commented 3 years ago

Update: added more scripts. Messaged Jon and he's available sometime next week to discuss the project's next steps. 10/19 tables are complete.

AlbertUlysses commented 3 years ago

Quick update, all but one dataset are complete. I'll be working on the last one the rest of the week, and I intend to deliver the completed dataset the following Monday.

AlbertUlysses commented 3 years ago

@ryanmswan @salice @KarinaLopez19 All of the scripts are complete. Next steps will be writing some documentation and handing the project over. I'll be making a call next week with everyone!

AlbertUlysses commented 3 years ago

@KarinaLopez19 @ryanmswan @salice Hey, I just finished the README. I think that's going to be one of my last git commits for a while. I asked Karina to check it out and let me know if there are any questions so I can clarify. Ryan and Sophia, feel free to check out the code as well.

akhaleghi commented 3 years ago

Video on using scripts, courtesy of Albert and Karina

ExperimentsInHonesty commented 3 years ago

We should do a cleanup of this issue to summarize any info we need to keep from the comments into the top part.

akhaleghi commented 2 years ago

@KarinaLopez19 we want to add a "size" label to this, just so we can keep track of the number of hours it took. Do you have an estimate of how much time you and Albert spent on this, and how much more there is to go?

akhaleghi commented 2 years ago

Hey @AlbertUlysses and @KarinaLopez19, please provide any appropriate updates on this issue in the comments, since we haven't had anything documented here for a few months now.

* Progress: "What is the current status of your project? What have you completed and what is left to do?"
* Blockers: "Difficulties or errors encountered."
* Availability: "How much time will you have this week to work on this issue?"
* ETA: "When do you expect this issue to be completed?"
* Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."

AlbertUlysses commented 2 years ago

Hi @akhaleghi I don't have any updates - from our last conversations this is the way the project should move forward:

* Karina -> add data dict
* Team -> reach out to LAANE to see if someone else on our team can do some analysis - last we spoke, Karina suggested that she might take this on.

I don't know Karina's status, but if she's no longer contributing, maybe someone else is interested in taking the data and analyzing it? It definitely could be a good learning experience.

mcmorgan27 commented 2 years ago


I was going to check this out. I've looked at the code but I can't find the data? I see a long list of datasets... are they all open data? Is there a repository with the csv files? I suspect I can find many of them, but ...

Is it better for me to ask these questions on the github issue or here?

Let me know.

mcm


AlbertUlysses commented 2 years ago

@mcmorgan27 Data should be in S3 - it's not open data - it's data from LAANE. I don't know where specifically it's stored - I believe Sophia was the person in charge of storing that data, but I could be wrong. Also, I believe the data wasn't stored more publicly because LAANE had some reservations in that regard.

mcmorgan27 commented 2 years ago

Thanks. Sounds like data issues, so I'll hold off.


ExperimentsInHonesty commented 2 years ago

We are going to close this issue. It can be reopened or referred to if the stakeholder reaches back out to us. He has not responded to our requests for additional engagement.