hackforla / 311-data

Empowering Neighborhood Associations to improve the analysis of their initiatives using 311 data
https://hackforla.github.io/311-data/
GNU General Public License v3.0
60 stars 61 forks

Identify addresses with significantly more 311 requests #1279

Closed nichhk closed 1 year ago

nichhk commented 2 years ago

Overview

This can be very useful information for NCs and city agencies. Basically, we can identify addresses or small areas that could benefit from more signage, increased community assistance, or other actions.

This was actually one of the original goals of 311 Data (see Use Case Feasibility Report).

[Update 12/05/22] In progress HERE:

Action Items

joshuayhwu commented 2 years ago

At least at the NC level, we have a visualization of the total number of requests over the years; see the bottom of the dashboard here. I can take a stab at using a clustering algorithm to further identify smaller regions.

nichhk commented 2 years ago

Thanks Josh! Yes, ideally, I think we'd want to get as granular as address-level, and then one notch above that, block-level. I think an individual NC would like to see if, e.g., 50% of their NC's 311 requests are coming from a single address.

joshuayhwu commented 2 years ago

Power BI Demo: [image]

Next Steps:

nichhk commented 1 year ago

Apparently we have an API endpoint that can produce "hotspots", see #1034. I'm not sure if this is helpful, or changes how we do things, but it's worth looking into.

joshuayhwu commented 1 year ago

The API uses a clustering algorithm to identify hotspots. That would definitely be useful if we want to implement this as a future feature, but it's not that useful for analysis purposes.

I wrote a quick function that rounds each longitude/latitude pair to 2 decimal places and counts the number of requests per rounded pair in a neighborhood council. We can run this function on the 311 requests available for every year since 2016. I can compute some basic metrics like year-over-year and quarter-over-quarter comparisons of request counts, but I'll focus on bulky items, homeless encampments, and graffiti.

See function below:

def generate_hotspot_dataframe(df):
    """Generates the hotspots of each NC by the number of 311 requests.

    This function takes in a raw LA 311 requests dataframe and aggregates
    requests by their longitude and latitude, rounded to 2 decimal places,
    within each neighborhood council.

    Args:
        df: raw LA 311 requests for any year.

    Returns:
        An aggregated 311 request dataframe that contains the count of 311
        requests per long/lat pair in each neighborhood council.
    """
    print("* Rounding requests Long/Lat to 2 Decimal Places")
    df = df.copy()  # avoid mutating the caller's dataframe
    df['lat_2dp'] = df['Latitude'].round(decimals=2)
    df['long_2dp'] = df['Longitude'].round(decimals=2)

    print("* Aggregating dataframes")
    # Count requests per rounded lat/long pair within each NC.
    final_df = (df.groupby(['NCName', 'lat_2dp', 'long_2dp'], as_index=False)['SRNumber']
                  .count()
                  .sort_values(['NCName', 'SRNumber'])
                  .reset_index(drop=True))
    return final_df
nichhk commented 1 year ago

I'm not sure if two decimal places is small enough--1 degree of latitude/longitude is 69 miles, so two decimal places would be 0.69 miles, which is quite considerable. We can fine tune the number of decimal places as necessary.
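The back-of-envelope math here can be written out directly. A rough sketch (assumes ~69 miles per degree of latitude; longitude degrees are shorter away from the equator, so this is an upper bound):

```python
# Approximate cell size produced by rounding lat/long to `dp` decimal places.
# Assumes ~69 miles per degree of latitude (longitude degrees shrink with
# latitude, ~57 miles at LA, so treat this as an upper bound).
MILES_PER_DEGREE = 69.0

def cell_size_miles(dp):
    """Side length in miles of a grid cell at `dp` decimal places."""
    return MILES_PER_DEGREE * 10 ** -dp

for dp in range(1, 5):
    print(f"{dp} decimal place(s): ~{cell_size_miles(dp):.4f} miles")
```

So 2 decimal places gives ~0.69-mile cells, and each additional decimal place shrinks the cell side by a factor of 10.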

Those target request types look good to me! I would also add illegal dumping and animal remains. Both are issues that might be concentrated in certain areas, and could be addressed with additional signage.

joshuayhwu commented 1 year ago

Thanks for the review!

def generate_hotspot_dataframe(df, dp, req_type):
    """Generates the hotspots of each NC by the number of 311 requests.

    This function takes in a raw LA 311 requests dataframe, filters it to the
    "req_type" request type, and aggregates requests by their longitude and
    latitude, rounded to 'dp' decimal places, within each neighborhood council.

    Args:
        df: a pandas dataframe with raw LA 311 requests for any year.
        dp: an integer for the number of decimal places to round the lat/long to.
        req_type: a string for the request type to filter the dataframe by.

    Returns:
        An aggregated 311 request dataframe that contains the count of 311
        requests per long/lat pair in each neighborhood council.
    """
    print("* Filtering dataframe by " + req_type)
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below.
    df = df[df['RequestType'] == req_type].copy()

    print("* Rounding requests Long/Lat to " + str(dp) + " Decimal Places")
    # Column names are kept as *_2dp for compatibility, even though dp may vary.
    df['lat_2dp'] = df['Latitude'].round(decimals=dp)
    df['long_2dp'] = df['Longitude'].round(decimals=dp)

    print("* Aggregating dataframes")
    final_df = (df.groupby(['NCName', 'lat_2dp', 'long_2dp'], as_index=False)['SRNumber']
                  .count()
                  .sort_values(['NCName', 'SRNumber'])
                  .reset_index(drop=True))
    return final_df

req_type_lst = ['Graffiti Removal', 'Bulky Items', 'Homeless Encampment', 'Dead Animal Removal', 'Illegal Dumping Pickup']
for r in req_type_lst:
    final_df = generate_hotspot_dataframe(df, 2, r)
    final_df.to_csv("311_2020_Hotspot_" + r + ".csv")

Really rough function that generates a corresponding dataframe for each request type. Still using 2 decimal places right now, but that can be fine-tuned. Next step is to figure out a way to present this, or just send the list as-is.
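One presentation option: rank the aggregated cells and keep only the top few per council. A minimal sketch, assuming the column layout produced by generate_hotspot_dataframe (the `demo` dataframe below is hypothetical):

```python
import pandas as pd

def top_hotspots(agg_df, n=5):
    """Return the n highest-count lat/long cells per neighborhood council.

    Assumes `agg_df` is the output of generate_hotspot_dataframe: one row per
    (NCName, lat_2dp, long_2dp) with 'SRNumber' holding the request count.
    """
    return (agg_df.sort_values('SRNumber', ascending=False)
                  .groupby('NCName', as_index=False)
                  .head(n))

# Hypothetical mini example:
demo = pd.DataFrame({
    'NCName': ['A', 'A', 'A', 'B'],
    'lat_2dp': [34.05, 34.06, 34.07, 33.99],
    'long_2dp': [-118.24, -118.25, -118.26, -118.30],
    'SRNumber': [50, 3, 7, 12],
})
print(top_hotspots(demo, n=2))
```

Sorting before `groupby(...).head(n)` keeps the rows in descending count order within each council, which makes the output readable as-is.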

ajmachado42 commented 1 year ago

Hey Josh and Nich. I started digging in a little to familiarize myself with the 311 data around locations and request types; I'll bring questions from this initial exploration to the project meeting. I think we could do some clustering on past data to predict the types of requests in the different granular areas and help allocate resources, but I need to figure out how to make API calls to collect enough historical data, and also how to create new features for granular location. The API call I used only returns up to 1,000 records, which is another question I was going to bring to the project call.

Here's where I'm storing all my code. https://github.com/ajmachado42/Hack-for-LA-311-Data

nichhk commented 1 year ago

Hey Dri, thanks for taking a look at this! To get all the requests for a certain date range, you can use this tool. Feel free to reach out to @priyakalyan if you have any questions about using it.

Re: the clustering: not sure if you saw this already, but we already have one implementation that does this. Please take a look and see if it looks useful to you.

Btw, if you're blocked on anything, feel free to reach out to us on Slack or write out your questions here on GitHub. It can be a pain to write them out, but we want to help our teammates to be productive throughout the week!

ajmachado42 commented 1 year ago

Thanks Nich! I'll definitely use this API code and take a look at the clustering!

ajmachado42 commented 1 year ago

I made some pretty decent headway on the EDA and identifying hot spots by neighborhood council and address in this notebook.

I'm still figuring out how to break LA into small hot-spot chunks and then map the data points there, but I started going down a rabbit hole about geopandas, so the research is taking a little longer than I thought it would.

Some points for tomorrow's meeting (09/28/22):

  1. Size of each area to look at (each lat/lon degree is about 69 miles; rounding to hundredths would give ~0.69-mile cells)
  2. Should "hot spots" only include addresses that have multiple requests? A lot of requests are one-offs for bulky item pickups. When you break it down, graffiti becomes the number one offender for repeat requests.
  3. The API maxes out at 20,000 requests -- only the date range 09/17/22-09/23/22 could be pulled
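Point 2 above can be prototyped with a one-line filter. A sketch, assuming an aggregated dataframe with one row per location cell and the request count in an 'SRNumber' column (the `demo` dataframe is hypothetical):

```python
import pandas as pd

def repeat_request_hotspots(agg_df, min_requests=2):
    """Keep only location cells with at least `min_requests` 311 requests.

    Drops one-off requests (e.g. single bulky item pickups) so that "hot
    spots" only include locations with repeat activity. Assumes one row per
    location cell, with the count in an 'SRNumber' column.
    """
    return agg_df[agg_df['SRNumber'] >= min_requests].reset_index(drop=True)

# Hypothetical mini example: the single-request cell is dropped.
demo = pd.DataFrame({
    'NCName': ['A', 'A', 'B'],
    'SRNumber': [1, 5, 2],
})
print(repeat_request_hotspots(demo))
```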
joshuayhwu commented 1 year ago

@ajmachado42 Thanks so much for the comprehensive update! The notebook is very clear and comprehensive.

  1. I like the idea of breaking them into hundredths. I initially used 2 decimal places for each lat/lon but figured it would not be granular enough. It would be great to see the distribution of counts after you break them down into 0.69-mile cells. If there are too many "hot spot blocks", we can take a larger block size.
  2. As per our discussion during our meeting, I think the >=2 requests cutoff makes sense. At the same time, I'd suggest checking LA's weekly/monthly/yearly NC request count average and treating that as the decision rule. Ultimately, we want something actionable that makes an impact. If there are not that many requests, we can't really do much about them, as they're likely due to random chance / one-offs.
  3. Hmm, is that the case with the get_request_tool? I'd just use the 2021 LA 311 dataset and download it as a CSV instead. I can take a look at the API.
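One way to operationalize the decision rule in point 2: use the average cell count across the aggregated data as the cutoff rather than a fixed >=2. This is only a sketch of that idea, assuming the aggregated layout from generate_hotspot_dataframe (the `demo` dataframe is hypothetical):

```python
import pandas as pd

def hotspots_above_average(agg_df):
    """Flag cells whose request count exceeds the overall per-cell average.

    Instead of a fixed >=2 cutoff, treat the mean cell count across all NCs
    as the bar a location must clear to count as actionable. Assumes one row
    per location cell with the count in an 'SRNumber' column.
    """
    threshold = agg_df['SRNumber'].mean()
    return agg_df[agg_df['SRNumber'] > threshold], threshold

# Hypothetical mini example: mean is 4.0, so only the 9-request cell remains.
demo = pd.DataFrame({'NCName': ['A', 'A', 'B', 'B'],
                     'SRNumber': [1, 9, 2, 4]})
hot, thr = hotspots_above_average(demo)
print(thr, hot['SRNumber'].tolist())
```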

Once again, thanks so much for your hard work - Let me know what you think!

ajmachado42 commented 1 year ago

@joshuayhwu Thank you, Josh! I'll work on this this week.

Anupriya shared some Census resources for mapping files that break LA into the official city blocks, and I think she and Nich fixed the API bug after the meeting. I'm going to be visiting family in Florida this week but will have time to update my notebook with the full-year dataset and start doing some geospatial analysis as well.

ajmachado42 commented 1 year ago

Geospatial Analysis

Clustering

https://github.com/ajmachado42/Hack-for-LA-311-Data/tree/master/I-1279

joshuayhwu commented 1 year ago

@ajmachado42 Thanks so much for the comprehensive updates - really appreciate the documentation on the notebooks!

Geospatial Analysis:

Clustering:

ajmachado42 commented 1 year ago

@joshuayhwu I updated the visualization notebook so it's broken up more. Github still won't render the folium maps though.

This is my Drive link for it which has all the datasets, etc. Let me know if that works! (I was able to create a layered map by type in the nc_only notebook.) https://drive.google.com/drive/folders/1njMKXLcs6CSgcZ_Gs9Fwxr6Iq2Wro45m?usp=sharing

Noted about clustering. Once I finish getting the maps and block data set to a good spot then I'll shift to focusing on the cluster analysis more.

joshuayhwu commented 1 year ago

@ajmachado42 thanks for breaking it up! Notebook looks good and I really appreciate the comments!

I can take a look at the app and see how to render it if that's your only blocker. Otherwise, happy to check in on other blockers. Let me know which area you want most help with. Thanks for your hard work this week!

ajmachado42 commented 1 year ago
mc759 commented 1 year ago

Hey @ajmachado42 and @joshuayhwu, Do you have an update for us on this issue?

Please update:

Thanks!

ajmachado42 commented 1 year ago

Hey @mc759

Progress:

Blockers:

Availability:

ETA:


ajmachado42 commented 1 year ago

Moving this one to closed after discussing with Josh. The repo has lots of templates for analyses (statistical and geospatial) and a mini program that generates a report adding census block IDs to each request based on the request's address. Feel free to reach out to me if you need anything!