Open jeremysze opened 5 years ago
Received traffic signal data from DOT (Roseann Caruana, Customer Service).
Great news on the traffic signals data; that should really make a difference, since I think you really want to limit your control group to signaled intersections that might potentially get an LPIS intervention! Now you can go back and edit your updates post and use a strikethrough on your list :) Use two ~~ on each side of the text.
On the intersections data.. as you might have already discovered, the osmnx library pulls in street network data (from OpenStreetMap) and makes it very easy to identify intersections as nodes (the example notebook does that at one point). It may not be faster or better than your method, but worth a mention.
That way to build a control group sounds like the place to start. Worth also thinking about (as a step toward matching methods but interesting in its own right) is building a simple logit/probit model of how the city might be targeting LPIS locations (e.g. more accident prone, only on two way streets, closer to schools, etc) so as to better understand the intervention.
This is looking really great.
P.S. -- Just glanced at your efforts to pull in data directly via API... looks awesome.
The first two on the list are now ~~scratched out~~ because we will now connect the collisions to the traffic signal intersections. I will begin from scratch to rebuild the datasets, so that they can now "flow" through the notebooks until the panel analysis stage.
I will incorporate the cKDTree spatial example into the distance and ID capture to replace the less efficient method.
Utilized cKDTree! It's super fast! Next: create the "control" intersections by looking within the buffer. Outcomes are also set up.
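For reference, a minimal sketch of the kind of nearest-neighbor lookup cKDTree makes fast; all coordinates below are made up (in the project they would be projected coordinates for collisions and signalized intersections):

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical projected coordinates (e.g., in feet)
signal_xy = np.array([[0.0, 0.0], [3000.0, 0.0], [0.0, 4000.0]])  # signalized intersections
collision_xy = np.array([[100.0, 50.0], [2900.0, 200.0]])         # collision locations

tree = cKDTree(signal_xy)
# For each collision, distance to and index of the nearest signalized intersection
dist, idx = tree.query(collision_xy, k=1)
```

The index array can then be used to capture the matched intersection IDs in one vectorized step.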
> Utilized cKDTree! It's super fast!
Great. I'm curious, how does it compare in speed to doing the same thing via QGIS?
> Next to create the "control" intersections by looking within the buffer. Outcomes are also set up.
Neat.. So how are you doing that? Are you looking for all non-LPIS intersections within a buffer region around each LPIS?
I get the results almost instantaneously; I usually leave QGIS to work on the distances and come back to it later. At the moment it is still looking at those within the buffer. With that I can designate a control group for each LPIS and utilize synthetic control groups or entropy balancing.
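The buffer step itself can also be done with cKDTree's `query_ball_point`; a sketch assuming projected coordinates in feet (all numbers invented):

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical coordinates in feet: LPIS sites and candidate (non-LPIS) signalized intersections
lpis_xy = np.array([[0.0, 0.0], [10000.0, 0.0]])
candidate_xy = np.array([[500.0, 500.0], [1900.0, 0.0], [5000.0, 0.0], [10500.0, 300.0]])

tree = cKDTree(candidate_xy)
# For each LPIS intersection, the indices of candidates within a 2000 ft buffer
controls = tree.query_ball_point(lpis_xy, r=2000.0)
```

Each entry of `controls` is the list of candidate control intersections for one LPIS site.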
I don't think I can take timings in QGIS like you did in your examples.
Do you have any conferences to recommend for abstract submission?
See this list for conferences that welcome student submissions. Several say they are for undergraduates, but MA and PhD students can submit too.
I think I need some help thinking about how to utilize the "control" signal intersections that fall within 2000ft of a LPIS intersection.
At this moment, my data is organized as a panel with intersections, months, and years. When we were exploring the data earlier, we used relative months to compare the LPIS intersections and the control intersections. At that point, we were not able to "set" a switch-on date because LPIS intersections were implemented at different dates. Hence, we spoke about potentially identifying nearby intersections as control intersections.
But LPIS intersections that are close to each other have been implemented on different dates. Hence, when we identify control intersections that are within 2000ft of a LPIS intersection, we will have the same control intersections identified for LPIS 1 and LPIS 2.
Right now I have reshaped the data into a wide format unique by intersection_id, and it has variables of LPIS_IDS.
On another note, I was thinking that since I now have an official list of signalized intersections, I can obtain a list of non-signalized intersections. With this information, I could run some sort of analysis on those intersections to find out if they should get a traffic signal controller. This idea is probably out of scope for my thesis, but it sounds really interesting now that I've thought about it.
Interesting questions. Very brief stream of consciousness responses/ideas now but let me think and read further:
Did you already run a straight diff-in-diff with the control group being all non-LPIS signaled intersections? Just like the regression (that Partha said is all you needed) you'd have run before, I suppose, only that now you know what's signaled and what is not. How does that turn out?
The issue you are flagging must be fairly common in any impact evaluation with a pipeline rollout over time, so I'm guessing people have thought through it. How are you planning to implement this 'matched' control group? As you point out, within a 2000ft buffer there may not be a lot of control intersections, so the selective reduction of the size of the control group could be a problem. Maybe this points to not using a buffer but rather other types of matching methods, drawing from a much larger donor group (and there are other reasons to favor that approach: e.g. the other intersections in the buffer close to, say, a major intersection that just got LPIS may be less similar than other major intersections with similar characteristics drawn from further away).
Another possibly related issue that is interesting to think about is spatial correlation and 'contamination'. If you put in a new LPIS that forces drivers to think a bit more about pedestrians, the same driver may be more cautious than they otherwise might have been when they get to the next corner. That might lead to downward bias in the estimated impact (since accidents go down in nearby controls).
I ran the straight diff-in-diff with the control group being all non-LPIS signaled intersections. The results are similar to what we have seen before.
Outcomes:
- Collision count: significant, downward direction
- Number of persons injured: significant, downward direction
- Number of pedestrians injured: significant, downward direction
- Number of cyclists injured: not significant
- Number of motorists injured: significant, downward direction
I can try to find a list of major intersections and find the distance to each signalized intersection. I think this makes a lot of sense to balance on.
> I ran the straight diff-in-diff with the control group being all non-LPIS signaled intersections. The results are similar to what we have seen before.
Very nice.. One question: is this a straight diff-in-diff with the treatment effect estimated as a fixed additive shift? In other words, are you getting an estimate that says, after LPIS there are (say) 0.19 fewer incidents in a year? Depending on how you specify the outcome variable I'm just concerned about the 'parallel trend' assumption. Standard stuff.
Suppose that there are two types of intersection in the city (H)igh accident intersections that have on average 10 accidents a year and (L) accident intersections that have 1 accident a year. Suppose they were both on a downward trend such that after 1 year we'd expect 10% fewer accidents. Then the H locations would see average accidents fall from 10 to 9 or by 1 accident per year. The L locations would be expected to see the rate fall by 0.1 from 1 to 0.9. Suppose the city now installs LPIS mostly at H sites and that the LPIS in fact has zero effect. A simple difference in difference would estimate a positive impact of the LPIS (as much as 0.9 = 1.0 - 0.1 if we compared LPIS locations to only L locations, somewhat less if the control group is a mix of H and L locations).
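The H/L example above works out numerically as:

```python
# Two intersection types on a common 10% downward trend; LPIS has zero true effect.
h_before, l_before = 10.0, 1.0
h_after, l_after = 9.0, 0.9          # both fall 10%

# Naive diff-in-diff with H sites treated and L sites as controls:
did = (h_after - h_before) - (l_after - l_before)   # (-1.0) - (-0.1) = -0.9
```

So the naive estimator attributes a reduction of 0.9 accidents per year to an intervention with zero true effect, purely because the treated group started from a higher level.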
Taking logs of the outcome (LHS) variable gets around the problem, as you'd then in effect be comparing percent changes (although logs present a problem here, since there are intersections with 0 accidents, so we'd need a workaround). I think you'd thought through some of these issues in choosing other specifications.
The other way to get around this possible bias of course is to work hard to have a well matched control group... which you are also working on.
Right now I will have to check on the parallel trend assumption. My current diff-in-diff is estimated as a fixed additive shift.
Taking the logs of the outcome variable will be an issue because collisions do not occur every month for all intersections.
I hope the DOT replies with what variables they've considered in deciding if a signalized intersection should become a LPIS intersection.
I've also run the spatial regression. Right now it's just with spregress. I need to look more into the interpretation of the spatial weights.
DOT Commissioner Request. Contact Robert Viola at 212-839-7752 or RViola@dot.nyc.gov. Will contact him to learn more about what they use to decide on LPIS intersections.
Might have figured out the issue with the street improvement shapefile. It is a multipoint shape, but geopandas doesn't seem to have a multipart-to-singlepart function yet.
Ran the function in QGIS > Vector > Geometry Tools > Multipart to Singleparts. Will bring this into the model when I get additional information from Robert Viola about how DOT decides on LPIS intersections. Hopefully he replies.
Working on understanding/interpretation of spatial regressions.
So is the issue to convert a 'multipoint' into many single points? Is it your understanding that the points were grouped into a multipoint when they were thought of as a joint intervention? Are the points tagged to indicate the type of intervention within a joint intervention? If they are clearly identified, you could potentially expand (or later extend) your focus to study these joint interventions (e.g. does LPIS plus speed reduction do more than just LPIS?).
As usual, I'm several steps behind you trying to learn some python geospatial methods. It seems one can 'explode' or otherwise iterate through the singleparts in a multipart file with fiona and shapely:
https://gis.stackexchange.com/questions/138163/exploding-multipart-features-in-qgis-using-python
https://gis.stackexchange.com/questions/112095/converting-huge-multipolygon-to-polygons
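From those threads, the shapely side of the 'explode' reduces to iterating over a multipart geometry's members; a minimal sketch (the multipoint here is invented, not a real street improvement record):

```python
from shapely.geometry import MultiPoint

# A hypothetical street-improvement record stored as one multipoint feature
mp = MultiPoint([(0, 0), (1, 1), (2, 2)])

# Multipart -> singleparts: iterate over the member geometries
points = list(mp.geoms)
```

The same `.geoms` iteration works for MultiLineString and MultiPolygon, so it covers the QGIS "Multipart to Singleparts" step in code.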
Yes. It seems that for the street improvements, they have grouped several points into one. I think they have specific coordinates of corner renovations, light replacements, etc., which are grouped into 1 point.
I will try one of these methods so that I can have it down in code.
Robert Viola responded. I will dig into the document to identify the selection criteria.
Hi Jeremy- The vast majority of the LPIs have been installed following the actions of our Borough Pedestrian Safety Action Plans. They all have actions for LPIs, targeted to our Priority Geographies. http://www.nyc.gov/html/dot/html/pedestrians/ped-safety-action-plan.shtml
For example:
Significantly expand exclusive pedestrian crossing time on all Brooklyn Priority Corridors by the end of 2017

High-crash corridors for pedestrians tend to be on wide arterial streets with higher speeds and aggressively turning vehicles. DOT will address these issues by installing Leading Pedestrian Intervals (LPIs) at every feasible school crosswalk on all Brooklyn Priority Corridors. The LPI is a proven method of reducing pedestrian-vehicle conflicts at high pedestrian crash locations; it is a signal timing treatment that provides pedestrian-only walk time before vehicles, including turning vehicles, receive the green light.
Add exclusive pedestrian crossing time to all feasible Brooklyn Priority Intersections by the end of 2017

DOT will install LPIs at every feasible Brooklyn Priority Intersection by the end of 2017. As noted previously, the LPI is a signal timing treatment that provides pedestrian-only walk time before vehicles receive the green light.
Several new variables to control for:
- population density
- wide arterial streets
- midblock crossing
- Borough Priority corridors
- Borough Priority intersections
- Borough Priority areas
- lagged pedestrian KSI (most ped. and cyclist fatalities and severe injuries occur in the same areas of NYC)
- complex intersection (more than 2 roads crossing)

Using pedestrian KSI data from the last five available years (2009-2013), DOT developed a process for selecting Priority Corridors, Priority Intersections, and Priority Areas.

Pedestrian KSI data was employed in this analysis for two reasons. First, a pedestrian who has been severely injured typically departs the crash scene in an ambulance and often experiences life-changing injuries (e.g., loss of mobility, brain function, limbs). A comprehensive street safety program must address these types of pedestrian injuries as well, not just fatalities. Second, severe injuries are more numerous and less randomly dispersed than traffic fatalities. Thus, severe injuries are more useful and reliable in terms of ranking one corridor, intersection, or area.

Priority Corridors

To determine the Priority Corridors, all corridors in the Bronx were ranked on a pedestrian KSI per-mile basis. Corridors were selected from the top of this list until the cumulative number of pedestrian KSI reached half of the borough's total.

Priority Intersections

In order to identify which of the Bronx's 6,438 intersections have the highest need and greatest potential safety gains, DOT used an approach similar to the Priority Corridor process. DOT selected the intersections with the highest number of pedestrian KSI that cumulatively account for 15% of the borough's total pedestrian KSI.

Priority Areas

Some of the safety issues throughout the Bronx occur systematically at an area-wide level and are not confined to a single intersection or street. To account for these areas, the pedestrian KSI crash dataset was transformed into a kernel density map (or heat map), which indicates where the density of these crashes is highest.

Source: http://www.nyc.gov/html/dot/downloads/pdf/ped-safety-action-plan-bronx.pdf

Excellent, very useful information for understanding the program and its placement.
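The cumulative-share selection rule DOT describes (rank units by KSI, take from the top until the selected units account for a target share of the total) can be sketched as follows; the corridor names and KSI counts are made up:

```python
def select_priority(units, share):
    """Rank units by KSI (descending) and select from the top until the
    selected units account for `share` of the total KSI.
    `units` is a list of (name, ksi) pairs; returns the selected names."""
    total = sum(k for _, k in units)
    selected, cum = [], 0.0
    for name, ksi in sorted(units, key=lambda u: u[1], reverse=True):
        if cum >= share * total:
            break
        selected.append(name)
        cum += ksi
    return selected

corridors = [("A", 50), ("B", 30), ("C", 15), ("D", 5)]
priority = select_priority(corridors, 0.5)   # corridor A alone reaches half the total
```

With `share=0.5` this mimics the Priority Corridor rule, and with `share=0.15` the Priority Intersection rule.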
https://popfactfinder.planning.nyc.gov/#12.25/40.724/-73.9868
It seems we have population data by city blocks here, but no means to obtain it. Other data I have found are PUMAs, which have a minimum of 100,000 people.
Tried downloading Population & Housing Unit Counts — Blocks to see if they have population data by blocks. https://www.census.gov/geo/maps-data/data/tiger-data.html
Census tract level data does seem to be available if you follow the info here https://gis.stackexchange.com/questions/9380/where-to-get-2010-census-block-data
I just tried getting data for Queens and was able to download (I clicked on employed civilian population b/c there was such a long list of available variables that I didn't go searching for total pop but it's presumably there). I'm blown away by how much data is available for download these days!
I've found one that goes down to block level! TIGER data ("tabblock2010_36_pophu" > "Population & Housing Unit Counts — Blocks")
Now I need to think about how to bring the information into the dataset of intersections. I wonder if it matters since I will be running a fixed effects panel.
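One way to bring block-level population onto the intersections is a point-in-polygon lookup; a minimal sketch with shapely (the polygons and populations are invented, not real census blocks):

```python
from shapely.geometry import Point, Polygon

# Two hypothetical census-block polygons with made-up populations
blocks = [
    (Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]), 120),
    (Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]), 80),
]

def block_population(pt):
    """Return the population of the block containing the intersection point."""
    for poly, pop in blocks:
        if poly.covers(pt):   # covers() includes points on the boundary
            return pop
    return None

pop = block_population(Point(1, 1))
```

In practice a spatial join (e.g. geopandas `sjoin`) does this in bulk, but the logic per intersection is the same.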
Also I can't seem to find the shapefiles for Priority corridors, intersections, and areas for each of the boroughs. Might have to do another request with DOT, or try to use raster layers (never tried before).
Opened another ticket to request information from DOT: DOT-397860-W3Z1. Will probably try to email Robert as well.
@jeremysze can you add @partha-deb to have access to this github repo?
Added! I am running the code to calculate the inverse distance matrix. It takes a really long time.
Hi Jeremy, Regarding your request about LPI studying. Would you please give me a call at 212-839-3370?
Danny Nguyen Director, Signals Timing Engineering NYCDOT – Traffic Operations 34-02 Queens Boulevard Long Island City, NY 11101 Tel: 212-839-3370
Heard back from the DOT. Going to call tomorrow.
> Also I cant seem to find the shapefiles for Priority corridors, intersections and areas for each of the boroughs. Might have to do another request with DOT, or try to use raster layers (never tried before).
I actually found them after some digging! http://www.nyc.gov/html/dot/html/about/vz_datafeeds.shtml
Jeremy --
I'm not sure how you are going about calculating the inverse, but sometimes that can take way too long. Often it is easier to compute an approximation, for which there are several methods available that rely only on matrix multiplication, which is much faster. Let me know if you need help with that.
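One family of approximations that uses only matrix multiplication is the truncated Neumann series for (I - rho*W)^(-1), common in spatial econometrics. A sketch, assuming this is the kind of matrix inverse at issue (the weight matrix below is invented):

```python
import numpy as np

def neumann_inverse(rho, W, terms=50):
    """Approximate (I - rho*W)^{-1} by the truncated series
    I + (rho*W) + (rho*W)^2 + ..., which converges when |rho|*||W|| < 1."""
    n = W.shape[0]
    acc = np.eye(n)
    power = np.eye(n)
    for _ in range(terms - 1):
        power = power @ (rho * W)   # only matrix multiplications, no inversion
        acc += power
    return acc

# Small row-standardized weight matrix for a 3-node chain
W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
approx = neumann_inverse(0.4, W)
exact = np.linalg.inv(np.eye(3) - 0.4 * W)
```

For a row-standardized W and |rho| < 1 the series converges geometrically, so a modest number of terms already matches the exact inverse closely.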
@mbaker21231 I am using spmatrix create idistance
I would like to try the approximation methods.
@mbaker21231 I found this potential solution on statalist. https://www.statalist.org/forums/forum/general-stata-discussion/general/1398532-inverse-distance-spatial-matrix-using-stata
@mbaker21231 got the inverse distance matrix! But I would still like to learn your approximation methods.
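For comparison, building the inverse-distance matrix itself (1/d_ij off the diagonal, as in Stata's spmatrix create idistance) is compact in Python too; the coordinates here are made up:

```python
import numpy as np
from scipy.spatial.distance import cdist

xy = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])  # hypothetical coordinates
d = cdist(xy, xy)                  # pairwise Euclidean distances

W = np.zeros_like(d)
off = ~np.eye(len(xy), dtype=bool)
W[off] = 1.0 / d[off]              # 1/d off the diagonal, 0 on the diagonal
```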
Short 10-minute call with Danny. Could not hear him too well, but I got the main points.
There are 4 main areas they consider when selecting intersections to be LPIS.
He made the point that they can't implement LPIS on major intersections because they have too many turns, or there is not enough time.
The DOT website's Vision Zero Report has a section that says LPIS reduce pedestrian collisions by 40%.
Summary of today's discussion. Attendance: Prof Conning, Prof Deb, Prof Baker, Jeremy, Christian, Anjelica
Next steps
https://geodacenter.github.io/workbook/4b_dist_weights/lab4b.html Thiessen polygons! Found a solution, which we briefly discussed at Wednesday's meeting: turn the points into polygons in order to calculate a rook contiguity matrix. Got to find the python version of this solution.
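In Python, scipy's Voronoi gives the Thiessen tessellation directly, and its `ridge_points` already lists the pairs of input points whose cells share an edge, which is exactly the contiguity relation we want; a sketch with invented points:

```python
import numpy as np
from scipy.spatial import Voronoi

# Hypothetical intersection points: four corners plus a center
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]])
vor = Voronoi(pts)

# Each row of ridge_points is a pair of input points whose Thiessen
# (Voronoi) cells share an edge, i.e. contiguous neighbors.
neighbors = {tuple(sorted(int(i) for i in pair)) for pair in vor.ridge_points}
```

The center point ends up contiguous with all four corners, while diagonally opposite corners are not neighbors, which is the behavior we'd want from the polygon-based rook matrix.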
Cool. What's the broad strategy? The python pysal library is the main one for spatial weighting and regression. There may be an option in there. Also/alternatively from my very brief reading, if you have the road network as a network (as you could from osmnx) there seems to be an adjacency matrix option that may be relevant.
I was looking at pysal, but it does not seem like they have a Thiessen polygons function. I was just planning to test out the GeoDa software first.
If we can do nearest line to point, I will be able to "merge" the signal intersections into the LION road shapefile. I could also convert the lines into points and merge them, then, using some LION road ID, merge the data into the line geopandas. This way we can use osmnx to do the adjacency matrix.
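The nearest-line-to-point step can be done with shapely by projecting the point onto the line; a sketch with an invented segment and signal location:

```python
from shapely.geometry import LineString, Point

# Hypothetical LION street segment and a nearby signal-intersection point
street = LineString([(0, 0), (10, 0)])
signal = Point(4, 3)

# Closest point on the street to the signal: one way to "merge" point
# data onto the line network (snap the point to the line)
on_line = street.interpolate(street.project(signal))
```

Running this per signal against the nearest candidate segment gives the snapped locations (and `street.project` also returns the distance along the line, useful as a merge key).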
I may be misunderstanding, but aren't things simpler? In the example from this notebook describing spatial weights using pysal it looks like they start from point data and then create a spatial weights matrix
They start with point data that looks like this
They produce a 'rook' adjacency network
rW = ps.rook_from_shapefile(shp_path, idVariable='FIPS')
To get the weights matrix it seems you'd do:
Wmatrix, ids = rW.full()
Instead of rook it could be queen, bishop, K nearest neighbors, or any other 'adjacency' criterion. Presumably a '1' would be entered in the matrix for each pair that is adjacent and 0 otherwise.
This plot shows the 'rook' relationship extracted:
The Thiessen polygons approach seems to first tessellate the intersections and then measure adjacency/distance from the centroids of those new polygons (which will generally differ from the intersections). But why do all that when you can measure from the points to begin with?
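To make the 0/1 adjacency idea concrete, here is a sketch using networkx rather than pysal: a 3x3 grid graph has exactly the 'rook' neighbor pattern (N/S/E/W), and its adjacency matrix is the weights matrix with 1 for adjacent pairs and 0 otherwise:

```python
import networkx as nx

# A 3x3 grid: each node is connected to its N/S/E/W neighbors,
# which is exactly the 'rook' adjacency pattern.
G = nx.grid_2d_graph(3, 3)

# 9x9 binary adjacency matrix, nodes ordered consistently
A = nx.to_numpy_array(G, nodelist=sorted(G.nodes()))
```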
I think I probably was not searching for the right term in the pysal documentation, or not understanding it. But this solution looks right. I am going to try it.
I've built the panel dataset with the new variables.
Actually... the simple spatial contiguity methods may not be able to deal with grids that are not on a N-S orientation (e.g. if they are tilted, say, 45 degrees, then the 'rook' method might find not the closest intersection but the one diagonally away on the same block).
What we need is a nearest 'connected' neighbor. I started to play with neighbors and adjacency matrix methods to the networkx representations of the street network one can get with osmnx in this notebook.
The methods would seem to work, but the osmnx street network has 'too many' minor intersections (e.g. one for each lane on divided large roads), so it doesn't really map well back onto the data you've worked to create.
Maybe if we use the LION street shapefile, overlay the signalized intersections, and remove the other minor (non-signalized) intersections (nodes), we can narrow it down to just the network of signalized intersections.
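A sketch of that narrowing-down idea: keep only signalized nodes, and connect two of them whenever some path in the street graph runs between them through non-signalized nodes only (toy graph, invented node names):

```python
import networkx as nx

def signalized_subnetwork(G, signals):
    """Collapse a street graph to its signalized intersections: connect two
    signalized nodes if some path between them uses no other signalized node."""
    H = nx.Graph()
    H.add_nodes_from(signals)
    for s in signals:
        # Walk outward from s, stopping whenever another signal is reached
        stack, seen = [s], {s}
        while stack:
            node = stack.pop()
            for nbr in G.neighbors(node):
                if nbr in seen:
                    continue
                if nbr in signals:
                    H.add_edge(s, nbr)
                else:
                    seen.add(nbr)
                    stack.append(nbr)
    return H

# Toy network: A - x - B - y - C, with signals at A, B, C only
G = nx.Graph([("A", "x"), ("x", "B"), ("B", "y"), ("y", "C")])
H = signalized_subnetwork(G, {"A", "B", "C"})
```

In the toy graph, A and C are not directly connected in H because every path between them passes through the signalized node B.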
I was trying to run the code with a cross section of the panel, but I received an error: AttributeError: 'Point' object has no attribute 'bbox'.
https://github.com/jeremysze/LPIS/blob/master/analytical%20panel%20shapefile%20-%20quarter%20-%20spatial%20weights.ipynb
I'm guessing the issue is that you might be giving it non-connected intersection points whereas, from the pysal docs: "Rook contiguity defines as neighbors any pair of polygons that share a common edge in their polygon definitions."
The original 'lion street shapefile' you mention might allow you to build the adjacency matrix:
wr = rook_from_shapefile("xxx.shp", "POLYID")
but there would be the problem you've mentioned of many false/small/repeat intersections. As I understand things, you extracted the intersections as points (which would lose you the 'edges') and then merged or otherwise removed those that were not important intersections.
So (conceptually at least) the way forward might involve somehow removing false/small/repeat intersections without losing the other points and their edges.
Related question: how is the 'Lion street shapefile' you are using different from the TIGER/Line Shapefile that's also available for NYC and that osmnx uses and loads?
The osmnx library works as a front-end for pysal to find neighbors and adjacency/contiguity relationships and it has a few methods for simplifying road networks.
To load the road network near my neighborhood:
G = ox.graph_from_point((40.75, -73.88), distance=750, network_type='drive')
The 'drive' network type makes sure that it's not pulling in secondary roads, and every point is an intersection, etc. These could be either kept or perhaps merged.
This still leaves some intersections with two nodes very near each other (e.g. on 34th Ave close to my home, but that's because there is a garden divider between the two sides of the road). But if I'm understanding correctly, you could merge very close intersections using the clean_intersections command
osmnx.simplify.clean_intersections(G, tolerance=15, dead_ends=False)
where nodes "within this tolerance distance (in graph’s geometry’s units) will be dissolved into a single intersection." It returns the centroid of the cluster.
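The dissolve-within-tolerance idea can be mimicked with a single-linkage clustering pass; this is not osmnx's implementation, just a sketch of the same behavior on invented points:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical intersection nodes; the two left-most are a divided-road pair
pts = np.array([[0.0, 0.0], [10.0, 0.0], [200.0, 0.0]])

# Merge nodes closer than a 15-unit tolerance and replace each cluster
# with its centroid (similar in spirit to clean_intersections)
labels = fcluster(linkage(pts, method="single"), t=15.0, criterion="distance")
centroids = np.array([pts[labels == k].mean(axis=0) for k in np.unique(labels)])
```

The divided-road pair (10 units apart) collapses to one centroid, while the far node survives as its own intersection.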
So if it is not too much work I wonder whether you could start from the TIGER/Line data that osmnx pulls in, simplify as above and then merge in your data on signaled intersections (since we've seen already that the OSM data seems incomplete). You'd then have some additional info on edges (e.g. which way traffic flows) but also presumably ought to be able to calculate those adjacency/contiguity matrices..
This may of course be more work than it is worth, or there could be some downsides I'm not seeing, so don't feel the need to go down this path just because I ended up speculating about it all!
Yes.. after the finals please and only if you think it's worthwhile.
The osmnx library's clean_intersections is illustrated in 14-clean-intersection-node-clusters.ipynb
What's returned is a geoseries but, according to the notebook, "these cleaned up intersections give us more accurate intersection counts and densities, but do not alter or integrate with the network's topology." In other words, you get points without edges (but we need edges for the adjacency matrix!). I've posted a brief Stack Overflow comment/question to Geoff Boeing, the creator of osmnx, to see if he has any suggestions on how to also simplify the road network.
I think it is definitely worthwhile, since this is going to be a problem with any other similarly structured analysis. It's also quite fun to solve these problems.
Decided to try thiessen polygons. Seems like it might be doing what we want?
Awesome picture.. I want a poster of that on my wall. Thiessen polygons do seem to do the trick, since at the end of the day they are labeling contiguity and map back onto your original intersections. And it seems more straightforward than the paths I was suggesting.
[image: thiessen polygons] https://user-images.githubusercontent.com/6887005/49830750-1113d400-fd60-11e8-97cd-13326a3ffdf4.PNG
https://github.com/jeremysze/LPIS/blob/master/Spatial_regression_qt.ipynb
I brought it into Stata. Stata's graphs look pretty decent. I will have to "reconnect" that cross-section data into the panel, or just calculate the spatial matrix and load that matrix when I run spxtregress. This should technically work.
https://github.com/jeremysze/LPIS/blob/master/Spatial_regression_qt.ipynb
Ran it with the panel data. It seems that most people doing spatial analysis calculate Moran's I to determine if spatial autocorrelation exists, but most Moran's I calculations are done on cross-sectional data.
Since my data is panel data, it will be more challenging.
Found a paper that describes how to do bootstrap calculations to get Moran's I for spatial panel data: https://www.researchgate.net/publication/262384241_Moran's_I_test_of_spatial_panel_data_model_-_Based_on_bootstrap_method https://www.sciencedirect.com/science/article/pii/S026499931400162X Apparently there is an FDB Moran's I that I should be using.
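For the cross-sectional building block, the classic Moran's I statistic is short to compute directly (the weights and outcome values below are invented):

```python
import numpy as np

def morans_i(y, W):
    """Classic (cross-sectional) Moran's I: (n / S0) * z'Wz / z'z,
    where z are deviations from the mean and S0 is the sum of all weights."""
    z = y - y.mean()
    s0 = W.sum()
    return (len(y) / s0) * (z @ W @ z) / (z @ z)

# 4 nodes on a line, binary contiguity weights
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0])   # trending values: positive spatial autocorrelation
```

A bootstrap version, as in the cited paper, would recompute this statistic over resampled panels rather than a single cross-section.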
Replicate QGIS steps into python commands for easy "replicate-ability"
Within Stata: