insight-lane / crash-model

Build a crash prediction modeling application that leverages multiple data sources to generate a set of dynamic predictions we can use to identify potential trouble spots and direct timely safety interventions.
https://insightlane.org
MIT License

Define standard: concerns #69

Closed terryf82 closed 6 years ago

terryf82 commented 6 years ago

Define a standard for the format that concerns data needs to be supplied in to be usable by the project:

Similar to the issue on defining a standard for crash data, this may not be the standard currently employed by Boston's data. We should define what we want our ideal standard to look like first, and then if necessary look at middleware to translate any city's data standard into ours.

Vision Zero Network may be interested in the outcome of this issue, so we should ensure that we include all attributes that might be important.

terryf82 commented 6 years ago

I've been looking through the VZ concerns data from Boston, which is the only source we have at the moment. I don't have any numbers to back this up (maybe @j-t-t or @bpben could help out with that), but my instinct says that this type of data, combined with our other sources, could go a long way towards predicting risk at different locations.

The data we have isn't very detailed. If we want to move beyond just assessing the number of concerns for a given segment and start to look at them in terms of severity and contribution to risk, we are going to need to do some data cleansing. The most structured aspects that I can see at the moment are REQUESTTYPE (around 20 possible values e.g. "people don't yield", "people speed", "it's hard to see") and USERTYPE (5 possible values - bikes, drives, walks etc.)
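For reference, a quick way to check how much structure those fields actually have; the file name and exact column names here are assumptions about the Boston export:

```python
import pandas as pd

# Count the categorical values in the raw Vision Zero export.
# "Vision_Zero_Entry.csv" and the column names are assumptions.
concerns = pd.read_csv("Vision_Zero_Entry.csv")

print(concerns["REQUESTTYPE"].value_counts())   # ~20 values, e.g. "people speed"
print(concerns["USERTYPE"].value_counts())      # ~5 values, e.g. "bikes", "walks"
```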

After reading through the comments for many concerns (some of which are hilarious by the way - "Everyone knows and says this intersection is just a disaster to drive or walk in. Hold a retreat somewhere out in the woods and come up with a way to fix it. It just doesn't work, especially for pedestrians.") I've been thinking of ways we might be able to assess them better. One option would be to do some language parsing of the comments looking for certain words, and using those matches to attach tags to a concern. For example:

"..hard to see.." = apply a poorVisibility tag "...drivers using bike lane..." = misuseOfBikeLane tag "...drivers do not stop..." = driversIgnoringSignage tag

and so on. I was able to easily come up with around 20 tags, and I'm sure there would be plenty more that could be gleaned.
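A rough sketch of the kind of keyword matching described here; the tag names and keyword lists below are purely illustrative, not taken from the actual data:

```python
# Illustrative keyword -> tag mapping; the real lists would be built up
# from reading through the concern comments (tag names are placeholders).
TAG_KEYWORDS = {
    "poorVisibility": ["hard to see", "can't see", "blind corner"],
    "misuseOfBikeLane": ["bike lane", "parked in the lane"],
    "driversIgnoringSignage": ["do not stop", "run the light", "ignore the sign"],
}

def tag_concern(comment):
    """Return the set of tags whose keywords appear in the free-text comment."""
    text = comment.lower()
    return {tag for tag, phrases in TAG_KEYWORDS.items()
            if any(phrase in text for phrase in phrases)}

print(tag_concern("Drivers do not stop here and it's hard to see around the corner"))
# -> {'driversIgnoringSignage', 'poorVisibility'} (set order may vary)
```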

The idea would then be to define a standard for concerns that has a built-in open tagging system (similar to OpenStreetMap). Every city will have its own unique problems that citizens want to report (think of cities in very cold climates, or with unusual public transport, etc.), so it doesn't make sense for us to try and force a standard on them with a fixed set of categories. But if we use an open tagging system, they can include whatever tags they feel are appropriate, and it's then up to us to create an ontology that maps the tags we deem relevant onto our own, which can then be used to calculate & explain risk.
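Such an ontology might be nothing more than a lookup table; the city tags and project tags below are invented for illustration:

```python
# Hypothetical ontology: whatever tags a city chooses to use get mapped
# onto the project's own vocabulary; unmapped tags are simply ignored.
CITY_TAG_ONTOLOGY = {
    "snow_bank_blocks_crosswalk": "poorVisibility",   # cold-climate city tag
    "tram_tracks_dangerous": "surfaceHazard",         # unusual public transport
    "drivers_speed": "speeding",
}

def normalise_tags(city_tags):
    """Translate a city's free-form tags into the project's tag set."""
    return {CITY_TAG_ONTOLOGY[t] for t in city_tags if t in CITY_TAG_ONTOLOGY}

print(normalise_tags({"snow_bank_blocks_crosswalk", "some_local_quirk"}))
# -> {'poorVisibility'}
```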

All of this though is based on the assumption that having concerns tagged in this way is going to help the project predict and explain risk, which of course is one of the core aims. Any feedback and ideas welcome, thanks.

@alicefeng @andhint

j-t-t commented 6 years ago

I've been meaning to parse the text for a while, as well as do a nice writeup of my results. It's a really notable finding that laypeople know what's not safe and that their beliefs are highly predictive.

I think it's a good idea to make standardized tags.

One thing I'm not sure about: I thought I read somewhere that, in addition to the website, Boston employees might have hit the streets to solicit concerns. Not sure where to check that, though.

alicefeng commented 6 years ago

I like the tagging idea too.

Have you given any thought to the temporal aspect of this data? i.e., when the concern was submitted, how old it is, etc. Also, how are we handling any changes in status (like a concern gets closed or resolved)?

bpben commented 6 years ago

I think this is a great idea and it would definitely lead to interesting insights, but I think we should start with a few high-level splits rather than a long set of very granular ones. For example, having just "bike/ped-related" vs "car-related". My sense is that more splits won't improve predictive power, i.e. there won't be enough sample to tease out any effect.
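As a sketch, that coarse split could come straight from USERTYPE rather than from the free text; the file name, column name and exact values below are assumptions:

```python
import pandas as pd

# Collapse the ~5 USERTYPE values into two coarse buckets.
# File name, column name and the exact USERTYPE values are assumptions.
def high_level_split(usertype):
    return "bike/ped-related" if usertype in ("bikes", "walks") else "car-related"

concerns = pd.read_csv("Vision_Zero_Entry.csv")
concerns["split"] = concerns["USERTYPE"].apply(high_level_split)
print(concerns["split"].value_counts())
```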

I think maybe this is part of a discussion we should have on feature additions, so we can prioritize the ones that are top of mind for stakeholders and the ones most likely to boost model performance.

terryf82 commented 6 years ago

@j-t-t re. solicited concerns - do you think that would have much of an impact on concern quality? I'm inclined to proceed with treating them all the same for now, but perhaps we could add that into our standard (source: user-submitted / solicited) that one day might yield results.

@alicefeng temporal data and status are going to be important. The DC concerns data that Jenny found looks to be realtime (the last entry added was today) but Boston data stops in Feb 2017. All concerns from both cities are at status "Unassigned". Perhaps for now the best we can do is discard concerns beyond a certain age?
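An age cut-off could be as simple as the sketch below; the date column name and the two-year threshold are assumptions:

```python
import pandas as pd

# Drop concerns older than a (somewhat arbitrary) two-year cut-off.
# "REQUESTDATE" is an assumption about the date column's name.
concerns = pd.read_csv("Vision_Zero_Entry.csv", parse_dates=["REQUESTDATE"])
cutoff = pd.Timestamp.now() - pd.DateOffset(years=2)
recent = concerns[concerns["REQUESTDATE"] >= cutoff]
print(len(concerns), "concerns total,", len(recent), "within the last two years")
```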

@bpben agree that going too far on tagging may not improve predictive power, but it could go some way towards explaining the contributing factors of risk at a location, which I think is just as important as identifying high-risk locations?

j-t-t commented 6 years ago

I don't think the quality would necessarily be different or even that it's worth breaking them down. I just wonder if it would be worth mentioning in the data collection methodology: it may be that you get a lot more responses if you solicit them, but I don't actually know.

terryf82 commented 6 years ago

@j-t-t @bpben @alicefeng Hoping to keep momentum on this one, here is a draft schema for concerns data, as well as a validated instance document with a few examples extracted from the Boston concerns -

https://github.com/Data4Democracy/boston-crash-modeling/blob/data_standards/data_standards/concerns-schema.json

https://github.com/Data4Democracy/boston-crash-modeling/blob/data_standards/data_standards/concerns-instance.json

I went with a JSON schema for now, but if there are strong feelings against this we could look elsewhere. Repeating this is probably more for my benefit than others, but I think we want to design our ideal standard for concerns to be made available to the model, rather than feeling tied to the format(s) that we've been given data in so far. Conversion of existing data into the new standard should hopefully be straightforward (and would give me a python coding task that I could probably actually do at the moment!). Let me know what you think, thanks.
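For anyone who wants to try it, validating an instance against the schema is a one-liner with the jsonschema package; the example record below is a guess at the shape of the standard, not copied from the actual files:

```python
import json
from jsonschema import validate

# Validate an example record against the draft schema.  The schema path
# follows the links above; the record itself is a guess for illustration.
with open("data_standards/concerns-schema.json") as f:
    schema = json.load(f)

example_concern = {
    "id": "boston-12345",
    "summary": "Drivers do not stop at the crosswalk",
    "tags": ["driversIgnoringSignage"],
    "location": {"latitude": 42.3601, "longitude": -71.0589},
}

validate(instance=example_concern, schema=schema)  # raises ValidationError on failure
```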

j-t-t commented 6 years ago

I have concerns around requiring this to be json for new cities. If you're writing a conversion script, I think it's fine for our standard to be json internally while letting cities give it to us in csv, so if that is what you're saying, I'm on board. But most cities appear to have this in csv already, and I'm concerned that if we're asking them to convert it in order to participate, it might be an unnecessary barrier to entry.

Some people have proposed a csv schema here, not sure if it would be helpful: http://digital-preservation.github.io/csv-schema/csv-schema-1.0.html

alicefeng commented 6 years ago

This looks great @terryf82 ! Thanks so much for the work you've put into this.

I think it'll definitely be important for us to document our internal data standards, but I agree with @j-t-t that from an external-facing perspective (i.e., a user guide), we'll want to write this up as what the cities themselves need to provide to our application.

terryf82 commented 6 years ago

@j-t-t we're on the same page. To implement this type of standard we would need middleware that can handle the conversions, both now and for the foreseeable future. But at some point I would like to think the project can have enough appeal that cities are willing to handle the data transformation themselves (or more likely, they pick up our standard as their starting point).

This speaks to a broader issue though, and I'm interested to hear everyone's input - are JSON-based schemas like this how we want to work with data coming into (and going out of) the project? My experience has always been that CSVs provide a practical starting point, but getting off them as soon as possible is the way to go, because of issues around structure, encoding, validation, portability etc. At the moment I think we generate predictions into a csv too, which @alicefeng reads for the visualisation - what if those predictions were output in JSON? We could then serve them via an API that the visuals interact with, as well as making them available to cities that want to include them in their own models & projects. Now seems like the right time to make these decisions, before things scale up much further. @bpben what are your thoughts on this?
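A minimal sketch of what serving JSON predictions could look like, assuming Flask (which the project may or may not adopt); the csv name and route are placeholders:

```python
from flask import Flask, jsonify
import pandas as pd

app = Flask(__name__)

# Serve the prediction output as JSON instead of a csv download.
# Flask, the file name and the route are all assumptions in this sketch.
@app.route("/predictions")
def predictions():
    preds = pd.read_csv("predictions.csv")
    return jsonify(preds.to_dict(orient="records"))

if __name__ == "__main__":
    app.run()
```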

j-t-t commented 6 years ago

I think that json-based schemas are a good way to move forward internally! I've been thinking we might also want to move from the shape files to geojson.
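If we go that way, geopandas makes the shapefile-to-GeoJSON step trivial; the file names below are placeholders:

```python
import geopandas as gpd

# Convert an existing shapefile to GeoJSON; file names are placeholders.
segments = gpd.read_file("maps/segments.shp")
segments.to_file("maps/segments.geojson", driver="GeoJSON")
```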

alicefeng commented 6 years ago

GeoJSON would be great from a viz perspective. That's what I'm using anyways.

@bpben and I had touched on the idea of an API too this weekend for the reasons you mentioned @terryf82 . Also, depending on the functionality we want, the viz might have to turn into a full blown web app with a proper backend at some point.

terryf82 commented 6 years ago

@j-t-t @alicefeng @bpben I've updated the concern schema -

https://github.com/Data4Democracy/boston-crash-modeling/blob/data_standards/data_standards/concerns-schema.json

so that concern.tags is now a required value, and must contain at least one item. I don't expect we'll learn much from concerns that have no detail associated with them, and I'm confident that we'll always be able to interpret at least one tag from a concern's summary.
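In jsonschema terms that constraint is just required plus minItems; the fragment below paraphrases the relevant part of the schema rather than quoting the actual file:

```python
from jsonschema import ValidationError, validate

# Paraphrased fragment of the updated schema: concern.tags is required
# and must contain at least one string.  Not the actual schema file.
tags_fragment = {
    "type": "object",
    "required": ["tags"],
    "properties": {
        "tags": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
}

try:
    validate(instance={"tags": []}, schema=tags_fragment)
except ValidationError as err:
    print("rejected:", err.message)   # empty tag lists fail validation
```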

@j-t-t assuming everyone is happy with this as the draft standard: if I write the scripts to convert the existing VZ CSVs for Boston & DC, and the SeeClickFix CSV for Cambridge, into this format, how much work is it for you to swap to reading from validated JSON files rather than the CSVs when making the canonical dataset? I know the concern.tags data (requesttype in the csv) isn't presently used, so in the short term I could just reuse the requesttype as a single tag while working on parsing the descriptions. Thanks.
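The conversion itself could be a short script along these lines; the csv column names and the shape of the output records are assumptions based on the draft schema:

```python
import json
import pandas as pd

# Sketch of the csv -> standard-JSON conversion.  Column names, file names
# and the output record shape are assumptions, not the actual pipeline code.
concerns = pd.read_csv("Vision_Zero_Entry.csv")

records = [
    {
        "id": f"boston-{row.OBJECTID}",
        "summary": row.COMMENTS,
        "tags": [row.REQUESTTYPE],   # reuse REQUESTTYPE as a single tag for now
        "location": {"latitude": row.Y, "longitude": row.X},
    }
    for row in concerns.itertuples()
]

with open("concerns-boston.json", "w") as f:
    json.dump(records, f, indent=2)
```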

j-t-t commented 6 years ago

I think this is fine, as long as we'd be willing to update the schema later if, for example, other cities' concern data turn out to have new fields that we think are useful.

It will be simple to convert to json, but we should think about what this means for the pipeline. Right now, the data generation pipeline takes csv files as we get them from the cities. Do we want to add a separate step outside the pipeline process and have the pipeline take json only, or do we want to add the convert script to the pipeline if we have csv files but not json files? At the moment, at least, I think I'd prefer the latter.

j-t-t commented 6 years ago

Do you not see my comments ^ unless I add your handle to it @terryf82 ?

terryf82 commented 6 years ago

@j-t-t I got email notifications of both comments. I think after you've been mentioned once in an issue, you're added to the notifications list until you remove yourself. But out of habit I keep using @'s anyway =)

My initial thought was to have a "data transformation" phase that is decoupled from, and happens prior to, the current pipeline. This could let us eventually spin off the transformation into a separate service (even one that is self-serve via a browser), meaning the pipeline only has to concern itself with data that it knows to be compatible with its standards. What do you see as the advantages of keeping them together?

Definitely agree that this is just a starting point for the standard, version 0.1 perhaps, with the expectation that we will need to adjust as we scale to more cities (I'm looking forward to having that particular problem!)

j-t-t commented 6 years ago

I just think that the fewer steps the better. I'm skeptical that (most) cities will be willing to adopt the json standard; I suspect the best we'll be able to ask for is that they provide what we want in csv. And that's fine with your transformation script, but why make the cities execute two steps when they can execute one?

And we already have a pipeline that goes from csv to canonical dataset. It's not a problem to make the changes to let users either start with csv or with json (whatever they have), and end up with both a json file and the canonical dataset.

bpben commented 6 years ago

Again late to the discussion but great work on this so far, @terryf82. I think I agree with @j-t-t that we should be as flexible as possible with input. Worst case, it seems, we could just have part of the pipeline perform a transformation?

On the flexible note: Is requiring tags a good idea here? As I see it, we could maybe populate tags with another set of scripts, looking for specific words like "bike" or "pedestrians".

terryf82 commented 6 years ago

@j-t-t we're already using pandas.read_csv to import the raw concerns, so maybe we just need to give it an encoding=x param to solve the ascii encoding issue you mentioned?
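Something like the following, though which encoding the Boston export actually uses is an assumption (latin-1 is a common fallback):

```python
import pandas as pd

# Pass an explicit encoding to read_csv; latin-1 here is a guess at what
# would resolve the ascii decoding errors mentioned above.
concerns = pd.read_csv("Vision_Zero_Entry.csv", encoding="latin-1")
```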

j-t-t commented 6 years ago

Although we certainly can fix it downstream, I think as long as we have a transformation process, it should be fixed there.

terryf82 commented 6 years ago

Version 0.1 of the concerns standard is now complete and integrated -

https://github.com/Data4Democracy/boston-crash-modeling/blob/data_standards/standards/concerns-schema.json

I've emailed it to VisionZeroNetwork to see if they have any feedback/interest in discussing further, will let you all know when I hear back from them.