Method for calculating de-identified trip data (open data)

joanathan commented 4 years ago

Is your feature request related to a problem?

Trip data is useful for many reasons, including for planning and research purposes, such as those expressed in #136.

Trip data however carries a risk, as described in the Chicago TNP and Taxi Open Data Approach: “It has been recognized in scientific literature and news reports that even data without directly identifying attributes can be reidentified using other data sources. Specifically, data about an individual’s location at certain points in time can create a ‘fingerprint’ that can allow for re identification, as long as there is a separate dataset available containing parts of the fingerprint along with identifying fields.”
By design, MDS data does not include rider characteristics, e.g. rider’s name, date of birth, zip code, phone number, gender identification, or any other attribute related to the individual. The exceptions are the location and time of trip-related events and the in-trip route trace telemetry.

To further protect against re-identification of MDS data, LADOT is implementing an approach similar to the one outlined and used by Chicago for TNP and Taxi, and micromobility trip data. Here we propose a Trip Binning approach to further de-identify trip data and a new API to support serving it.

Describe the solution you'd like

Following the example of the submitted Metrics API #485, we propose a new de-identified trips API that could be implemented by either mobility Providers or regulating Agencies (based on data obtained from the Provider or Agency /trips endpoint). The single endpoint would be something like /trips/deidentified with a variety of search parameters.

The API would contain precise definitions for further de-identified trips. It could support private exchange between Providers and Agencies, or more public feeds from either of them, e.g. for academic researchers.

Is this a breaking change

No, not breaking

Impacted Spec

agency
provider

Describe alternatives you've considered

We considered a single /trips endpoint whose permissions and scope would determine whether it serves granular or further de-identified trip information. We went with separating /trips from /trips/deidentified to create a clear delineation between granular and further de-identified trip information.

As for methodologies to de-identify trip information, cities have taken different approaches: rounding the start and end locations to a broader geography, binning the time to a larger time interval, and/or applying k-anonymization. We went with a de-identification technique that combines the best aspects of these different approaches.
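To make the first two approaches concrete, here is a minimal Python sketch of spatial rounding and temporal binning. It is only an illustration, not the method proposed in this issue: the flat trip dictionary and its field names (`start_lat`, `start_lon`, `end_lat`, `end_lon`, `start_time`, `end_time`) are hypothetical and do not match the MDS trip schema, and the defaults of 3 decimal places and 15-minute bins are arbitrary assumptions.

```python
from datetime import datetime


def round_coordinate(value: float, decimals: int = 3) -> float:
    """Coarsen a latitude or longitude by keeping fewer decimal places."""
    return round(value, decimals)


def bin_timestamp(ts: datetime, minutes: int = 15) -> datetime:
    """Floor a timestamp to the start of its time bin.

    Assumes `minutes` divides 60 evenly, so bins never straddle an hour.
    """
    floored_minute = (ts.minute // minutes) * minutes
    return ts.replace(minute=floored_minute, second=0, microsecond=0)


def generalize_trip(trip: dict, decimals: int = 3, minutes: int = 15) -> dict:
    """Return a copy of a (hypothetical) flat trip record with coarsened
    start/end coordinates and binned start/end times."""
    out = dict(trip)
    for field in ("start_lat", "start_lon", "end_lat", "end_lon"):
        out[field] = round_coordinate(trip[field], decimals)
    for field in ("start_time", "end_time"):
        out[field] = bin_timestamp(trip[field], minutes)
    return out


example = {
    "start_lat": 34.052235, "start_lon": -118.243683,
    "end_lat": 34.061512, "end_lon": -118.250119,
    "start_time": datetime(2020, 5, 1, 8, 47, 31),
    "end_time": datetime(2020, 5, 1, 9, 2, 5),
}
coarse = generalize_trip(example)
```

Rounding and binning alone do not guarantee that a coarsened trip is shared by several riders, which is why k-anonymization is listed as a separate step.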

We tested approaches that go beyond spatial and temporal rounding by using k-anonymization methodologies, namely Trip Binning, which originated from the City of Chicago, and Point Fuzzing by Louisville. The results of the comparison study are here.
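For readers unfamiliar with k-anonymization, the sketch below shows a generic check in Python: records are grouped by their generalized attributes, and any group with fewer than k members is withheld. This is not Trip Binning or Point Fuzzing, whose details are in the comparison study; the function name `suppress_rare`, the field names, and the default k of 5 are assumptions made for illustration.

```python
from collections import Counter


def suppress_rare(records: list, key_fields: tuple, k: int = 5) -> list:
    """Keep only records whose combination of key fields occurs at least
    k times, so every published record is shared by k or more trips."""
    def key(record: dict) -> tuple:
        return tuple(record[f] for f in key_fields)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]


# Hypothetical, already-generalized trips: a start cell and a start hour.
trips = [
    {"start_cell": "A", "start_hour": 8},
    {"start_cell": "A", "start_hour": 8},
    {"start_cell": "A", "start_hour": 8},
    {"start_cell": "B", "start_hour": 23},
]
published = suppress_rare(trips, ("start_cell", "start_hour"), k=3)
```

Instead of dropping the rare records, a method can also coarsen them further (a larger area or a longer time window) until each group reaches k.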

Additional context

I'd welcome suggestions for better naming for /trips/deidentified.

Special thanks to @schnuerle for developing the Point Fuzzing technique and providing input to the comparison study, to @nicklucius for developing the Trip Binning approach for Chicago, and to @whereissean, @playground-julia, and Sam Jackson, who kickstarted this work and have been relentless in developing a standard for trip de-identification.

We hope to push the privacy conversation a step further and stimulate discussion with more members of the OMF community.

joshuaandrewjohnson1 commented 4 years ago

Consider reviewing/including the spatial and temporal binning approach described in the City of Minneapolis' Data Methodology.

johnclary commented 4 years ago

@schnuerle would you add a Privacy label to this one, please?

schnuerle commented 4 years ago

I love your work on this and the attention to detail and data analysis you've done. A big benefit of this is the ability to share this MDS-derived data with others, like the public through open data and the "more public feeds for e.g. academic researchers" you mention.

I'd like to see this put into a Pull Request, defining how this would fit in at /trips/aggregate or something, and then a separate markdown page in Provider that explains how the calculations can be done in the most straightforward way. You can link to support docs if needed (and these could be incorporated into the spec, like the precedent we have with the State Machine Diagram).

Or maybe this is a stand alone tool first, something that can be made and published for people to use and try out and get feedback on before it becomes part of the MDS spec? It would take private MDS data and process it to this methodology. That would also help with defining the utility and use cases around this data processing, and have a group of cities/orgs pushing to support its inclusion.

I know you are using S2 in your example. I wonder if H3 may be a better system to use, and I'd like to hear your thoughts on each one. My only reasoning is that I like the more regular hexagons better than the irregular parallelograms.

Something else I noticed is that this proposal differs from Louisville's in one respect. You end up specifying which point gets binned at which level in S2 (e.g. a granular zone or a catch-all zone), so the end user knows which points have been fuzzed to the catch-all zone. In the Louisville example, only the corner point location is published and not the polygon, so it is not known which points have been fuzzed. That (I think) makes things more anonymous, since any point could be fuzzed or not and the end user does not know which are.

PlannerOnTheGo commented 4 years ago

So DC (@sharades) is definitely interested in a publicly available, "privacy-safe" version of the MDS/shared mobility data for research and public consumption.

Some questions:

How does this relate to SharedStreets Mobility Metrics? DC (and others) are very interested in that approach to a "safe" dataset, notably the full range of metrics (routing, OD pairs, not just starts and ends).
Would this be a publicly-available MDS feed, not bundled in with a city's login?
It seems like this would be provider-specific, but you could get more detail if you combine from multiple providers across a jurisdiction. Is there a way to allow for aggregation? Seems like you'd need to run that after putting the files together, so more likely would need to occur on the city side, not the provider side.

schnuerle commented 3 years ago

@PlannerOnTheGo I'll try and answer some questions based on my thoughts, but welcome @joanathan and others to join in.

I think it differs from SharedStreets in that it's fuzzed lat/lon coordinates, not points snapped to street segments. And I don't think SharedStreets is meant to be public - it's for internal use only.
I think the city could serve this up as a public API, much like the proposed other public APIs for Geography and Policy. But it would require some data infrastructure and processing. It may be easier to hit it monthly and extract the resulting data to an open data site. Or a city could rely on a third party. Louisville does a very similar open data approach (mentioned in this issue and discussed in the methodology comparison) with the help of Stae which you can see in action, filter and export here.
I don't think it's clear here, but I think this proposal says the provider ID would come across with the fuzzed data. That would be ok if it's authenticated and for a city, but not for publishing to the public, for competition concerns. I think aggregation across providers would happen, and in the final data you could choose to not publish the provider info, just that a trip happened. Again, that's what happens in Louisville: provider ids are removed so you don't know which of the 5 provider devices a trip was made on.
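To illustrate that last point, here is a minimal Python sketch of combining already de-identified trips from several providers on the city side and dropping the provider fields before publishing. The function name `merge_for_publication`, the input layout (a dict mapping a provider label to its list of trips), and the `start_time` field used for sorting are assumptions for illustration; this is not how Louisville or any city necessarily implements it.

```python
def merge_for_publication(feeds: dict,
                          drop_fields: tuple = ("provider_id", "provider_name")) -> list:
    """Merge per-provider trip lists into one list without provider fields.

    `feeds` maps a provider label to a list of trip dicts (hypothetical
    layout). The result is sorted by start_time so that the output order
    does not group trips by provider.
    """
    merged = []
    for trips in feeds.values():
        for trip in trips:
            merged.append({k: v for k, v in trip.items() if k not in drop_fields})
    merged.sort(key=lambda t: t["start_time"])
    return merged


feeds = {
    "provider_a": [
        {"provider_id": "a-1", "provider_name": "A", "start_time": 1200, "start_cell": "X"},
        {"provider_id": "a-1", "provider_name": "A", "start_time": 900, "start_cell": "Y"},
    ],
    "provider_b": [
        {"provider_id": "b-7", "provider_name": "B", "start_time": 1000, "start_cell": "X"},
    ],
}
public_trips = merge_for_publication(feeds)
```

Other fields that could indirectly reveal the provider can be added to `drop_fields`. Any k-anonymity check is probably best run after this merge, since the combined counts per bin are larger than any single provider's.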

schnuerle commented 2 years ago

I'm re-upping the conversation on this since the OMF Privacy Committee is thinking about this as a future priority and work-stream. Today there was a public Privacy Committee meeting on the topic of open data and MDS, with presentations from DC, DRCOG, PBOT, and Ride Report, and a rundown of methodologies from the State of Practice. There was interest in a standard around this, and in some way to publish a suitable open data report to allow standard program comparisons across cities as well. I think this could go within the Metrics API as a new endpoint, much like the Provider API Reports endpoint.

schnuerle commented 2 years ago

Of note is this new method of data processing brought to my attention by the students in the UPenn MUSA program: https://arxiv.org/abs/2205.08886

schnuerle commented 4 months ago

A resource from 2022 from TIER-Dott about how they anonymize GBFS data, which has some relevance to MDS data.

https://tier.engineering/How-we-anonymize-user-trips-on-public-APIs

schnuerle commented 4 months ago

Not sure this needs to be an API as stated. Instead it could be a calculation based on MDS data received by a public agency that can de-identify that trip data consistently to be published as open data.

openmobilityfoundation / mobility-data-specification