mozilla / ichnaea

Mozilla Ichnaea
http://location.services.mozilla.com
Apache License 2.0

Retiring the Mozilla Location Service #2065

Open mostlygeek opened 3 months ago

mostlygeek commented 3 months ago

The accuracy of the Mozilla Location Service (MLS) has steadily declined. With no plans to restart the stumbler program or increase investment in MLS, we have made the decision to retire the service.

In 2013, Mozilla launched MLS as an open service to provide geolocation lookups based on publicly observable radio signals. The service received community submissions of GPS data from the open source MozStumbler Android app. In 2019, Skyhook Holdings, Inc. contacted Mozilla and alleged that MLS infringed a number of its patents. We reached an agreement with Skyhook that avoided litigation. This required us to make changes to our MLS policies and made it difficult for us to invest in and expand MLS. In early 2021, we retired the MozStumbler program.

We are grateful for the community's contributions to MLS, both to the code and to the dataset. To minimize disruption and give people time to make alternative arrangements, we have created a schedule that implements the retirement in stages. The retirement plan can be tracked in this issue.

There will be five stages. 

  1. As of today (Mar 13th, 2024) we will stop granting new API access keys. All pending applications will be rejected. 
  2. On March 27th, 2024 we will stop accepting POST data submissions to the API. All submissions will receive a 403 response and the submitted data will be discarded. Additionally, we will stop publishing new exports of cell data for download. 
  3. On April 10th, 2024 the cell data downloads will be deleted and will no longer be available. 
  4. On June 12th, 2024 third party API keys will be removed and the service will only be available for Mozilla’s use cases. 
  5. On July 31st, 2024 this source repo (https://github.com/mozilla/ichnaea) will be archived. 

The source code for the MLS application, Mozilla Ichnaea, will remain available under the Apache License 2.0.

heftig commented 3 months ago

Firefox still uses MLS for browser.region.network.url; will that also move to Google Location Services?

alexcottner commented 3 months ago

Firefox still uses MLS for browser.region.network.url; will that also move to Google Location Services?

This endpoint will be migrated to another service (classify-client) that will return the expected response. We'll adjust DNS entries when it's time to make that move, so Firefox won't see any difference.

heftig commented 3 months ago

Will downstream builds of Firefox need to obtain API keys for this new service?

rtsisyk commented 3 months ago

Does anyone want to collaborate to run MLS in some jurisdiction that isn't concerned about patents? Drop a message to roman@organicmaps.app

mar-v-in commented 3 months ago

@rtsisyk

Do we have any insights on which patents are applicable in this context and which jurisdictions they apply in? As you might know, Europe is generally far less into software-related patents, so maybe that "some jurisdiction" could be very easy to find.

The other question would be whether Mozilla is willing to hand over the WiFi dataset to a new organization running an ichnaea server, or if we'd have to start from scratch.

DylanVanAssche commented 3 months ago

The Bluetooth beacons would also be useful. I know Mozilla did not publish the WiFi and Bluetooth beacons for privacy reasons, but handing them over would be super beneficial. We would be set back by a decade if we have to start from scratch.

thestinger commented 3 months ago

GrapheneOS Foundation has been planning to host a network location service for GrapheneOS and projects collaborating with us for a while now. We've received significant funding we can put to use for this to make a high quality, modern implementation on both the client and server side. A new unified app (cellular, Wi-Fi, Bluetooth beacons) for gathering data to publish as fully open data could also be part of it. We also plan to build a SUPL implementation into the same service as an alternative to our Google SUPL proxy, replacing it as the default in the long term.

The main issue is obtaining high quality and reliable data to run the service. It's necessary to get a lot more users helping with it than Mozilla had submitting data. It seems to have been dying off for a while now. Simply accepting all user submissions of data also makes it easy to poison the database, which becomes a major concern for a less research focused project that's meant to be widely used in production. We have a plan for dealing with this using hardware-based attestation able to support submitting data from any modern Android device with the security model intact and recent security patches. Other mitigations are also needed. OpenCelliD exists but it would be very easy for people to ruin it if they haven't already, and we're dealing with those kinds of attacks on a regular basis so we know we can't use something not resistant to it. The experience of users with the existing services indicates to us that the existing data is problematic.

We're not aware of sources for Wi-Fi and Bluetooth data. It would be best to gather it from a unified app handling all of it with the resulting database being completely open data. We've never understood the privacy concern with providing a map of networks that are publicly broadcasting and are meant to be static landmarks. If they move around regularly, they shouldn't be included anyway, and those should be possible to distinguish. It seems highly unlikely that Mozilla would pass on this data to anyone else based on how they see it. They also may be forbidden to do that from the settlement they reached.

The key is figuring out how to get a large number of users to run an app submitting cellular, Wi-Fi and Bluetooth beacon data in a way that makes it difficult enough to mess with maliciously. The people who care about this tend to care about their privacy and probably don't want to be submitting this kind of data during regular use of their devices, especially if the data is kept tied to the accounts submitting it for anti-abuse. Publishing it all as open data available for any purpose is crucial for getting people to be interested in submitting the data. Anyone being able to use it for anything without constraints will give a lot of incentive to contribute to it. Mozilla wasn't making the most valuable data available to others. Having that data public also allows people to use it locally, which is important for privacy. Sending your location in real-time to a service isn't great regardless of who is hosting it, even if you can do it without an account via a VPN / Tor, particularly if relatively few people use it.

We're not at all happy with the existing approaches in this area, and if anyone wants to build something better, we're interested in funding serious work on it as remote work by people with experience working on similar things. This hasn't been something we considered a near-term project, but if there's suddenly a bunch of interest in it, perhaps it could be.

Relying on another centralized service that's not publishing the data would be a shame.

badrihippo commented 3 months ago

What's the plan for KaiOS devices? IIRC they rely on MLS for location. I hope mine continues to function!

vfosnar commented 3 months ago

@thestinger

We have a plan for dealing with this using hardware-based attestation able to support submitting data from any modern Android device with the security model intact and recent security patches.

This feels bad. I can't see this being implemented in an open source/community friendly way.

We've never understood the privacy concern with providing a map of networks that are publicly broadcasting and are meant to be static landmarks.

+1

The people who care about this tend to care about their privacy and probably don't want to be submitting this kind of data during regular use of their devices, especially if the data is kept tied to the accounts submitting it for anti-abuse.

Hot take: I'm okay with that. Fighting abuse while only collecting anonymous data is impossible. I wouldn't mind data connected to my account as long as I can trust the provider (clear privacy policy, legal limitations on the use of the provided data), there's a good anonymization process for the data (wifi endpoints only published after multiple accounts have seen them independently), and obviously both backend and clients are open source.

badrihippo commented 3 months ago

Hot take: I'm okay with that. Fighting abuse while only collecting anonymous data is impossible. I wouldn't mind data connected to my account as long as I can trust the provider (clear privacy policy, legal limitations on the use of the provided data), there's a good anonymization process for the data (wifi endpoints only published after multiple accounts have seen them independently), and obviously both backend and clients are open source.

I'm thinking of this as an "OpenStreetMap of location data". When people contribute locations to OSM, they are by nature implying that they were in the area—and they're fine with that.

Of course, location data is very different because the sharing process will likely be automated, not submitted through a conscious process. I can't speak for everyone, but I personally would be happy to submit this data to a trusted organisation. To make things even better, perhaps the account data can be de-linked after a reasonably long time, when most of the anti-abuse use-case has expired.

cookieguru commented 3 months ago

We're not aware of sources for Wi-Fi and Bluetooth data

wigle.net

oguz-ismail commented 3 months ago

good firefox next

thestinger commented 3 months ago

We're not aware of sources for Wi-Fi and Bluetooth data

wigle.net

Unfortunately, this isn't open data for any usage. It's not clear what they would allow, but it's non-commercial usage only, which somewhat implies not making a service usable for anything, including commercial usage.

Goldmaster commented 3 months ago

I am a little surprised, as it helps improve accuracy where GPS is not very good. It doesn't help that the option to help map was removed in the web browser rewrite, and that the stumbler app didn't get updated or made compatible with modern Android versions.

thestinger commented 3 months ago

@vfosnar

We have a plan for dealing with this using hardware-based attestation able to support submitting data from any modern Android device with the security model intact and recent security patches.

This feels bad. I can't see this being implemented in an open source/community friendly way.

Data could be accepted from anywhere and simply not trusted without a way to confirm it, with this being one way to show it has high confidence. It's possible to support any alternate OS as long as it implements a security model where the app can have strong confidence that it's not getting poisoned data. Apple and Google have a massive amount of data being submitted and can use lots of signals to determine if it's valid too. For a service with very little data, it's very easy for people to mess with it by submitting poisoned data. It's too easy to submit fake data to a service like OpenCelliD, and even more so with Wi-Fi networks and Bluetooth beacons. That brings the overall data into question. Making it significantly harder will get rid of nearly all the poisoned data even though it's still theoretically possible. The rest can be handled with moderation.

Hot take: I'm okay with that. Fighting abuse while only collecting anonymous data is impossible. I wouldn't mind data connected to my account as long as I can trust the provider (clear privacy policy, legal limitations on the use of the provided data), there's a good anonymization process for the data (wifi endpoints only published after multiple accounts have seen them independently), and obviously both backend and clients are open source.

There could be an opt-in to the level of privacy provided where the default could be something like not using the data until 3+ accounts have seen the networks, but with the option for less. Hardware attestation can give high confidence in the data being valid without needing to confirm it from several accounts believed to be separate people, and any app can use hardware attestation without any special privileges since it's privacy preserving, so I really think that's a good way to avoid needing to do something privacy invasive to confirm accounts are legitimate.

cookieguru commented 3 months ago

We're not aware of sources for Wi-Fi and Bluetooth data

wigle.net

Unfortunately, this isn't open data for any usage

That is incorrect; the EULA very explicitly states that you have a "right to use the maps and access point database...solely for your personal, research or educational, non-commercial purposes"

It's not clear what they would allow, but it's non-commercial usage only, which somewhat implies not making a service usable for anything, including commercial usage.

Your misuse of the apostrophe makes it difficult to decipher the intent, but if you're using the data for commercial purposes then you by definition have the funds to license the data. Whether or not the license fees are feasible for your project can only be speculated.

It certainly sounds like you're forging ahead on this path

thestinger commented 3 months ago

@cookieguru

That is incorrect; the EULA very explicitly states that you have a "right to use the maps and access point database...solely for your personal, research or educational, non-commercial purposes"

That's what I said: it's not open data usable for any purpose. A non-commercial usage restriction would prevent hosting a service which can itself be used for commercial purposes. Normally, you can't simply bypass a license by wrapping it behind something. They'd need to be asked what they would permit, and there's no way to make an open data service from licensing proprietary data. We want anyone to be able to host it.

Your misuse of the apostrophe makes it difficult to decipher the intent, but if you're using the data for commercial purposes then you by definition have the funds to license the data. Whether or not the license fees are feasible for your project can only be speculated.

Writing "it is" as "it's" twice in a row is not misuse of the apostrophe, and I'm not sure how it would make it harder to understand.

but if you're using the data for commercial purposes then you by definition have the funds to license the data

Publishing a service usable by anyone for any purpose for free is likely considered commercial use of the data, because people would be using the service for commercial purposes. You cannot simply wrap it up in a non-profit service that's used commercially. They may also have an issue with taking donations to support a service.

We want an open source service with open data. Paying for data with heavily restricted usage terms wouldn't work.

It certainly sounds like you're forging ahead on this path

Mozilla's location service wasn't even an implementation of what we want, because it was no longer maintaining / promoting data submission and the Wi-Fi/Bluetooth data wasn't open. It has degraded over time. I hadn't seen the legal situation it was in before yesterday, and that partially explains things. I think it would have rotted away and been discontinued either way, but the legal situation probably accelerated that. The death of the service was probably inevitable once FirefoxOS was gone. You can't really expect them to maintain a service they don't need and which has little to do with their main project.

Wigle is a proprietary service with proprietary data. Making an open source and open data service is not reinventing the wheel. Mozilla's service had largely proprietary, non-published data too, for the Wi-Fi/Bluetooth part of it. The service itself and the cell data were published.

maciejsszmigiero commented 3 months ago

Note that Geoclue has always allowed submitting data to MLS from systems where it can access a GNSS receiver.

It is opt-in, however, both for privacy reasons and because that's what our users expect.

woj-tek commented 3 months ago

Hot take: I'm okay with that. Fighting abuse while only collecting anonymous data is impossible. I wouldn't mind data connected to my account as long as I can trust the provider (clear privacy policy, legal limitations on the use of the provided data), there's a good anonymization process for the data (wifi endpoints only published after multiple accounts have seen them independently), and obviously both backend and clients are open source.

There could be an opt-in to the level of privacy provided where the default could be something like not using the data until 3+ accounts have seen the networks, but with the option for less. Hardware attestation can give high confidence in the data being valid without needing to confirm it from several accounts believed to be separate people, and any app can use hardware attestation without any special privileges since it's privacy preserving, so I really think that's a good way to avoid needing to do something privacy invasive to confirm accounts are legitimate.

my 3c: I try to be privacy-focused, but at the same time I do (try to) contribute to OpenStreetMap, which requires an account. I think there could be a group of people who wouldn't mind using an account to help collect the data (assuming it would only be used for verification during submission and would be anonymised/erased)

aembleton commented 3 months ago

my 3c: I try to be privacy-focused, but at the same time I do (try to) contribute to OpenStreetMap, which requires an account.

The data collection could even be built into something like Street Complete. As you're using it to update OSM, it could listen for SSIDs and feed those back. Street Complete already uses GPS, so this wouldn't hurt battery life.

mar-v-in commented 3 months ago

For users that are very privacy-aware, I suggest prefetching the cell towers in their region and not using accurate wifi-based location at all. Prefetched cell tower data would also be sufficient for faster GPS via local SUPL, which doesn't need high accuracy and can handle inaccuracies of tens of kilometers pretty well.

I doubt that protecting a location service with device attestation makes a lot of sense for two reasons:

  1. You want people who are able and willing to collect wifi locations using highly sensitive long-distance hardware to do that and contribute. This hardware is very likely not compliant with any attestation, but can be an extremely good source.
  2. Even when device attestation is used, it is "trivial" to introduce wrong records, as we're talking about uploading data that was previously retrieved from public radio airspace. Setting up a device that appears as a hotspot with arbitrary strength and MAC address is easy enough for people who are willing to abuse it.

Publishing the raw data (both raw data of submissions and raw locations of wifi hotspots) can be a serious privacy issue - entirely independent of what privacy laws might state. As you are concerned with abuse, please also consider the abuse risk of this data. For example, stalkers will be able to follow their victims even when they move to a different location for as long as they continue to use the same wifi hardware.

Requiring an account for uploading data is not generally a bad idea, however this account setup should be as easy as possible (e.g. could be just generating a unique random id client side). If passive contributing (that is: uploading data about wifi networks when GPS is already in use anyway) is easy to activate (e.g. just ticking a box) this would allow users to contribute without any downsides. A unique random id would still be useful for abuse prevention, as continued and proven correct submissions from a unique id could be used to increase trust in its data - so the "account" would only be used to gain trust, not to ban bad actors (which is hard anyways, as they will likely be able to create a new identity).

thestinger commented 3 months ago

I doubt that protecting a location service with device attestation makes a lot of sense for two reasons

Any service we provide needs to be heavily protected from abuse, as do the people involved in it. All of our services are targeted with attacks including denial of service, Child Sex Abuse Material spam, gore spam and harassment. If the service has any form of comments or discussions related to it, all of that is relevant too.

This gives us the advantage of already being prepared for the abuse that will eventually target a service like this once it becomes popular. We're already thinking about it and preparing for it before even starting.

You want people who are able and willing to collect wifi locations using highly sensitive long-distance hardware to do that and contribute. This hardware is very likely not compliant with any attestation, but can be an extremely good source.

It's also possible to have the concept of trusted contributors who have an established account with a history of submitting valid data and who can apply to submit data from a non-verified device. The quality of the resulting data heavily depends on not simply accepting any data submission and treating all submissions as equal. The data submitted by different hardware will also vary in how it needs to be treated.

Even when device attestation is used, it is "trivial" to introduce wrong records, as we're talking about uploading data that was previously retrieved from public radio airspace. Setting up a device that appears as a hotspot with arbitrary strength and MAC address is easy enough for people who are willing to abuse it.

Hardware attestation provides a baseline which allows deploying other mitigations against abuse. People doing this also have a limit to how much money they're willing to spend on new phones.

Publishing the raw data (both raw data of submissions and raw locations of wifi hotspots) can be a serious privacy issue - entirely independent of what privacy laws might state. As you are concerned with abuse, please also consider the abuse risk of this data. For example, stalkers will be able to follow their victims even when they move to a different location for as long as they continue to use the same wifi hardware.

Wigle supports lookups by SSID/MAC already:

https://wigle.net/

Therefore, it doesn't seem to be a valid reason to oppose publishing open data. Many people believe in gathering and publishing this as data in the public domain usable for any purpose. A service making the data available to everyone has a much higher chance of success than one hoarding it and building a business model around it. That's not what we want to do. Submitting data to companies profiting off it without compensating the people submitting it or giving them rights to the overall data doesn't seem right.

Some of the main perpetrators of the attacks on GrapheneOS and myself are among the main developers of your project. It was one of your supporters who did multiple swatting attacks targeting me in April 2023 with the clear aim of having me killed by law enforcement. It was carefully crafted to maximize the chance of that outcome. Perhaps it was one of the people who makes commits to your project who did it. You probably wouldn't kick them out even if we obtained proof of that, based on past experience. It doesn't make any sense for someone like yourself, who says you only care about the code and will allow Nazis to contribute to your project if they write good code, to start pretending to care about abuse.

mar-v-in commented 3 months ago

Wigle supports lookups by SSID/MAC already:

https://wigle.net/

Therefore, it doesn't seem to be a valid reason to oppose publishing open data. Many people believe in gathering and publishing this as data in the public domain usable for any purpose. A service making the data available to everyone has a much higher chance of success than one hoarding it and building a business model around it.

Someone else doing it this way doesn't mean it's a good idea ;) There likely are reasons why the database of Mozilla had many more contributors even if the Wigle database is more open. I personally would not want to contribute to a database that can be easily abused by people with malicious intent.

And I wasn't suggesting to make a business model out of it, nor to "hoard" it. Making data not open to everyone is not the same as keeping data private. Data could be provided to others under clear rules that prevent abuse or make it sufficiently hard, while not impacting its usefulness. For example, a rule could be that a full dump of raw data is provided to researchers if usage is monitored by an IRB and all copies of the data are removed within a year after the research was conducted. I'm not saying it's easy to come up with appropriate rules for this, but I would argue it would be worth it to prevent abuse.

Off-topic: on accusations

As I previously said, I feel sorry for what happened to you, and you are welcome to provide detailed accusations and supporting material by email to me, e.g. by mail to admin at microg.org, or to some independent third party. And of course these can lead to a ban of individuals from the project. I don't think it's fair to claim wrongdoings by me personally because of a conflict you seem to have with an unnamed contributor to my project. I'd also like to point out that the project only has a single "main developer" beside myself and I doubt you are referring to that person, so maybe you should downgrade your wording to "minor contributor". And to reiterate the obvious: I don't want any contributions from Nazis and will happily ban those, no matter how good or how much code they were to contribute. But I also don't think this is the venue to discuss any of this.

matchboxbananasynergy commented 3 months ago

Wigle supports lookups by SSID/MAC already: https://wigle.net/ Therefore, it doesn't seem to be a valid reason to oppose publishing open data. Many people believe in gathering and publishing this as data in the public domain usable for any purpose. A service making the data available to everyone has a much higher chance of success than one hoarding it and building a business model around it.

Someone else doing it this way doesn't mean it's a good idea ;) There likely are reasons why the database of Mozilla had many more contributors even if the Wigle database is more open. I personally would not want to contribute to a database that can be easily abused by people with malicious intent.

I think the benefits of properly open and accurate data are too important for this not to at least be thought about. There will be concerns of abuse, both from potential contributors to the database and in what can be done with the dataset itself. Just as you can brainstorm potential mitigations for reducing or eliminating abuse on the contributor's end of things, you can do the same for the other side of the coin.

As an example of a very crude idea (which, to be clear, I'm not sure would be effective): Wi-Fi beacons are meant to be static landmarks. If a static landmark has been recorded in the dataset previously and then moves to a different location, an attempt to enter it at its new position could instead be rejected. I'm sure much smarter people than me (as someone who's not technical) can come up with proper solutions to any concerns.
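A rough sketch of what that rejection rule might look like (the record type, threshold, and function names here are all made up for illustration):

```kotlin
import kotlin.math.*

data class ApRecord(val bssid: String, val lat: Double, val lon: Double)

const val MAX_DRIFT_METERS = 500.0 // hypothetical threshold

// Great-circle distance between two coordinates, in meters.
fun haversineMeters(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double {
    val r = 6_371_000.0
    val dLat = Math.toRadians(lat2 - lat1)
    val dLon = Math.toRadians(lon2 - lon1)
    val a = sin(dLat / 2).pow(2) +
            cos(Math.toRadians(lat1)) * cos(Math.toRadians(lat2)) * sin(dLon / 2).pow(2)
    return 2 * r * asin(sqrt(a))
}

// Accept a submission only if the AP is new or hasn't "moved" beyond the threshold.
fun shouldAccept(existing: ApRecord?, submitted: ApRecord): Boolean =
    existing == null || haversineMeters(
        existing.lat, existing.lon, submitted.lat, submitted.lon
    ) <= MAX_DRIFT_METERS
```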

And I wasn't suggesting to make a business model out of it, nor to "hoard" it. Making data not open to everyone is not the same as keeping data private. Data could be provided to others under clear rules that prevent abuse or make it sufficiently hard, while not impacting its usefulness. For example, a rule could be that a full dump of raw data is provided to researchers if usage is monitored by an IRB and all copies of the data are removed within a year after the research was conducted. I'm not saying it's easy to come up with appropriate rules for this, but I would argue it would be worth it to prevent abuse.

From my perspective, it feels like you're essentially arguing your way into the same position down the line. The proper solution for creating a dataset that's useful and will continue to be useful is for it to be open. It wouldn't make sense to essentially repeat the model that has brought us to this position in the first place. Having a dataset that's resistant to abuse and licensed in a way where it's usable by everyone for every purpose means there will be incentive for people to contribute to it, and that this data won't risk being lost again, resulting in everyone having to start over.

My suggestion would be working through the problems, not repeating what has landed us in this position in the first place.

thestinger commented 3 months ago

Someone else doing it this way doesn't mean it's a good idea ;) There likely are reasons why the database of Mozilla had many more contributors even if the Wigle database is more open. I personally would not want to contribute to a database that can be easily abused by people with malicious intent.

It means someone can already look up a MAC / BSSID in a huge database. The data isn't open for anyone to use for any purpose, but it is available for the abusive purpose you mentioned. It's not usable for a lot of other purposes, but the stalking method you brought up is already supported, so making open data available won't create a new form of abuse when several companies already provide this data. It's also possible to do the stalking method you mentioned by simply submitting a single Wi-Fi network, if the service allows detection based on a single network, and why shouldn't it? It's already possible to look them up on proprietary services.

Mozilla is a well known organization and at least historically widely trusted. They used to be promoting their tool for gathering data. Most people also likely didn't realize that it was a proprietary database. It being shut down demonstrates one of the drawbacks of it being a proprietary database. Inability to use the data locally without leaking location to a service is another major drawback. The service being hosted by a provider someone trusts doesn't change that it could be compromised by an insider, an exploit, infrastructure compromise, etc. to obtain valuable real time tracking of users. Not having open data is a choice to force this approach.

The same thing can happen again if the replacements do not have open data. If the data is open, a company cannot stop it with patent threats because anyone can host it and the data can be submitted anywhere by the people gathering it. This is also further reason to have solid ways of mitigating poisoned data so that the data is more portable between different services. An organization hosting a new service may claim that they'll stand up to patent threats but they may lose and then the data will once again be lost if it's not available. It means a small group of people get to determine things for everyone even though they weren't responsible for gathering the vast majority of the data, etc.

Making the same mistake of having another proprietary database and therefore overall proprietary service doesn't seem wise. It also prevents having local copies of the Wi-Fi database. It should be possible to download the database for an area to avoid needing to send your location to a service. The data belongs to the people willing to gather it and it makes sense to submit it to servers which respect them and give them access to the resulting database. Services which insist on being the only ones with the data and mandate that users send their location to the service in order to use it are highly questionable.

Wi-Fi APs are broadcasting over public airwaves and are already heavily mapped. Wi-Fi AP locations aren't private and won't really become any less private through an open data mapping project existing.

And I wasn't suggesting to make a business model out of it, nor to "hoard" it. Making data not open to everyone is not the same as keeping data private. Data could be provided to others under clear rules that prevent abuse or make it sufficiently hard, while not impacting its usefulness. For example, a rule could be that a full dump of raw data is provided to researchers if usage is monitored by an IRB and all copies of the data are removed within a year after the research was conducted. I'm not saying it's easy to come up with appropriate rules for this, but I would argue it would be worth it to prevent abuse.

It's inherently a business model if anyone is being paid to work on it which is the only reasonable way for it to happen. Having it done by a non-profit or without an incorporated entity doesn't change that people are financially benefiting from it based on being the only ones with access to the data. The people submitting the data should have access to it under the same license. It's ultimately the people willing to do the work of collecting the data who hold the power. They can choose to submit to an open database service instead of a proprietary database service where a small group of people hold all the power and can choose to shut it down with all their work destroyed.

I don't think it's fair to claim wrongdoings by me personally because of a conflict you seem to have with an unnamed contributor to my project.

These are not unnamed contributors to the project but very active developers. You're also paid by people involved in this. You brought up the possibility of abuse if we published open data for publicly available information, and yet you've had no issue with harassment which financially benefits you and your project. You've heavily spread misinformation yourself. There's no point in acting uninvolved in it. You share responsibility for the harassment including swatting attacks. If you bring up abuse and imply other people are going to be enabling it, this is very relevant.

Sapiosenses commented 3 months ago

Publishing the raw data (both raw data of submissions and raw locations of wifi hotspots) can be a serious privacy issue - entirely independent of what privacy laws might state. As you are concerned with abuse, please also consider the abuse risk of this data. For example, stalkers will be able to follow their victims even when they move to a different location for as long as they continue to use the same wifi hardware.

Requiring an account for uploading data is not generally a bad idea, however this account setup should be as easy as possible (e.g. could be just generating a unique random id client side). If passive contributing (that is: uploading data about wifi networks when GPS is already in use anyway) is easy to activate (e.g. just ticking a box) this would allow users to contribute without any downsides. A unique random id would still be useful for abuse prevention, as continued and proven correct submissions from a unique id could be used to increase trust in its data - so the "account" would only be used to gain trust, not to ban bad actors (which is hard anyways, as they will likely be able to create a new identity).

I am not particularly comfy with the idea of HW attestation for contributions. And I agree about the potential for abuse of user-generated RF emitter/geocoordinate data but I think there are ways this could be minimized. No time to expand on that rn unfortunately, mostly wanted to post my thoughts on the collection/validation of data below.

Given that we already have UNLP backends that build on-device RF emitter databases based on approximate coordinates using a concurrently available GPS fix, it seems to me that there could be some automated ways to validate that data without requiring all these draconian submitter requirements that, as you mention, could also be intentionally poisoned.

Just add a function to the UNLP backend in GMSCore: if the device has a current GPS fix and the reported geocoordinates of an RF emitter seem grossly inaccurate based on it, have a way to send an anonymized "vote" back to the data server that that emitter source seems untrustworthy at that moment. Only after a minimal number of subsequent "sightings" verified the estimated location with their concurrent GPS fix would it return to "validated" status in the database.
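A rough server-side sketch of that vote/revalidation flow (the threshold and all names are invented for illustration):

```kotlin
data class EmitterRecord(
    val id: String,
    var validated: Boolean = true,
    var verifiedSightings: Int = 0,
)

const val REVALIDATION_SIGHTINGS = 3 // hypothetical minimum

// A client with a concurrent GPS fix reports the stored coordinates as grossly wrong.
fun onDistrustVote(emitter: EmitterRecord) {
    emitter.validated = false
    emitter.verifiedSightings = 0
}

// A client with a concurrent GPS fix confirms the stored coordinates.
fun onVerifiedSighting(emitter: EmitterRecord) {
    if (!emitter.validated && ++emitter.verifiedSightings >= REVALIDATION_SIGHTINGS) {
        emitter.validated = true
    }
}
```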

thestinger commented 3 months ago

Hardware attestation on Android is privacy preserving and therefore requires no permission. It uses an app-specific provisioned certificate on modern devices. On older devices, it uses a batch key provisioned to at least 100k devices as a minimum. It can simply be one signal among many that data is valid and makes other signals stronger.
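For illustration, a minimal sketch of requesting an attested key on Android with the platform keystore API (the key alias and the server-supplied nonce are placeholders; verifying the returned certificate chain against the attestation roots happens server-side):

```kotlin
import android.security.keystore.KeyGenParameterSpec
import android.security.keystore.KeyProperties
import java.security.KeyPairGenerator
import java.security.KeyStore
import java.security.cert.Certificate

fun attestationChain(serverNonce: ByteArray): Array<Certificate> {
    val generator = KeyPairGenerator.getInstance(
        KeyProperties.KEY_ALGORITHM_EC, "AndroidKeyStore")
    generator.initialize(
        KeyGenParameterSpec.Builder("submission_key", KeyProperties.PURPOSE_SIGN)
            .setDigests(KeyProperties.DIGEST_SHA256)
            // The server-provided nonce is embedded in the attestation extension,
            // binding the resulting certificate chain to this request.
            .setAttestationChallenge(serverNonce)
            .build())
    generator.generateKeyPair()
    val keyStore = KeyStore.getInstance("AndroidKeyStore").apply { load(null) }
    return keyStore.getCertificateChain("submission_key")
}
```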

Given that we already have UNLP backends that build on-device RF emitter databases based on approximate coordinates using a concurrently available GPS fix, it seems to me that there could be some automated ways to validate that data without requiring all these draconian submitter requirements that, as you mention, could also be intentionally poisoned.

We'll be implementing our own client and server software. We don't and won't use that.

Just add a function to the UNLP backend in GMSCore: if the device has a current GPS fix and the reported geocoordinates of an RF emitter seem grossly inaccurate based on it, have a way to send an anonymized "vote" back to the data server that that emitter source seems untrustworthy at that moment. Only after a minimal number of subsequent "sightings" verified the estimated location with their concurrent GPS fix would it return to "validated" status in the database.

That kind of anti-abuse mechanism seems prone to abuse itself. People can fake being many different users if there isn't some way of preventing that.

Djfe commented 3 months ago

This honestly leaves me feeling devastated. I'm staring at the screen with a bit of a blank expression 😮‍💨 It doesn't come as a surprise, no, but I was actually still contributing from time to time using the Tower Collector app (even though this felt limited without wifi scans). I always saw the broader appeal of providing this type of service as open (source) to counter Google, Wigle etc., who don't give a * about privacy and didn't give you geofencing or whatever.

Please find a way to not throw away the data that we collected all these years, even if lots of it is outdated now. It would mean a lot, honestly! I hope my message finds you well. I hope you see that MLS is still cared about. Also: software patents are a b-word and I root for Mozilla for being part of the movement against them. I'm actually wondering: was moving MLS to European soil, where software patents don't exist, never a legal option?

Is there a possibility to move the data into a new project with a new owner, if anyone would care to step up and you are confident about them not abusing the data? Or is that impossible due to copyright and such a company/legal entity not being Mozilla? (Users probably only allowed Mozilla to use the data?)

Big thank you to anyone working on this back in the day. I really had a great time contributing in my way on GitHub and I always felt appreciated by the devs. That was so awesome. MLS was a big part of my hobbies for 3-5 years. I used to map my neighborhood, city etc. this way to contribute. This felt important to me growing up as a teenager, and the project gave me a firm belief in the ways and power of open source. It was the project that taught me commands like git rebase, even if it was only to improve the German translation of the MozStumbler app 😅

I'm having quite the nostalgic moment right now.

Edit: Dunno if it means something to anyone here, but I'm actually mourning this terrible news 😭 Not specifically for the service or because I require using it, but for the social meaning this had to me back then. This is the end of an era for me, all the time I spent on this. Not because it was for naught, but it comes crashing back to me because this excited me so much back then. I started bawling my eyes out when I first hit send on this post, and I'm usually not this emotional.

thestinger commented 3 months ago

The next service should have open data so that future work is not wasted again. It's public data broadcast over the air. The real privacy concern is sending your location in real time to a service because you can't download open data to use locally on your device for location detection. It seems highly unlikely that they'll even be able to pass along the data due to the legal issues and the agreement they came to in order to resolve it. Passing along the data to someone else to run it instead in another jurisdiction would be a huge hole in the legal agreement. It's possible the legal agreement has a huge loophole, but I doubt it. Unless that huge loophole exists, the data will need to be gathered again. The same mistake of a proprietary database should be avoided. There's no need to trust a centralized service and be unable to do local location detection or self-hosting a server. It's public data broadcast over the air and can be part of a public database usable by anyone.

People can already look up a MAC on a proprietary service for free or obtain access to a proprietary database, also likely for free. This should not block creating an open database when proprietary databases usable for it already exist. A service trying to prevent those lookups would also need to avoid answering queries containing only a single Wi-Fi network, which would lower its usefulness.

Sapiosenses commented 3 months ago

We'll be implementing our own client and server software. We don't and won't use that.

This is a microG issue, my comments pertain to microG.

You can do whatever you want.

That kind of anti-abuse mechanism seems prone to abuse itself. People can fake being many different users if there isn't some way of preventing that.

The likelihood of anyone doing that, and especially succeeding with it, is very low.

I often spend time in places where there are hundreds of detectable RF emitters operated by both private and public entities, individuals and organizations. It's literally impossible for a malicious party to compromise all of them, and if one or two were compromised either internally or via using their MAC associated with a bogus set of geocoordinates it would be trivial to see that those are outliers whose data should be thrown out, especially as many other "viewers" would be passing by on a daily basis, throwing that data into question.

Asking for draconian requirements for a crowdsourced collection of data would be like Wikipedia demanding full legal government ID from all submitters to their database. They have come up with ways to ensure that the resulting product is not grotesquely incorrect without imposing such draconian requirements on their contributors and vastly decreasing the number of people willing to contribute.

I doubt any reasonable person would expect or demand that every single record in that db be 100% accurate all the time. Case in point is how long people have put up with the Mozilla db despite the fact it has become extremely inaccurate for probably at least 2 years now. (That's how long ago I stopped using it for that reason)

Neither is HW attestation going to guarantee that accuracy.

matchboxbananasynergy commented 3 months ago

What exactly is "draconian" about using a modern device with the security model intact? Hardware attestation doesn't pose a privacy risk, so it's unclear what's horrible about it. Also, I believe it has been stated multiple times in this thread that it doesn't necessarily have to be the only way to curb abuse, and it doesn't even necessarily have to be a prerequisite to submit data to the database. There could be different trust levels based on how the data is being submitted, and there can even be variants of the database that can be deployed based on what "trust" level the person deploying it for their project sees fit.

thestinger commented 3 months ago

I often spend time in places where there are hundreds of detectable RF emitters operated by both private and public entities, individuals and organizations. It's literally impossible for a malicious party to compromise all of them, and if one or two were compromised either internally or via using their MAC associated with a bogus set of geocoordinates it would be trivial to see that those are outliers whose data should be thrown out, especially as many other "viewers" would be passing by on a daily basis, throwing that data into question.

That's not what was brought up; the concern is someone submitting invalid data from many different accounts. Legitimate submissions would be outnumbered by someone poisoning the data without mitigations in place.

Asking for draconian requirements for a crowdsourced collection of data would be like Wikipedia demanding full legal government ID from all submitters to their database. They have come up with ways to ensure that the resulting product is not grotesquely incorrect without imposing such draconian requirements on their contributors and vastly decreasing the number of people willing to contribute.

No draconian requirement has been proposed, but rather using a hardware feature which supports alternate operating systems and can support different submission apps and different builds of them too. All Android certified devices support it. iOS can't be used for this anyway. Doing it with a privacy-preserving technology in fact reduces the need to use other mechanisms.

Wikipedia is full of misinformation and biased articles heavily influenced by paid influence campaigns. That is exactly the situation which needs to be avoided for databases open to contributions. It should not be possible for an attacker with a lot of resources to ruin the database or use it as part of targeted attacks. Ruining the database for a specific region could be part of an attack. This matters. Apple and Google are surely dealing with this for their databases. A serious alternative needs to consider it too. It essentially requires the same kind of approach as anti-fraud technologies.

There are a lot of useful ways to do this. There's no reason to reject data outright; rather, it can be added to the database but not used in practice. That is an anti-abuse mechanism itself. If the requirements for data to be considered valid change, which can be flexible, then that data can be used. Analysis of which accounts submitting data are trustworthy can be dynamic and based on multiple signals.

Building a high quality database that's hard to mess with, not just by trolls but by more sophisticated attackers, is an important part of it. This is what we want, published as open data for people to download and use for self-hosted servers, local location detection, etc. rather than being forced to use any particular service. The main job of the service should really be curating the database and getting submissions. It would be useful without any remote service for querying location at all, as long as the data can be downloaded in a useful form. It can be heavily compressed for on-device use.

I doubt any reasonable person would expect or demand that every single record in that db be 100% accurate all the time. Case in point is how long people have put up with the Mozilla db despite the fact it has become extremely inaccurate for probably at least 2 years now. (That's how long ago I stopped using it for that reason)

The Mozilla service was never seriously considered by GrapheneOS because of the major accuracy issues and the lack of basic defenses against poisoning the data. There are many reasons people may want to poison the data not limited to harming the projects using it and hosting it.

Neither is HW attestation going to guarantee that accuracy.

It is one valuable part of an overall approach to mitigating poisoning the data.

FedericoCeratto commented 3 months ago

There are ways to reduce the privacy leaks and uploads of fake data, ranging from simpler things like submitting salted cryptographic hashes of sets of AP MAC addresses to using homomorphic cryptography as in https://ieeexplore.ieee.org/document/9079655

Yet, without millions of users uploading datapoints regularly, the quality of any dataset would quickly degrade.
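As a toy example of the simpler end of that range (a hypothetical scheme, not the one from the cited paper), a submission could carry a salted hash of the sorted set of observed BSSIDs instead of the raw addresses:

```kotlin
import java.security.MessageDigest

// Hash a set of observed BSSIDs with a salt so the server can match the
// fingerprint against known sets without handling raw addresses directly.
fun fingerprint(bssids: List<String>, salt: ByteArray): ByteArray {
    val canonical = bssids.map { it.lowercase() }.sorted().joinToString(",")
    return MessageDigest.getInstance("SHA-256").digest(salt + canonical.toByteArray())
}
```

As the next comment points out, though, the low entropy of individual MAC addresses limits what hashing alone can protect.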

thestinger commented 3 months ago

@FedericoCeratto MAC addresses are not assigned fully randomly and have very low entropy (48 bits at most, far less in practice given the known vendor prefixes), so they can be brute-forced from a hash. Apps can't access those kinds of device identifiers on modern Android anyway.
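To illustrate the entropy point (all names hypothetical): with a known 24-bit vendor OUI prefix, only the remaining 24 bits of an EUI-48 MAC need to be enumerated, so a salted hash of a single address can be reversed quickly once the salt is known:

```kotlin
import java.security.MessageDigest

fun sha256(bytes: ByteArray): ByteArray =
    MessageDigest.getInstance("SHA-256").digest(bytes)

// Enumerate the 2^24 device suffixes under one vendor prefix and compare hashes.
fun recoverMac(targetHash: ByteArray, salt: ByteArray, oui: ByteArray): ByteArray? {
    val candidate = ByteArray(6)
    oui.copyInto(candidate)
    for (suffix in 0 until (1 shl 24)) {
        candidate[3] = (suffix ushr 16).toByte()
        candidate[4] = (suffix ushr 8).toByte()
        candidate[5] = suffix.toByte()
        if (sha256(salt + candidate).contentEquals(targetHash)) return candidate.copyOf()
    }
    return null
}
```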

Hardware attestation is a privacy preserving way to confirm it's a real device running an OS with the security model intact based on being the stock OS or another OS preserving it. It also has the app id and app signing key hash from the OS passed to the hardware to sign. This doesn't prevent someone pretending to be many users on the same device but does help provide a baseline for mitigating fake data.

It doesn't really seem necessary to fingerprint a device in the way you're talking about; ANDROID_ID could be used, which is per-app, per-profile. There's a limit to how many profiles people can make at the same time too, although they can cycle through new ones.

Mitigating abuse doesn't mean needing to have a perfect way of preventing it, just making it harder, as hardware attestation would do in a meaningful way.

Also, any data gathered by the app about the device and OS can simply be faked without a mitigation for that like hardware attestation.

unchartedxx commented 3 months ago

  2. On March 27th, 2024 we will stop accepting POST data submissions to the API. All submissions will receive a 403 response and the submitted data will be discarded. Additionally, we will stop publishing new exports of cell data for download.

3. On April 10th, 2024 the cell data downloads will be deleted and will no longer be available.

Could Mozilla provide full cell data exports about all the cells in their database before closing everything down? Even the full exports right now contain only the cells visible in the latest 12 months.

janc13 commented 3 months ago

Stupid question/proposal: why are we collecting (always outdated and hard-to-verify) WiFi MAC/BSSID locations when WiFi APs could (in the future) broadcast their (approximate) location instead? Does anybody here have access to whatever forum sets WiFi standards?

Alternatively/additionally, it should be possible to use mDNS and/or UPnP to request a (home) router’s (approximate) location. That requires you have access to the LAN/WLAN of course, but would already cover many situations where people are indoors with no GNSS available, and it could be implemented with a very simple firmware update (most home routers already have UPnP & mDNS support, so only an option to set the location would have to be added).

Both (when widely implemented) would help provide accurate enough locations for a lot of use cases, without any need for massive data collection…

I understand that both methods could be abused to broadcast an incorrect location, but that would usually be very obvious & easily ignored. (And current methods can be and are being abused too—including GNSS.)

mar-v-in commented 3 months ago

@janc13 There is 802.11mc / Wi-Fi RTT / FTM, which would solve this and would even allow for location accuracy far beyond typical GPS (field tests show less than 1m offset with 95% probability). However, until now only a small set of access points and client devices have implemented it, and even fewer are configured with accurate location information.
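For reference, a minimal sketch of ranging against 802.11mc-capable APs with the platform API on Android 9+ (permission checks and error handling omitted):

```kotlin
import android.content.Context
import android.net.wifi.ScanResult
import android.net.wifi.rtt.RangingRequest
import android.net.wifi.rtt.RangingResult
import android.net.wifi.rtt.RangingResultCallback
import android.net.wifi.rtt.WifiRttManager

fun rangeTo80211mcAps(context: Context, scanResults: List<ScanResult>) {
    val rtt = context.getSystemService(WifiRttManager::class.java)
    val request = RangingRequest.Builder()
        // Only APs advertising 802.11mc responder support can be ranged.
        .addAccessPoints(scanResults.filter { it.is80211mcResponder }
            .take(RangingRequest.getMaxPeers()))
        .build()
    rtt.startRanging(request, context.mainExecutor, object : RangingResultCallback() {
        override fun onRangingResults(results: List<RangingResult>) {
            results.filter { it.status == RangingResult.STATUS_SUCCESS }.forEach {
                println("${it.macAddress}: ${it.distanceMm / 1000.0} m " +
                        "(±${it.distanceStdDevMm / 1000.0} m)")
            }
        }
        override fun onRangingFailure(code: Int) = println("ranging failed: $code")
    })
}
```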

janc13 commented 3 months ago

That doesn’t seem to broadcast a location (except relative to the AP, which is useless if you don’t know its location)?

Also, RTT should not be needed for this, as in most cases it would be enough to know the location of the AP itself (usually you would be within ~10-50m, which is good enough for most purposes), and it seems to me like RTT would probably reveal too much precision too easily. People don’t need to know in what cupboard your AP is installed.

mar-v-in commented 3 months ago

The measurement report element of 802.11mc also includes the subelements Location Configuration Information and Location Civic Report, which provide latitude and longitude of the AP, but can also include extended details like address, apartment number, room number and even down to seat or desk number.

The RTT name is a little misleading, but the reason to measure RTT this accurately in the first place was to allow locating; that's why location data is provided at the same time.

I disagree that 10-50m accuracy would be good enough in all cases. It's of course good enough for weather reports and probably car navigation, but it can already make a significant difference for indoor navigation, especially if multiple floors are involved. Also, for ((semi-)automated) emergency calling, accuracy down to 1m would be very important to have. Think about cases where, due to a medical emergency, you are unable to speak: if the information that can be provided to emergency services (via AML) is "the person that had a heart attack is in this 10-story apartment complex with more than one hundred flats", I'm sure you would prefer to have 1m-accurate indoor location ;)

Sapiosenses commented 3 months ago

@janc13 There is 802.11mc / Wi-Fi RTT / FTM, which would solve this and would even allow for location accuracy far beyond typical GPS (field tests show less than 1m offset with 95% probability). However, until now only a small set of access points and client devices have implemented it, and even fewer are configured with accurate location information.

I'm including a screenshot of a settings page on my 2017 vintage Novatel/Inseego M7730L cellular hotspot which apparently implements this functionality. The latest generation of this device also has this feature and probably all the intervening models.

That said, I still prefer the idea of public collection of RF emitter data rather than relying on those who choose to enable this sort of functionality in their WiFi gear and do so in a non-malicious way.

EDIT: include Inseego doc URL

https://insg.my.site.com/insgtechsupport/s/article/IoT-MiFi-GPS-Over-WiFi

[Screenshot: Novatel/Inseego 7730L settings page showing the GPS-over-WiFi feature]

mar-v-in commented 3 months ago

@Sapiosenses this looks as if it was a vendor-specific custom protocol. microG implements several of those that are available in wifi networks in public transport.

Sapiosenses commented 3 months ago

@Sapiosenses this looks as if it was a vendor-specific custom protocol. microG implements several of those that are available in wifi networks in public transport.

OK, I was afraid it might be.

I did searches for "Wifi over GPS" and didn't see any obvious standards doc on it, but using the 802.11mc string just now I see various things (including the AOSP implementation going back to 2018 and a lot of docs about using a related tech to geolocate someone's position inside a building).

Unfortunately when using the WiFi Alliance search engine I cannot find a single consumer router that implements certified WiFi Location. I only see some chipsets, some enterprise things from HPE/Aruba, and some client implementations.

Perhaps some IoT hardware implements it.

https://www.wi-fi.org/product-finder-results?sort_by=certified&sort_order=desc&categories=4

EDIT: This is the WiFi Alliance's page on it but it is conspicuously missing any link to a list of compliant devices. (See above)

https://www.wi-fi.org/discover-wi-fi/wi-fi-location

alexcottner commented 3 months ago

> Could Mozilla provide full cell data exports about all the cells in their database before closing everything down?

Great suggestion, we took a full dump of the cell tower data and have made it available for download here.

leag1234 commented 3 months ago

e Foundation (based in the EU; behind the /e/OS project), like other projects, is obviously looking for alternatives to MLS as well. We are ready to join and help credible initiatives in the field.

Also, there are remaining sub-grants of up to 50K€ available from the MobiFree project that we have initiated (https://mobifree.org/). This would probably fit nicely with the development of (a part of) a new Location Service.

Also, e Foundation is ready to host such service.

Feel free to contact me about it.

lucasmz-dev commented 3 months ago

Whatever service "we" come up with, I believe should have some properties, or at least try to achieve them:

FedericoCeratto commented 3 months ago

@LucasMZReal Indeed. As mentioned before, there are various cryptographic methods to protect the SSIDs and MAC addresses and prevent users' devices and traveling habits from being exposed; https://ieeexplore.ieee.org/document/9079655 is an example. Bundling the whole dataset might be impractical, but fetching and caching large areas, e.g. a whole city or region, should be easy. Ideally the server-side component should rely on open data and not encourage user lock-in, so that multiple organizations could run the location service together.
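
A minimal sketch of that fetch-and-cache idea, using standard web-mercator "slippy map" tile indexing; the endpoint URL and the binary payload are invented here purely for illustration:

```python
import math
import pathlib
import urllib.request

# Hypothetical endpoint; no such service exists (yet).
BASE_URL = "https://beacons.example.org/tiles/{z}/{x}/{y}.bin"
CACHE = pathlib.Path("~/.cache/beacon-tiles").expanduser()

def tile_for(lat: float, lon: float, zoom: int = 8) -> tuple[int, int]:
    """Web-mercator tile containing (lat, lon); zoom 8 tiles are ~150 km."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

def fetch_region(lat: float, lon: float, zoom: int = 8) -> bytes:
    """Download and cache one coarse tile. The server only ever sees a
    tile index covering a city-sized area, never a precise position."""
    x, y = tile_for(lat, lon, zoom)
    path = CACHE / f"{zoom}-{x}-{y}.bin"
    if not path.exists():
        CACHE.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(BASE_URL.format(z=zoom, x=x, y=y)) as r:
            path.write_bytes(r.read())
    return path.read_bytes()
```

All subsequent lookups can then run locally against the cached tile, so precise positions never leave the device at query time.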

thestinger commented 3 months ago

We're currently looking for someone to hire to work on a fully open system with open data for mapping networks, powering both local location detection and services that provide it, including one hosted by GrapheneOS for both GrapheneOS users and other individuals. This will involve creating the app for gathering data, the service for collecting and processing it, and an app for consuming the data as an Android network location service, with either locally downloaded data or a hosted service. Privacy and security will be taken heavily into account, and it will not have the massive privacy/security flaws of microG, including the long history of location data leaks to apps and the unnecessary tying of network-based location to Google service compatibility.

@LucasMZReal

> Cryptographic privacy (hashing for example, SSID + MAC + other fingerprintable details, can help with making SSIDs not possible to easily look for in a public database, but findable by people who already know the details and just wanna find the location). It would also be possible for lookup, similar to how leaked password services work, where they hash the content, and then only share a fraction of the data, and the service replies with all content that matches that fraction; that way you can make it harder for the hosting to figure out the user location.

> Contributions should be bundled, this is a client thing, but submissions should be bundled IMO by default for every 7 days, so that a current location can't be revealed, and tracking is not possible, the more data there is around an area, the more it could update.

> Some resistance to profiling; it shouldn't be possible to check areas with more data and find out that a person in a certain house, who goes to a certain school and a certain clinic, is contributing to a low-contributor area. It's easy to connect the dots that way in a public database, as more frequently visited places end up having more data around them. In a place where only one person is contributing, you can easily figure out what they're doing and where they're going, even if you don't know the dates. Perhaps client software should refrain from logging close-by places and detect when someone's still.

Privacy for people gathering data can be improved by not using the data until it gets confirmed by multiple separate people. People could have a privacy control to raise or lower this threshold. Delaying usage of the data based on a timer could also help both mitigate data poisoning and protect privacy. Publishing the resulting data derived from the raw submissions shouldn't hurt privacy for people gathering data in practice; it's a very theoretical problem as long as basic privacy mitigations are in place. Once there's a solid map with most of the networks, it also becomes much less of an issue. Trying to track someone who submits data by planting new networks along the paths they might travel doesn't seem like a realistic attack, and it would be easier to do it other ways if someone specific were being targeted. There's also no need to publish the raw data directly, such as timestamps.

The raw data doesn't need to exist on the service in unencrypted form. It can be submitted encrypted with a public key for processing. This can avoid having the data on an internet-facing service, making it far less likely the raw data would ever be leaked. It also doesn't need to be persisted forever.
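
A minimal sketch of that submission path, here using libsodium sealed boxes via PyNaCl; the payload shape is invented, and any asymmetric or hybrid encryption scheme would serve the same purpose:

```python
from nacl.public import PrivateKey, SealedBox

# Generated once on the offline processing machine; only the public
# half is ever deployed to the internet-facing submission endpoint.
processing_key = PrivateKey.generate()

# --- client / internet-facing collector: can encrypt, cannot read ---
sealed = SealedBox(processing_key.public_key).encrypt(
    b'{"observations": [...]}'  # raw report, serialized to bytes
)

# --- offline processing machine: the only place decryption happens ---
report = SealedBox(processing_key).decrypt(sealed)
```

A compromise of the collector then yields only ciphertext, which is the point being made above.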

Hashing will not protect people who submit data to a service in order to obtain a location from it: the service will know their location from the results. The best approach to this is making the resulting mapping data open data that can be downloaded and used locally, with that as the main approach rather than querying a service. Not publishing the mapping data for local use substantially hurts privacy, and the notion that users need to submit their location to a service in real time for privacy reasons doesn't pass muster. It's people who need their privacy protected, not static landmarks that are already mapped, with the data already available to query at https://www.wigle.net/ and elsewhere. The users of the service are the main group who need their privacy protected, along with the people submitting data.

If someone knows the SSID/MAC they can look it up if queries with one network are supported, even with closed data. https://www.wigle.net/ and other services already provide the ability to query the data for one network, so this wouldn't be a new capability for another service to provide. A MAC address is 48-bit, with 24 bits of that used for a vendor prefix (OUI) and the remaining 24 bits generally not randomized but rather assigned incrementally. An SSID is a label for humans and isn't generated like a password. An SSID can be used to opt out of mapping via _nomap, but that's incredibly uncommon.
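
To make the keyspace point concrete: with a vendor prefix known or guessed, an unsalted hash over a MAC leaves at most 2^24 candidates per OUI, which is trivial to enumerate. A sketch of that search follows; SHA-256 is a stand-in here, and per-record salting would block precomputed tables but not this per-record search:

```python
import hashlib

def mac_bytes(oui: int, suffix: int) -> bytes:
    """48-bit MAC: 24-bit vendor prefix (OUI) + 24-bit device suffix."""
    return oui.to_bytes(3, "big") + suffix.to_bytes(3, "big")

def crack_hashed_mac(target_digest: bytes, oui: int) -> str | None:
    """Recover a MAC from its bare SHA-256 digest by walking one
    vendor's 2**24 possible suffixes - seconds to minutes of work."""
    for suffix in range(2 ** 24):
        if hashlib.sha256(mac_bytes(oui, suffix)).digest() == target_digest:
            return ":".join(f"{b:02x}" for b in mac_bytes(oui, suffix))
    return None
```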

See https://www.wigle.net/ for an example of what's already available to query. A new service should take into account what's already out there. It should be possible to explain what the privacy concerns with providing open data actually are; if there aren't any concrete ones, they shouldn't be used as a reason to cripple what a service provides by withholding open data.

> Resistant to patents; as Mozilla has experienced, patents can sabotage the project (federation and decentralization can help here, though I see an issue where independent data could be a problem: people would contribute to different sources that wouldn't be merged, and none of them would be accurate or have enough data)

We'll be making our own service for GrapheneOS with open data published for others to use locally or host themselves. We plan to implement significant mitigations for protecting the privacy of people gathering data. We also intend to add significant mitigations against the data being poisoned, which will make the mapping data that's published much more useful. We haven't decided on how this should be licensed. We generally prefer permissive licensing so that people can use it for anything, but giving everything to a proprietary service without open data could cripple it early on before it has a chance to reach critical mass.

Publishing open data usable locally by apps, by other services, etc. is also something we think is important, and it's a strong motivation to do it ourselves to make sure it happens. The general opposition to open data that would let people query the Wi-Fi data locally on their device is very strange, particularly when services like https://www.wigle.net/ already exist, which is not acknowledged by the people opposed to providing the resulting data (not the raw data, which may compromise the privacy of the people submitting it).

thestinger commented 3 months ago

@FedericoCeratto

> @LucasMZReal Indeed. As mentioned before, there are various cryptographic methods to protect the SSIDs and MAC addresses and prevent users' devices and traveling habits from being exposed; https://ieeexplore.ieee.org/document/9079655 is an example. Bundling the whole dataset might be impractical, but fetching and caching large areas, e.g. a whole city or region, should be easy.

The best way to avoid a service receiving people's data in real time is doing it locally on the user's device, based on downloading data for a region. Users can be offered a way to trade storage usage against privacy, and making the data format extremely efficient should mean there's not much of a compromise to make. There are not a lot of cell towers, so even downloading a database for the entire world could be extremely efficient, comparable to a large app. There are far more Wi-Fi networks, but downloading them for a whole region is still completely practical if they're stored efficiently. It doesn't really seem necessary to even have the SSIDs for this, but rather simply MAC + coordinates. A naive approach would be a massive hash table stored in zstd-compressed blocks, but I'm sure there are much better ways to do it than that. It's a massive set of 64-bit integer keys with 2x 32-bit integer values. That's something which can be done super efficiently. It only needs to be queried by the keys for the purpose of local location detection; no need for querying by location.
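
A minimal sketch of that naive approach, assuming each record is a MAC packed into a u64 plus latitude/longitude in 1e-7 degree fixed point (the block size and record layout are arbitrary illustrative choices):

```python
import bisect
import struct

import zstandard  # pip install zstandard

REC = struct.Struct(">Qii")  # MAC as u64, lat/lon as 1e-7 degree ints
PER_BLOCK = 4096             # records per compressed block

def build(records):
    """records: iterable of (mac_int, lat_e7, lon_e7) tuples.
    Returns (first_keys, blocks): a tiny in-RAM index plus zstd blobs."""
    cctx = zstandard.ZstdCompressor(level=19)
    recs = sorted(records)
    first_keys, blocks = [], []
    for i in range(0, len(recs), PER_BLOCK):
        chunk = recs[i:i + PER_BLOCK]
        first_keys.append(chunk[0][0])
        blocks.append(cctx.compress(b"".join(REC.pack(*r) for r in chunk)))
    return first_keys, blocks

def lookup(mac: int, first_keys, blocks):
    """Decompress only the single block that could contain `mac`."""
    i = bisect.bisect_right(first_keys, mac) - 1
    if i < 0:
        return None
    raw = zstandard.ZstdDecompressor().decompress(blocks[i])
    for off in range(0, len(raw), REC.size):
        key, lat, lon = REC.unpack_from(raw, off)
        if key == mac:
            return lat / 1e7, lon / 1e7
    return None
```

At 16 bytes per record before compression, even a few hundred million networks land in the low single-digit gigabytes, a single-region extract is far smaller, and each query touches only one block.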

lucasmz-dev commented 3 months ago

Point is, I'm not contributing to a project that ignores privacy because "someone else is already doing it". These services need contributors, and I'm not about to be one of them for a service that's just as problematic.

And especially not to a project that won't help others, and that undermines them.