Closed SebastianZimmeck closed 12 months ago
As @n-aggarwal mentioned, does anyone have ideas here?
Here are a few ideas:
We should follow this paper: https://petsymposium.org/popets/2023/popets-2023-0072.pdf
"We focused on the 8 top-ranked Android mobile apps in the 20 Google Play Store categories that have the highest number of cumulative app installs. Companies developing these apps fall or can be reasonably inferred to fall under the CCPA definition of a “business.” We selected only one mobile app (with the highest user install count) per developer in order to have the ability to match the personal information disclosed by the developer with the app that we tested and to examine a broader range of developer practices for responding to VCRs. ... any business that states in their privacy policy that they respond to CCPA VCRs must actually do so, regardless of whether or not they are actually covered by the CCPA. Two researchers from our team independently read the text of 160 privacy policies to determine whether or not each contained references to the CCPA. For cases without a majority consensus, a third researcher provided the tie-breaking vote. Our analysis indicated that out of the selected 160 apps, 109 (68%) include CCPA-specific disclosures in their privacy policies (with Krippendorff’s alpha = 0.81, indicating an acceptable level of inter-rater agreement [33]). For the remainder of this paper, our discussion will focus primarily on these 109 apps."
Additionally, all apps are required to provide a privacy policy on their Play Store page. We could automatically download them, and check whether they mention GPC (or CCPA) and claim to follow it. If they do, then we can assume that they follow the rules.
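A rough sketch of what the automated check could look like once a policy's text has been downloaded. The helper name `findPrivacyMentions` and the keyword patterns are assumptions for illustration only. A keyword hit shows that a policy mentions GPC or the CCPA, not that it claims to honor them, so flagged policies would still need a manual read.

```javascript
// Hypothetical helper: keyword scan over the plain text of a privacy policy.
// The patterns are illustrative and would need tuning against real policies.
const PRIVACY_PATTERNS = {
  ccpa: /\b(CCPA|California Consumer Privacy Act)\b/i,
  gpc: /\b(GPC|Global Privacy Control)\b/i,
};

// Returns an object such as { ccpa: true, gpc: false } for the given text.
function findPrivacyMentions(policyText) {
  const result = {};
  for (const [label, pattern] of Object.entries(PRIVACY_PATTERNS)) {
    result[label] = pattern.test(policyText);
  }
  return result;
}

// Made-up sample sentence, not taken from any real policy.
const sample = 'We honor Global Privacy Control signals as required by the CCPA.';
console.log(findPrivacyMentions(sample));
```

Word boundaries keep short acronyms like "GPC" from matching inside longer tokens, and the case-insensitive flag covers spelling variants such as "global privacy control".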
On APKPure, there is a way to filter for top apps within the US, but not California itself.
Perhaps one way is to prioritize depth over breadth. It may be tedious, but we can use this list as a start (https://www.similarweb.com/apps/top/google/app-index/us/all/top-free/) and then select the top 10 apps from each category that claim to follow GPC/CCPA.
This link is also interesting: https://www.androidrank.org/android-most-popular-google-play-apps?category=ART_AND_DESIGN
Let's crawl from the US Play Store. If you connect to Google Play from within the US, it will work. Not sure if that means, @n-aggarwal, you would need to do the crawling. @wesley-tan, I imagine a VPN may not work. Although, we have Mullvad, and I can let you know how to connect.
This isn't a VPN matter; it depends on the Google account that's used. Accounts are tied to countries.
The good news is that the Google Play scraper I use (https://github.com/facundoolano/google-play-scraper) defaults to the US. I can specify the following:
appId: the Google Play id of the application (the ?id= parameter on the url).
lang (optional, defaults to 'en'): the two letter language code in which to fetch the app page.
country (optional, defaults to 'us'): the two letter country code used to retrieve the applications. Needed when the app is available only in some countries.
Moving ahead, perhaps one way to go about this is that I first scrape the TOP_FREE apps in the US under a category (e.g., the top 50 free apps in GAME_ACTION):
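const gplay = require('google-play-scraper');

// Fetch the top 50 free apps in the Action games category (the country defaults to 'us').
gplay.list({
  collection: gplay.collection.TOP_FREE,
  category: gplay.category.GAME_ACTION,
  num: 50,
  fullDetail: true, // request full app details rather than the short listing
})
  .then(console.log, console.error);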
Then, I can filter the results so that we keep only the apps above a stipulated number of installs (50 million here):
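const gplay = require('google-play-scraper');

const MIN_INSTALLS = 50000000; // 50 million

gplay.list({
  collection: gplay.collection.TOP_FREE,
  category: gplay.category.GAME_ACTION,
  num: 100,
  fullDetail: true,
})
  .then((apps) => {
    return apps.filter((app) => {
      // installs can be a label such as "50,000,000+", so strip every non-digit
      // (not just commas) before converting; otherwise Number() yields NaN.
      const installs = Number(String(app.installs).replace(/\D/g, ''));
      return installs >= MIN_INSTALLS;
    });
  })
  .then(console.log, console.error);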
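To combine the install threshold with the "one app per developer" rule from the paper quoted earlier, something like the following could work as a starting point. It is only a sketch: `selectApps`, the `installCount` field, and the sample records are made up for illustration, and it assumes each scraped app carries `developer` and `installs` fields. Install labels may end in "+", so the parser strips all non-digits.

```javascript
// Parse an install label such as "50,000,000+" into a number.
function parseInstalls(label) {
  return Number(String(label).replace(/\D/g, ''));
}

// Keep apps at or above minInstalls, at most one per developer (the one with
// the most installs), sorted by installs, and capped at `limit` apps.
function selectApps(apps, { minInstalls = 50000000, limit = 10 } = {}) {
  const bestByDeveloper = new Map();
  for (const app of apps) {
    const installCount = parseInstalls(app.installs);
    if (installCount < minInstalls) continue;
    const current = bestByDeveloper.get(app.developer);
    if (!current || installCount > current.installCount) {
      bestByDeveloper.set(app.developer, { ...app, installCount });
    }
  }
  return [...bestByDeveloper.values()]
    .sort((a, b) => b.installCount - a.installCount)
    .slice(0, limit);
}

// Hypothetical records standing in for scraper output.
const sampleApps = [
  { appId: 'com.example.one', developer: 'Dev A', installs: '100,000,000+' },
  { appId: 'com.example.two', developer: 'Dev A', installs: '500,000,000+' },
  { appId: 'com.example.three', developer: 'Dev B', installs: '10,000,000+' },
  { appId: 'com.example.four', developer: 'Dev C', installs: '50,000,000+' },
];
console.log(selectApps(sampleApps).map((app) => app.appId));
```

Keeping the selection as a pure function over already-scraped records means the thresholds can be changed and re-run without hitting the Play Store again.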
I don't exactly know a faster way to filter for the privacy policy parts, but at the very least, we can have a list of viable apps and we can go through the privacy policy accordingly? So perhaps our final list of apps is something like
Perhaps the main agenda next meeting could be to finalize a method for choosing and filtering the apps
Yes, it would be great if you can come up with a plan as a basis for our discussion, @wesley-tan. And then we should start the crawl soon.
Moving ahead, we can do the following (all of the numbers mentioned should be discussed at our meeting on Monday)
We decided to start downloading apps and to decide later whether the apps need to follow GPC / CCPA.
Once @wesley-tan has the first 50 or so apps incoming and squared away, including metadata, we can start analyzing, look at our data, etc., and settle on a detailed methodology.
I have downloaded the first 45 apps (I used 45 because that is the number of top apps that appear on the display/GUI on Google Play desktop). All were downloaded with apkeep from the Google Play Store itself. I have attached the metadata CSV file in the folder as well. Of course, I will continue downloading the apps tomorrow, but it will take time since I realize I need more Google Play accounts (on that note, if anyone has spare Google Play accounts for experiment/testing purposes, those would be welcome too).
I focused on GAME_ACTION and TOP_FREE, downloaded between 11:50 PM SGT (17 Jul) and 12:20 AM (18 Jul). I will find a more systematic way of recording this (perhaps a Google Sheet in the drive itself). The files are still uploading from my laptop to Google Drive as I type this, but when this is done there should be 45 APKs and a CSV file in the folder.
Yes, I think it would be good to have a Google Sheet or similar accompanying document that includes the date an app was downloaded etc. and its Play Store metadata (these could also be two documents, one for our download metadata and one for the Play Store metadata, either way would work).
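As one possible shape for the download log, here is a small sketch that turns download records into CSV text that could be imported into a Google Sheet. The column names, the `toCsv` helper, and the sample record (including its placeholder timestamp) are assumptions, not an agreed format.

```javascript
// Hypothetical column set for the download log; adjust as needed.
const LOG_COLUMNS = ['appId', 'category', 'collection', 'downloadedAt', 'tool'];

// Quote a value if it contains a comma, a double quote, or a newline.
function csvEscape(value) {
  const text = String(value ?? '');
  return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build CSV text with a header row followed by one row per record.
function toCsv(records) {
  const header = LOG_COLUMNS.join(',');
  const rows = records.map((record) =>
    LOG_COLUMNS.map((column) => csvEscape(record[column])).join(',')
  );
  return [header, ...rows].join('\n');
}

// Placeholder record; in practice downloadedAt would come from
// new Date().toISOString() at download time.
const log = [
  {
    appId: 'com.example.game',
    category: 'GAME_ACTION',
    collection: 'TOP_FREE',
    downloadedAt: '2000-01-01T00:00:00Z',
    tool: 'apkeep',
  },
];
console.log(toCsv(log));
```

Recording timestamps in UTC ISO 8601 avoids ambiguity when downloads happen from different time zones.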
This task is done, but I will be working with Nishant to ensure that the apps are properly tested. Will extend this deadline to the end of August.
We are going to check whether the Disconnect list reasonably maps to apps that need to respect GPC.
@JustinCasler will be touching base with @sophieeng and @OliverWang13.
Mapping trackers to Disconnect list: If (most of) the trackers on the Disconnect list (as a whole or only certain categories) satisfy the requirements of the law, we can use the Disconnect list as a proxy for GPC applicability.
Mapping trackers to data broker list
We decided to hold off on this issue until next week (unless @n-aggarwal has bandwidth). If @sophieeng and @OliverWang13 find the Disconnect list useful, we should be able to apply the same approach in the context here.
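If we do end up using a tracker list as a proxy, the matching step could be sketched roughly as follows. This assumes the list has already been flattened into a plain set of domains. The domains below are placeholders rather than actual Disconnect entries, and `matchTrackerDomain` and `trackerCoverage` are hypothetical helper names.

```javascript
// Placeholder tracker domains, NOT real Disconnect list entries.
const trackerDomains = new Set(['tracker.example', 'ads.example.net']);

// Return the listed domain that a hostname falls under, or null if none.
// Checks the full hostname first, then each parent domain, never the bare TLD.
function matchTrackerDomain(hostname, domains) {
  const labels = hostname.toLowerCase().split('.');
  for (let i = 0; i < labels.length - 1; i += 1) {
    const candidate = labels.slice(i).join('.');
    if (domains.has(candidate)) return candidate;
  }
  return null;
}

// Fraction of the hostnames an app contacts that map onto the list.
function trackerCoverage(hostnames, domains) {
  if (hostnames.length === 0) return 0;
  const matched = hostnames.filter(
    (hostname) => matchTrackerDomain(hostname, domains) !== null
  );
  return matched.length / hostnames.length;
}

console.log(matchTrackerDomain('cdn.ads.example.net', trackerDomains));
console.log(trackerCoverage(['cdn.ads.example.net', 'example.org'], trackerDomains));
```

Matching on parent domains rather than exact hostnames matters because apps often contact subdomains of a listed tracker.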
@zatchliu will open a new issue on whether to use the Disconnect and/or Oxford list.
Not all apps are required to respect GPC per the CCPA (and likely the new Colorado and Connecticut privacy laws).
First, the CCPA has to be applicable, i.e., we need a "business." Here are the requirements per CCPA 1798.140(d):
Maybe an easy proxy would be to identify whether an app has 100,000 or more downloads or installs. However, the 100,000 threshold refers to consumers from California. Maybe we can approximate further by assuming that app users are equally distributed across the US and calculating on that basis whether we reach 100,000 for California.
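To make the approximation concrete, here is a back-of-the-envelope sketch. It assumes California holds roughly 12% of the US population (a rounded figure that should be replaced with current census data) and that we have some estimate of an app's US users. Play Store install counts are cumulative and, as far as I know, not limited to the US, so the US user figure is itself an assumption. The function names are made up for illustration.

```javascript
// Assumption: California holds roughly 12% of the US population.
const CA_POPULATION_SHARE = 0.12;
// The 100,000-consumer figure discussed above.
const CCPA_CONSUMER_THRESHOLD = 100000;

// Estimate California users from US users under a uniform-distribution assumption.
function estimateCaliforniaUsers(usUsers, share = CA_POPULATION_SHARE) {
  return Math.round(usUsers * share);
}

// True if the estimated California user count reaches the threshold.
function likelyMeetsThreshold(usUsers, share = CA_POPULATION_SHARE) {
  return estimateCaliforniaUsers(usUsers, share) >= CCPA_CONSUMER_THRESHOLD;
}

// Smallest number of US users that reaches the threshold for a given share.
function minUsUsersForThreshold(share = CA_POPULATION_SHARE) {
  return Math.ceil(CCPA_CONSUMER_THRESHOLD / share);
}

console.log(estimateCaliforniaUsers(1000000));
console.log(likelyMeetsThreshold(500000));
console.log(minUsUsersForThreshold());
```

Under these assumptions an app would need somewhat under a million US users to plausibly reach 100,000 in California, so a 50 million install cutoff would clear the bar with a wide margin even if the share estimate is off.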
Maybe we could also do something with "Derives 50 percent or more of its annual revenues from selling or sharing consumers’ personal information" since we are only analyzing free apps. But then again, that may not be true for the company as a whole.
Second, we need the sale of personal information. This point is a bit easier because sharing data with an ad network would qualify as a sale.
This question may have some impact on how we collect our app set and how we analyze apps (issues #71 and #73).