privacy-tech-lab / gpc-android

Code and dynamic analysis scripts for GPC on Android
https://privacytechlab.org/
MIT License

Come up with methodology to identify whether apps need to follow GPC (i.e., CCPA and similar laws) #78

Closed SebastianZimmeck closed 12 months ago

SebastianZimmeck commented 1 year ago

Not all apps are required to respect GPC per the CCPA (and likely the new Colorado and Connecticut privacy laws).

First, the CCPA has to be applicable, i.e., we need a "business." Here are the requirements per CCPA Section 1798.140(d):

"Business" means:

(1) A sole proprietorship, partnership, limited liability company, corporation, association, or other legal entity that is organized or operated for the profit or financial benefit of its shareholders or other owners, that collects consumers’ personal information, or on the behalf of which such information is collected and that alone, or jointly with others, determines the purposes and means of the processing of consumers’ personal information, that does business in the State of California, and that satisfies one or more of the following thresholds:

(A) As of January 1 of the calendar year, had annual gross revenues in excess of twenty-five million dollars ($25,000,000) in the preceding calendar year, as adjusted pursuant to paragraph (5) of subdivision (a) of Section 1798.185.

(B) Alone or in combination, annually buys, sells, or shares the personal information of 100,000 or more consumers or households.

(C) Derives 50 percent or more of its annual revenues from selling or sharing consumers’ personal information.

Maybe, an easy proxy would be to identify whether an app has 100,000 or more downloads or installs. However, the 100,000 would refer to consumers from California. Maybe, we can do some further approximation and assume that all app users are equally distributed across the US and calculate based on that whether we reach 100,000 for California.
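A rough sketch of that approximation follows. It assumes that users are spread across US states in proportion to population and that California accounts for about 12% of the US population. Both are simplifying assumptions, and Play Store install counts are worldwide, so a US-only user figure would have to come from somewhere else. The function and constant names are illustrative only.

```javascript
// Assumption: California holds roughly 12% of the US population.
const CA_POPULATION_SHARE = 0.12;
// Consumer/household threshold from CCPA 1798.140(d)(1)(B).
const CCPA_CONSUMER_THRESHOLD = 100000;

// Estimate California users from a US-wide user count.
function estimateCaliforniaUsers(usUsers, caShare = CA_POPULATION_SHARE) {
  return Math.round(usUsers * caShare);
}

// True if the estimate reaches the 100,000 threshold.
function likelyMeetsConsumerThreshold(usUsers, caShare = CA_POPULATION_SHARE) {
  return estimateCaliforniaUsers(usUsers, caShare) >= CCPA_CONSUMER_THRESHOLD;
}

console.log(estimateCaliforniaUsers(1000000));
console.log(likelyMeetsConsumerThreshold(500000));
```

Under these assumptions an app would need roughly 830,000 US users to reach 100,000 in California. Note that the threshold concerns consumers whose personal information is bought, sold, or shared, so a user count is only a proxy.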

Maybe, we could also do something with "Derives 50 percent or more of its annual revenues from selling or sharing consumers’ personal information" since we are only analyzing free apps. But then again maybe that is not true for the company as a whole.

Second, we need the sale of personal information. This point is a bit easier because sharing data with an ad network would qualify as sale.

This question may have some impact on how we collect our app set and how we analyze apps (issues #71 and #73).

SebastianZimmeck commented 1 year ago

As @n-aggarwal mentioned, does anyone have ideas here?

Here are a few ideas:

kasnder commented 1 year ago

We should follow this paper: https://petsymposium.org/popets/2023/popets-2023-0072.pdf

"We focused on the 8 top-ranked Android mobile apps in the 20 Google Play Store categories that have the highest number of cumulative app installs. Companies developing these apps fall or can be reasonably inferred to fall under the CCPA definition of a “business.” We selected only one mobile app (with the highest user install count) per developer in order to have the ability to match the personal information disclosed by the developer with the app that we tested and to examine a broader range of developer practices for responding to VCRs. ... any business that states in their privacy policy that they respond to CCPA VCRs must actually do so, regardless of whether or not they are actually covered by the CCPA. Two researchers from our team independently read the text of 160 privacy policies to determine whether or not each contained references to the CCPA. For cases without a majority consensus, a third researcher provided the tie-breaking vote. Our analysis indicated that out of the selected 160 apps, 109 (68%) include CCPA-specific disclosures in their privacy policies (with Krippendorff’s alpha = 0.81, indicating an acceptable level of inter-rater agreement [33]). For the remainder of this paper, our discussion will focus primarily on these 109 apps."

Additionally, all apps are required to provide a privacy policy on their Play Store page. We could automatically download them and check whether they mention GPC (or CCPA) and claim to follow it. If they do, then we can assume that they have to follow the rules.
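A minimal sketch of such an automated check, assuming the policy text has already been downloaded and converted to plain text. The term list and the function name are illustrative assumptions, not a vetted legal vocabulary.

```javascript
// Terms whose presence suggests a CCPA- or GPC-specific disclosure.
// Assumption: this short list is illustrative and would need tuning.
const DISCLOSURE_PATTERNS = {
  ccpa: /\bCCPA\b|California Consumer Privacy Act/i,
  gpc: /\bGPC\b|Global Privacy Control/i,
};

// Report, per term group, whether the policy text mentions it.
function findDisclosures(policyText) {
  const result = {};
  for (const [label, pattern] of Object.entries(DISCLOSURE_PATTERNS)) {
    result[label] = pattern.test(policyText);
  }
  return result;
}

const policy =
  'California residents have rights under the California Consumer Privacy Act (CCPA).';
console.log(findDisclosures(policy));
```

A keyword hit only flags a policy for manual reading. It does not show that the policy actually claims to honor GPC or CCPA requests.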

wesley-tan commented 1 year ago

On APKPure, there is a way to filter for top apps within the US, but not California itself.

Perhaps one way is to prioritize depth over breadth. It may be tedious, but we can use this list as a start (https://www.similarweb.com/apps/top/google/app-index/us/all/top-free/) and then select the top 10 apps from each category that claim to follow GPC/CCPA.

This link is also interesting: https://www.androidrank.org/android-most-popular-google-play-apps?category=ART_AND_DESIGN
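The per-category selection suggested above could be sketched roughly as follows. The record shape (appId, category, claimsCompliance) is hypothetical, and the input is assumed to be already sorted by rank within each category.

```javascript
// Pick up to `limit` apps per category whose privacy policy was flagged
// as claiming to follow GPC/CCPA. Field names are assumptions.
function selectPerCategory(apps, limit = 10) {
  const picked = new Map();
  for (const app of apps) {
    if (!app.claimsCompliance) continue;
    const bucket = picked.get(app.category) ?? [];
    if (bucket.length < limit) {
      bucket.push(app);
      picked.set(app.category, bucket);
    }
  }
  return picked;
}

// Hypothetical ranked records.
const ranked = [
  { appId: 'com.example.a', category: 'DATING', claimsCompliance: true },
  { appId: 'com.example.b', category: 'DATING', claimsCompliance: false },
  { appId: 'com.example.c', category: 'DATING', claimsCompliance: true },
  { appId: 'com.example.d', category: 'TOOLS', claimsCompliance: true },
];
for (const [category, apps] of selectPerCategory(ranked, 2)) {
  console.log(category, apps.map((app) => app.appId));
}
```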

SebastianZimmeck commented 1 year ago

> On APKPure, there is a way to filter for top apps within the US, but not California itself.

Let's crawl from the US Play Store. If you connect to Google Play from within the US, it will work. Not sure if that means, @n-aggarwal, you would need to do the crawling. @wesley-tan, I imagine a VPN may not work. Although, we have Mullvad, and I can let you know how to connect.

kasnder commented 1 year ago

these aren't VPN things, but rather related to the Google account that's used. those are tied to countries

wesley-tan commented 1 year ago

The good news is that the Google Play scraper I use (https://github.com/facundoolano/google-play-scraper) defaults to the US. I can specify the following:

- appId: the Google Play id of the application (the ?id= parameter on the url).
- lang (optional, defaults to 'en'): the two letter language code in which to fetch the app page.
- country (optional, defaults to 'us'): the two letter country code used to retrieve the applications. Needed when the app is available only in some countries.

Moving ahead, perhaps I can first scrape the TOP_FREE apps in the US for a given category (e.g., the top 50 free apps in GAME_ACTION):

const gplay = require('google-play-scraper');

gplay.list({
  collection: gplay.collection.TOP_FREE,
  category: gplay.category.GAME_ACTION,
  num: 50, 
  fullDetail: true,
})
.then(console.log, console.error);

Then, I can filter the results so that we keep only the apps above a stipulated number of installs (50 million here):

const gplay = require('google-play-scraper');

gplay.list({
  collection: gplay.collection.TOP_FREE,
  category: gplay.category.GAME_ACTION,
  num: 100,
  fullDetail: true,
})
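.then((apps) => {
  return apps.filter((app) => {
    // installs is a display string such as "50,000,000+"; strip every
    // non-digit (commas and the trailing "+") before converting.
    const installs = Number(app.installs.replace(/[^0-9]/g, ''));
    return installs >= 50000000;
  });
})
.then(console.log, console.error);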
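The install-count parsing can be checked offline without querying the Play Store. This sketch assumes the scraper returns install counts as display strings such as "50,000,000+", which is worth verifying against real output. In that case every non-digit character has to be stripped, since `Number('50000000+')` is NaN. The helper name and the sample records are hypothetical.

```javascript
// Convert a display string such as "50,000,000+" into a number.
// Returns NaN when the string contains no digits.
function parseInstalls(installs) {
  const digits = String(installs).replace(/[^0-9]/g, '');
  return digits === '' ? NaN : Number(digits);
}

// Hypothetical records standing in for scraper results.
const sampleApps = [
  { appId: 'com.example.big', installs: '100,000,000+' },
  { appId: 'com.example.small', installs: '1,000,000+' },
];
const popular = sampleApps.filter(
  (app) => parseInstalls(app.installs) >= 50000000
);
console.log(popular.map((app) => app.appId));
```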

I don't exactly know a faster way to filter for the privacy policy parts, but at the very least, we can have a list of viable apps and we can go through the privacy policy accordingly? So perhaps our final list of apps is something like

SebastianZimmeck commented 1 year ago

> Perhaps the main agenda next meeting could be to finalize a method for choosing and filtering the apps

Yes, it would be great if you can come up with a plan as basis for our discussion, @wesley-tan. And then we should start the crawl soon.

wesley-tan commented 1 year ago

Moving ahead, we can do the following (all of the numbers mentioned should be discussed at our meeting on Monday):

  1. Wesley to generate 48 lists (one per category). Each list will contain the top 20 apps in that category (e.g., DATING) in the US. I can do this immediately after our meeting on Monday if confirmed.
  2. There will therefore be 960 possible applications (48 × 20), and we can split the work of reading through the privacy policies. Given 5 readers, each person can find and read about 90 policies per week, so this process will take a little over 2 weeks under this assumption.
  3. Assuming about the same rate as in the paper ("Our analysis indicated that out of the selected 160 apps, 109 (68%) include CCPA-specific disclosures in their privacy policies (with Krippendorff’s alpha = 0.81, indicating an acceptable level of inter-rater agreement [33]). For the remainder of this paper, our discussion will focus primarily on these 109 apps."), we should get about 500-600 applications to analyze (68% of 960 is about 650, so this is on the conservative side).
  4. I will download these remaining applications and place them in the Drive.
  5. Then, proceed with the analysis of these 500-600 applications.
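The arithmetic behind steps 1 to 3 can be checked with a short sketch. The inputs are the numbers from the plan. The 68% rate comes from the quoted paper and is only assumed to carry over to this app set. The variable names are illustrative.

```javascript
// Inputs taken from the plan; the disclosure rate is borrowed from the paper.
const categories = 48;
const appsPerCategory = 20;
const readers = 5;
const policiesPerReaderPerWeek = 90;
const ccpaDisclosureRate = 0.68;

const totalApps = categories * appsPerCategory;
const weeksOfReading = totalApps / (readers * policiesPerReaderPerWeek);
const expectedCcpaApps = Math.round(totalApps * ccpaDisclosureRate);

console.log({
  totalApps,
  weeksOfReading: weeksOfReading.toFixed(1),
  expectedCcpaApps,
});
```

This assumes every listed app is distinct and has a readable privacy policy. Deduplicating by developer or skipping apps without a policy would lower the totals.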

kasnder commented 1 year ago

We decided to start downloading apps now and to decide later whether apps need to follow GPC / CCPA.

SebastianZimmeck commented 1 year ago

Once @wesley-tan has the first 50 or so apps incoming and squared away, including metadata, we can start analyzing, look at our data, etc., and settle on a detailed methodology.

wesley-tan commented 1 year ago

I have downloaded the first 45 apps (I used 45 because those are the top 45 apps that appear in the Google Play desktop GUI). All were downloaded with apkeep from the Google Play Store itself. I have attached the metadata CSV file in the folder as well. Of course, I will continue downloading the apps tomorrow, but it will take time since I realized I need more Google Play accounts (on that note, if anyone has spare Google Play accounts for experiment/testing purposes, those would be welcome too).

I focused on GAME_ACTION and TOP_FREE, downloaded between 11:50 PM SGT (17 Jul) and 12:20 AM SGT (18 Jul). I will find a more systematic way of recording this (perhaps a Google Sheet in the Drive itself). The files are still uploading from my laptop to Google Drive as I type this, but when this is done there should be 45 APKs and a CSV file in the folder.

The Google Drive folder is found here.

SebastianZimmeck commented 1 year ago

> I will find a more systematic way of recording this (perhaps a Google Sheet in the Drive itself)

Yes, I think it would be good to have a Google Sheet or similar accompanying document that includes the date an app was downloaded etc. and its Play Store metadata (these could also be two documents, one for our download metadata and one for the Play Store metadata, either way would work).
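As one possible shape for the download metadata, here is a sketch that turns a record into a CSV row and stores the download time in UTC. The column names, the app id, and the year in the timestamp are assumptions for illustration.

```javascript
// Column order for the download log; the column names are illustrative.
const COLUMNS = ['appId', 'category', 'collection', 'downloadedAt', 'tool'];

// Quote a CSV field if it contains a comma, a double quote, or a newline.
function csvField(value) {
  const text = String(value);
  return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize one record in the fixed column order.
function toCsvRow(record) {
  return COLUMNS.map((column) => csvField(record[column])).join(',');
}

const record = {
  appId: 'com.example.game', // hypothetical app id
  category: 'GAME_ACTION',
  collection: 'TOP_FREE',
  // Local SGT time with an explicit offset, normalized to UTC (year assumed).
  downloadedAt: new Date('2023-07-17T23:50:00+08:00').toISOString(),
  tool: 'apkeep',
};
console.log(COLUMNS.join(','));
console.log(toCsvRow(record));
```

Storing timestamps in UTC avoids ambiguity when team members download from different time zones.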

wesley-tan commented 1 year ago

This task is done, but I will be working with Nishant to ensure that the apps are properly tested. I will extend this deadline to the end of August.

SebastianZimmeck commented 1 year ago

We are going to check whether the Disconnect list reasonably maps to apps that need to respect GPC.

SebastianZimmeck commented 1 year ago

@JustinCasler will be touching base with @sophieeng and @OliverWang13.

SebastianZimmeck commented 1 year ago

We decided to hold off on this issue until next week (unless @n-aggarwal has bandwidth). If @sophieeng and @OliverWang13 find the Disconnect list useful, we should be able to apply the same approach in the context here.

SebastianZimmeck commented 12 months ago

@zatchliu will open a new issue on whether to use Disconnect and/or Oxford list.