privacy-tech-lab / gpc-android

Code and dynamic analysis scripts for GPC on Android
https://privacytechlab.org/
MIT License

Select and download 1,000 top apps #71

Closed kasnder closed 1 year ago

kasnder commented 1 year ago

This would be important to get us started. We need to make sure that we download split APKs and download the right architecture (e.g. arm64-v8a).

You can use this tool to select (but not download) the top 50 apps from each Play Store genre: https://github.com/facundoolano/google-play-scraper

You might find this tool helpful for downloading: https://github.com/onyxbits/raccoon4

n-aggarwal commented 1 year ago

There are a couple of different ways we could partition our list.

Options:

  • collection (optional, defaults to collection.TOP_FREE): the Google Play collection that will be retrieved. Available options can be found here.
  • category (optional, defaults to no category): the app category to filter by. Available options can be found here.
  • age (optional, defaults to no age filter): the age range to filter the apps (only for FAMILY and its subcategories). Available options are age.FIVE_UNDER, age.SIX_EIGHT, age.NINE_UP.
  • num (optional, defaults to 500): the amount of apps to retrieve.
  • lang (optional, defaults to 'en'): the two letter language code used to retrieve the applications.
  • country (optional, defaults to 'us'): the two letter country code used to retrieve the applications.
  • fullDetail (optional, defaults to false): if true, an extra request will be made for every resulting app to fetch its full detail.

There are 3 Collections:

There are 54 Categories:

Either we could simply get the list of apps by Collection ("top free" and "top grossing"), or we can divide the list up by category, get "top free" and "top grossing" by each category. I am not sure which would be better. On one hand, simply getting the list by Collection gives us the most popular, most used apps; on the other hand, dividing by categories gives us in a way an extra control variable, so we can see if there are any significant differences across categories, although I am not sure what we could say about that.

I don't think we need to get a list of top paid apps for a couple of reasons:

  1. Paid apps tend to not have ads
  2. They are not as popular as free apps (for obvious reasons).
  3. It would cause unnecessary overhead

Also, do we want to cover apps designed for kids? I believe they are subject to stricter guidelines than other apps.

SebastianZimmeck commented 1 year ago

Either we could simply get the list of apps by Collection ("top free" and "top grossing"), or we can divide the list up by category, get "top free" and "top grossing" by each category.

Either "top free" and/or "top grossing" works for me. What are the criteria for including and sorting apps in the "top free" list? Total downloads, current installs, something else ...?

dividing by categories gives us in a way an extra control variable, so we can see if there are any significant differences across categories

I think that could be a nice extra dimension. Are there differences between apps from different categories? So, if there are no other disadvantages (difficult to do, ...), getting an app set from each category could be nice. We could also do 1,000 top free + 50 apps from each category, with some apps appearing in both the top free set and the category sets.
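A minimal sketch of how the "top free + per-category" sets could be combined so that apps in the intersection are kept only once. It assumes both inputs are arrays of objects with an `appId` field, as in the scraper's list output; the helper name `mergeAppSets` and the sample IDs are hypothetical.

```javascript
// Merge a global "top free" list with per-category lists, keeping each app once
// and recording every list it appeared in. Helper name and sample IDs are hypothetical.
function mergeAppSets(topFree, byCategory) {
  const seen = new Map();
  for (const app of topFree) {
    seen.set(app.appId, { appId: app.appId, sources: ['TOP_FREE'] });
  }
  for (const [category, apps] of Object.entries(byCategory)) {
    for (const app of apps) {
      const entry = seen.get(app.appId);
      if (entry) {
        entry.sources.push(category); // app is in the intersection
      } else {
        seen.set(app.appId, { appId: app.appId, sources: [category] });
      }
    }
  }
  return [...seen.values()];
}

const mergedExample = mergeAppSets(
  [{ appId: 'com.example.one' }, { appId: 'com.example.two' }],
  { DATING: [{ appId: 'com.example.two' }, { appId: 'com.example.three' }] }
);
console.log(mergedExample.length); // 3
```

Keeping the `sources` array would let us report later how large the overlap between the sets is.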

I don't think we need to get a list of top paid apps for a couple of reasons:

These are good reasons. I will also add the reason that they are paid. :) (If I recall correctly, you can return a paid app within some time period and get a refund; it may also be possible to get paid apps for free from some unofficial store, but I would have doubts about apps from such a source.)

We should also consider going one order of magnitude higher, i.e., 10,000 apps. Whether we can pull it off depends on how intricate our analysis procedure ends up being, how fast the analysis will be, which resources we need, ...

n-aggarwal commented 1 year ago

What are the criteria for including and sorting apps in the "top free" list? Total downloads, current installs, something else ...?

Unfortunately, the scraper doesn't give any additional criteria for sorting apps. What we can do, though, is get a bigger list than we need, with "full detail". Full detail adds a lot of information, like minimum installs, maximum installs, etc., returned in JSON format. So, once we have a list of apps in full detail, we can write a short program that uses an if statement to filter through the list as we desire! The only issue with this method is that we might get temporarily banned from the Play Store several times while doing this because of the huge number of requests.
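A rough sketch of such a filter, assuming the full-detail objects carry the `free`, `minInstalls`, and `maxInstalls` fields seen in the scraper's output. The function name, the threshold, and the choice to sort by `maxInstalls` are illustrative assumptions, not a settled criterion.

```javascript
// Keep free apps with at least `minInstalls` installs, ordered by maxInstalls
// (descending), and return at most `limit` of them. Name and criteria are hypothetical.
function filterByInstalls(apps, minInstalls, limit) {
  return apps
    .filter((app) => app.free && app.minInstalls >= minInstalls)
    .sort((a, b) => b.maxInstalls - a.maxInstalls)
    .slice(0, limit);
}

// Two records reduced to the relevant fields (values taken from a full-detail response).
const sampleApps = [
  { appId: 'com.htt.baby.hunt.hn', free: true, minInstalls: 1000000, maxInstalls: 1786612 },
  { appId: 'com.playgendary.tom', free: true, minInstalls: 100000000, maxInstalls: 311920071 },
];

console.log(filterByInstalls(sampleApps, 10000000, 10).map((app) => app.appId));
// [ 'com.playgendary.tom' ]
```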

All methods on the scraper have to access the Google Play server in one form or another. When making too many requests in a short period of time (specially when using the fullDetail option), is common to hit Google Play's throttling limit. That means requests start getting status 503 responses with a captcha to verify if the requesting entity is a human (which is not :P). In those cases the requesting IP can be banned from making further requests for a while (usually around an hour).
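One way to soften the throttling problem is to retry failed requests with increasing pauses. The sketch below is generic: `withBackoff` is a hypothetical helper, and `request` stands for any function returning a promise (for instance a wrapper around a scraper call). Since bans can apparently last around an hour, the delays would probably need to be much longer in practice than the defaults here.

```javascript
// Pause for the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `request` and, if it rejects, retry up to `retries` more times,
// doubling the wait after each failure (baseDelayMs, 2x, 4x, ...).
// Hypothetical helper, not part of google-play-scraper.
async function withBackoff(request, { retries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await request();
    } catch (err) {
      if (attempt >= retries) throw err; // give up after the last retry
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}

// Possible usage (not executed here):
//   const apps = await withBackoff(() => gplay.list({ num: 50, fullDetail: true }),
//                                  { retries: 5, baseDelayMs: 60000 });
```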

Here is a sample of the JSON object returned when using the full detail option in list:

gplay
  .list({
    category: gplay.category.GAME_ACTION,
    collection: gplay.collection.TOP_FREE,
    num: 2,
    fullDetail: true,
  })
  .then(console.log, console.log);
Nishants-MacBook-Air:scripts nishantaggarwal$ node app-list.js 
[
  {
    title: 'Where is He: Hide and Seek',
    description: 'Are you ready for a whole new experience being a Dad?\n' +
      '\n' +
      'I AM A DAD AND YOU ARE MY ONE AND ONLY 👶\n' +
      '\n' +
      'In  Where is He: Hide and Seek , you will take on the role of a father with a very mischievous child, and your goal will be to watch after and safeguard the infant from any threats. However, your naughty will hide everywhere he can\n' +
      '\n' +
      '🍼 HOW TO PLAY 🍼\n' +
      '- Play as daddy and find or babi have to hide\n' +
      '- Hide sneakily or you will be found\n' +
      '- Seek every suspicious place to find children\n' +
      '\n' +
      '🍼 GAME FEATURE 🍼\n' +
      '- Never End Level\n' +
      '- Various of Map for you to explore\n' +
      '- Stunning graphic and smooth movement\n' +
      '- Various cool skins\n' +
      '\n' +
      'Gotta go to find my babi in  Where is He: Hide and Seek \n' +
      '\n' +
      'DOWNLOAD NOW',
    descriptionHTML: 'Are you ready for a whole new experience being a Dad?<br><br>I AM A DAD AND YOU ARE MY ONE AND ONLY 👶<br><br>In <b> Where is He: Hide and Seek </b>, you will take on the role of a father with a very mischievous child, and your goal will be to watch after and safeguard the infant from any threats. However, your naughty will hide everywhere he can<br><br>🍼 HOW TO PLAY 🍼<br>- Play as daddy and find or babi have to hide<br>- Hide sneakily or you will be found<br>- Seek every suspicious place to find children<br><br>🍼 GAME FEATURE 🍼<br>- Never End Level<br>- Various of Map for you to explore<br>- Stunning graphic and smooth movement<br>- Various cool skins<br><br>Gotta go to find my babi in <b> Where is He: Hide and Seek </b><br><br>DOWNLOAD NOW',
    summary: 'Ready for being daddy and take care your naughty newborn',
    installs: '1,000,000+',
    minInstalls: 1000000,
    maxInstalls: 1786612,
    score: 4.55,
    scoreText: '4.6',
    ratings: 2215,
    reviews: 35,
    histogram: { '1': 155, '2': 65, '3': 40, '4': 85, '5': 1860 },
    price: 0,
    free: true,
    currency: 'USD',
    priceText: 'Free',
    available: true,
    offersIAP: true,
    IAPRange: '$0.99 per item',
    androidVersion: '5.1',
    androidVersionText: '5.1',
    developer: 'Windy Game',
    developerId: 'Windy+Game',
    developerEmail: 'windygame.customerservices@gmail.com',
    developerWebsite: 'http://windygame.online/app-ads.txt',
    developerAddress: undefined,
    privacyPolicy: 'http://windygame.online/windygame-privacy-policy.html',
    developerInternalID: 'Windy+Game',
    genre: 'Action',
    genreId: 'GAME_ACTION',
    familyGenre: undefined,
    familyGenreId: undefined,
    icon: 'https://play-lh.googleusercontent.com/JHCfc662aSRGnA0bVV1fpLr0mqq4Dk9ZGWxi-cF8e0puuAWQksOR2BjMr9UX70Mncsc',
    headerImage: 'https://play-lh.googleusercontent.com/mtDwzCtHUyhptRnvaX5odhmP5oNOkyWdGwrGh_UkWwcSuvntPXGHydyTY_Rs3cZ8EH9-',
    screenshots: [
      'https://play-lh.googleusercontent.com/IR-giyL8w4ljo7W-q4XprhEsaigezebvlsJC5D_qy1lpIm2NJjTKRgSKtNAAL7ctdsgI',
      'https://play-lh.googleusercontent.com/m7zk-otw_wV_VgVRjhOkB9JdLbJ4TtZEKyR5IGRy91J6HdeN2_Sg0REquy7KWRVYihk',
      'https://play-lh.googleusercontent.com/WnsULMr4BkAZJ7sPnt79wldKVGArH4k5uhqNgVgZRnofwDp-ZACtsOSTQwHeVeeqoeI',
      'https://play-lh.googleusercontent.com/yTyvULuzAG7UmA8kzcNRILRhicpoIboqhWFToJVvZdYqKvWQZ82bZyJZX1Cg8_YSsQ7D',
      'https://play-lh.googleusercontent.com/Da1lI0QJ0_8MTZ6RcZ48WY1WJpu-Djia2S9sCOrf3M5nXSDu4YjHQW2CIGKwT1CAoQ',
      'https://play-lh.googleusercontent.com/DCNvsftU4fh6SiJDaIVSTpoROT6pmpzeU2wL7-N2_wUhlAKMeeJFO28wJ8P1Evnm7A',
      'https://play-lh.googleusercontent.com/zz0B7KIU4Wq0SeTTk-fgDgttqAKG59R0lY3Avn_9oQweQdnjn-xnzio1uKAjK_qKpZQ',
      'https://play-lh.googleusercontent.com/4rqYLwLg9t_H9OuGcPbxQ5HFsEjPv7NbLwa6F875LYGDMZKsoNE3TedXI5714GkiD1s',
      'https://play-lh.googleusercontent.com/8xKz4Xp56t7l658OKN6I0W2DD7euTWFwNUlTd0Os7qC5q4UxUOzsAMbr7oiwrWqiAJ4',
      'https://play-lh.googleusercontent.com/hyklzEHwYS7ipw0oEbvj_UcrmyW5VSCv4MTN_QfI9qs2jKkerCTcO0ha_Sb4f8svid_x',
      'https://play-lh.googleusercontent.com/kti53ARlARN1_ZimD1_dP20XyBZsMjJ__ZPCA2fPNtqP1BjijAat9D8Gxxn60xayjEWr',
      'https://play-lh.googleusercontent.com/fw9YbXkB-0mCTTOABNn2VF1JwX5RKJoAMaLAZxQgZoupegOSQgN_CwXSCfdXSeLb7JY',
      'https://play-lh.googleusercontent.com/yHikt6zNA7MdFAZx7B7LWacT1zk3R5lXyiSjasJsFcaQEqC8EIyNJUSyn9CQD4gscw',
      'https://play-lh.googleusercontent.com/69MlSkItDvbz-9cD3VaNgkJs_vJ9bxvd5OZyjgAG7mLF7gJMwJ-Ug6UI_RSSfDHJmUo',
      'https://play-lh.googleusercontent.com/g1PFGpZoR-icOscv1Uy0sdRj06z7ZatZAWH2JJtCg66t-VNIzEGrKngRUYOOLbfywA',
      'https://play-lh.googleusercontent.com/08nytGI6bHJQk1BnrIBIyzc0SPtZ05dQHBbistLmrxwG_4wTtLQzjpdjULbfyCliJJg',
      'https://play-lh.googleusercontent.com/lEM5OKNOQcC7WbhjTMMnavf8ZO6MuhU9DrpcQISnnnREIYobh6O-fsDwTbkkvCqZ30g',
      'https://play-lh.googleusercontent.com/G2gYgx-fAd6b4TnHjMHC1AMsxveNri9-UnJ3qqj0Li4gQUKNxrjmKDs1Cv2t7IvwzjJS',
      'https://play-lh.googleusercontent.com/ItJsvwNiJ4loxwRbWTD9uPFJcxwZZtKPF-FaEyIFXSHLujx-xctNIesemykOqSaiLaw',
      'https://play-lh.googleusercontent.com/18BMgaU9gJoTVmjKTDroSoo-pnF4PdNXY2cab87qhjeRknBV2PpqfecZXphiVB1JLA',
      'https://play-lh.googleusercontent.com/-ofWSfzbV1Ze60CwLOpUPbEiEczm6aXXJGyqH-AcoFBq6uMtpxlM3ebuOjN896wu2Fw'
    ],
    video: undefined,
    videoImage: undefined,
    contentRating: 'Everyone',
    contentRatingDescription: undefined,
    adSupported: true,
    released: 'May 8, 2023',
    updated: 1685416656000,
    version: '1.0.7',
    recentChanges: '+ Optimize performance<br>+ Fix bugs',
    comments: [ undefined, undefined, undefined, undefined, undefined ],
    appId: 'com.htt.baby.hunt.hn',
    url: 'https://play.google.com/store/apps/details?id=com.htt.baby.hunt.hn&hl=en&gl=us'
  },
  {
    title: 'Tomb of the Mask',
    description: "Tomb of the Mask is an arcade game with an infinite procedurally generated vertical labyrinth. Seeking for adventure you get into a tomb where you find a strange mask. You put it on and suddenly realize that you can now climb walls - easily and promptly. And that's when all the fun begins. \n" +
      '\n' +
      "You'll face a variety of traps, enemies, game mechanics and power-ups. And as far as time doesn't wait, get a grip and up you go.",
    descriptionHTML: 'Tomb of the Mask is an arcade game with an infinite procedurally generated vertical labyrinth. Seeking for adventure you get into a tomb where you find a strange mask. You put it on and suddenly realize that you can now climb walls - easily and promptly. And that&#39;s when all the fun begins. <br><br>You&#39;ll face a variety of traps, enemies, game mechanics and power-ups. And as far as time doesn&#39;t wait, get a grip and up you go.',
    summary: 'Addictive tomb labyrinth. Time limit and lava are advancing – go only up!',
    installs: '100,000,000+',
    minInstalls: 100000000,
    maxInstalls: 311920071,
    score: 4.540936,
    scoreText: '4.5',
    ratings: 2152369,
    reviews: 74228,
    histogram: { '1': 103257, '2': 42753, '3': 73602, '4': 299472, '5': 1633230 },
    price: 0,
    free: true,
    currency: 'USD',
    priceText: 'Free',
    available: true,
    offersIAP: true,
    IAPRange: '$0.99 - $99.99 per item',
    androidVersion: '5.0',
    androidVersionText: '5.0',
    developer: 'Playgendary Limited',
    developerId: '4614678246860437532',
    developerEmail: 'support@playgendary.com',
    developerWebsite: 'http://www.playgendary.com/',
    developerAddress: '94 Amathountos Avenue street\n' +
      'Agios Tychonas\n' +
      '4532, Limassol, Cyprus\n' +
      'support@playgendary.com',
    privacyPolicy: 'http://www.playgendary.com/privacy-policy/',
    developerInternalID: '4614678246860437532',
    genre: 'Action',
    genreId: 'GAME_ACTION',
    familyGenre: undefined,
    familyGenreId: undefined,
    icon: 'https://play-lh.googleusercontent.com/V3M-ZVyu4NCOR8oxeHEQSrt5LJhu44xlDtIx-d3YnrPCMEvV-lxPJDyWoJkxsS1Soxsx',
    headerImage: 'https://play-lh.googleusercontent.com/7zBCwa4UBRUhD5SXPS0O1wpwswZ08YCC3kVlENg9GwLbdPOsbfGpybPmkSdSzjf6lA',
    screenshots: [
      'https://play-lh.googleusercontent.com/S6E7pqnJgkWlW089zax3hqFvGvNe3K-5jsXA81IxXjmRVdGakZ46zFVPXrs1NSbocC0',
      'https://play-lh.googleusercontent.com/7ofrj4hLqlA4cM5YRC-l0jKbwP_DcxDywNigg_i7L0OeHwORqzHm6Rr-M-c-nBxGlso',
      'https://play-lh.googleusercontent.com/Ye66yCj01S0yOUHdn6FJDMpeqVdfIfhV316sILOcKVbCLeEFRa-j9xRp1xMznuVfJw',
      'https://play-lh.googleusercontent.com/io7qroKcKtU1NKHzLASew0LQDADtmoyAms606bLqMsHQBO0Yq4JxhgHrXfN1kuf6q4Y',
      'https://play-lh.googleusercontent.com/-9b_yud4ImUzSXJwUL-C8seve85rES_Z5JkbzgWsqK2fN0mFoEuLLpLL_ham0VozcwY',
      'https://play-lh.googleusercontent.com/BhK0VLswnx8Ab7Uzh39MjQDO_1bUnMtUeV8IWMSyD42EASpAf6pXbA8OyU88PXiftnU',
      'https://play-lh.googleusercontent.com/mqUy9RWgNdaddsrrZZMAPlwmfg55A1YpZqdVj4fqeonZ3iOc4XP73r5JOsBKE1cP6g',
      'https://play-lh.googleusercontent.com/gFqm2C0Eaf-m_su6_a9aDdyrAiNUcyaASzc9FWQLyvzJtxcUH_-bhAb4Kxbatlk4nu8',
      'https://play-lh.googleusercontent.com/6T63FrVJFxZpD7RrZLYAa5ZTYd-INOqSOPEetHsVdaonInEICJiSj_pXNcKmx91s0R0',
      'https://play-lh.googleusercontent.com/XWO2BRAnWdB6l4ZJUgHWNXBrukMfszk1fpa27zVWxlcgCJZxnm6DMZf81joG5kz8ibY',
      'https://play-lh.googleusercontent.com/Q2kBFE_DKFp4j47KkHv-U9fmEj8SzuchHiJaoLCqp6oAYgh5WdCJzUrMIGnHHEueEu0',
      'https://play-lh.googleusercontent.com/kcopdNxD0LGRT2aa4oQt6ix4fzsA5nU7jau3lfoaKdjg2o-kBd8PIGcbseiB1hUChUk',
      'https://play-lh.googleusercontent.com/uVldt7cyz4kdtXe1-BXm5N17gNdEBsDztQe7-3Ag-f3bnPzEsm9opKNxiqCXiUcrYaE',
      'https://play-lh.googleusercontent.com/hecEYUnPvzv7nMpiIG53PWhpK1fqNFRVzzl7_MwWeKCUGfHqTQaGMGL-fUM8x8i1kUcl',
      'https://play-lh.googleusercontent.com/uU39X25Nf1wzjoU9Pd9hrPYayF2SRdNgVEqNtK1OUc2a3FLvCC1alog4laGCTu7kmDfI',
      'https://play-lh.googleusercontent.com/y-0Cv8T0U51FYWnGkWziDi7EWwPXeBTcznJbIOoQ9H4oogaM5Q-ATMr_6biuDHkDUrfF',
      'https://play-lh.googleusercontent.com/RDGZMMpWwWHpqflwamlAvlCGCSqiSlT27sOD2Zk75qGJM3uwcFzs6vDlAO5hUbD8zvI',
      'https://play-lh.googleusercontent.com/hGL1MgVGerO2hhjwSSy7qjh5n73Nc7tT7Bkp-kJLUCc3ET4bAX0WHqv1OG0gM9--bQI',
      'https://play-lh.googleusercontent.com/I1MtHc2Onwb4blQX0AfIubbPp0JyZuM0_ppt_KKxo3CXDpMIrILG78viYL2rC-gNE-E',
      'https://play-lh.googleusercontent.com/Vtpvsit0XCfuab0plbH2P43hXOth0zFRNF-27kYgBRERvW4m5QF49aRO6_NOzIM8Lcc',
      'https://play-lh.googleusercontent.com/WnXBMdDL6WFc4NEpNCk2WX_AOdc3lRevdoKFfcL8NtBmJvXkLd3_e9jrevbm5-QmRzDS'
    ],
    video: 'https://www.youtube.com/embed/6MI2KYJwYyk?ps=play&vq=large&rel=0&autohide=1&showinfo=0',
    videoImage: 'https://play-lh.googleusercontent.com/7zBCwa4UBRUhD5SXPS0O1wpwswZ08YCC3kVlENg9GwLbdPOsbfGpybPmkSdSzjf6lA',
    contentRating: 'Everyone',
    contentRatingDescription: undefined,
    adSupported: true,
    released: 'Jun 19, 2018',
    updated: 1684941700000,
    version: '1.10.11',
    recentChanges: 'We are ready to make your game experience even greater! Bugs are fixed and game performance is optimized. Enjoy!<br><br>Our team reads all reviews and always tries to make the game better. Please leave us some feedback if you love what we do and feel free to suggest any improvements.',
    comments: [ undefined, undefined, undefined, undefined, undefined ],
    appId: 'com.playgendary.tom',
    url: 'https://play.google.com/store/apps/details?id=com.playgendary.tom&hl=en&gl=us'
  }
]
SebastianZimmeck commented 1 year ago

So, once we have a list of apps in full detail, we can write a short program that uses an if statement to filter through the list as we desire!

That would be an option.

More generally, though, what matters is that we follow a methodological approach to create the app set. For example, if we want our app set to be representative of the Play Store population of apps, then we need to ensure that we do not download apps from just one category or one developer.

Here are some criteria for selecting apps:

Here is one available list (not sure if that is the one we use because they do not show a date of when their ranking is current and the list has only 500 entries). But there may be other similar lists and services that create rankings. Most will be paid, e.g. I believe this one is, but maybe we can find a free one.

We can also create our own list according to the criteria above (or others), but to my knowledge the Play Store offers no filtering or ordering for, say, the top 1,000 free apps by downloads. So, we would need to come up with a strategy to approximate that.

We could also crawl app metadata for a large set, say, 10,000 or 100,000 apps according to some methodological criteria, and then select a subset from that set (e.g. top 50 from each category).
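As a sketch of the "crawl a large set, then select a subset" idea: group the crawled metadata by category and keep the top N of each group. This assumes each record has the `genreId` and `minInstalls` fields from the full-detail output; ranking by `minInstalls` and the name `topPerCategory` are assumptions for illustration only.

```javascript
// Group apps by genreId and keep the `perCategory` apps with the highest
// minInstalls in each group. Ranking criterion and helper name are hypothetical.
function topPerCategory(apps, perCategory) {
  const groups = new Map();
  for (const app of apps) {
    if (!groups.has(app.genreId)) groups.set(app.genreId, []);
    groups.get(app.genreId).push(app);
  }
  const selected = [];
  for (const group of groups.values()) {
    group.sort((a, b) => b.minInstalls - a.minInstalls);
    selected.push(...group.slice(0, perCategory));
  }
  return selected;
}

const crawled = [
  { appId: 'a', genreId: 'GAME_ACTION', minInstalls: 10 },
  { appId: 'b', genreId: 'GAME_ACTION', minInstalls: 100 },
  { appId: 'c', genreId: 'DATING', minInstalls: 5 },
  { appId: 'd', genreId: 'GAME_ACTION', minInstalls: 50 },
];
console.log(topPerCategory(crawled, 2).map((app) => app.appId)); // [ 'b', 'd', 'c' ]
```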

I would approach this task rather opportunistically, i.e., what is the easiest and quickest to do.

wesley-tan commented 1 year ago

I was wondering whether there is a way to automate downloading the APKs. I was playing around with the JS scraper and this Python one (https://pypi.org/project/play-scraper/). I tried something like the script below, which goes through the categories and takes the top 50 of each. I presume this could be an 'easy' option, but it may not be a good representation of the app store if we want a diversity of categories like lifestyle etc. It seems to me that 'games' occupy a lot of categories.

const gplay = require('google-play-scraper');

const categories = Object.values(gplay.category);
const collection = gplay.collection.TOP_FREE; // Or any other collection you want
const appsPerCategory = 50; // Number of apps to fetch per category

let totalApps = [];

const getApps = async () => {
  for (let i = 0; i < categories.length; i++) {
    // Stop if we already have 1000 apps
    if (totalApps.length >= 1000) break;

    try {
      const apps = await gplay.list({
        category: categories[i],
        collection: collection,
        num: appsPerCategory,
        fullDetail: false
      });

      totalApps = totalApps.concat(apps);

      // If we have more than 1000 apps, remove the excess
      if (totalApps.length > 1000) {
        totalApps = totalApps.slice(0, 1000);
      }
    } catch (err) {
      console.error(`Failed to fetch apps for category ${categories[i]}: `, err);
    }
  }

  return totalApps;
};

getApps().then(apps => console.log(apps));

Moving forward I think one possible approach could be listing the 200 categories we are thinking of and using TOP_FREE? I think this may be a tad 'manual' but it would guarantee that we have a good basket of apps

n-aggarwal commented 1 year ago

apkeep is a CLI tool that provides a way to download APKs. It can be used with a CSV file, as documented in their README, to download a list of apps:

You can either specify a CSV file which lists the apps to download, or an individual app ID. If you specify a CSV file and the app ID is not specified by the first column, you'll have to use the --field option as well. If you have a simple file with one app ID per line, you can just treat it as a CSV with a single field.

The Platform Control Token Dispenser may also help to download Android apps at scale.

wesley-tan commented 1 year ago

This is the setup I am trying (tried writing this in a style that would convert well to the readme.md later on):

Step 1: Choose the Apps to Download

Use a tool like the Google Play Scraper on GitHub to help you with this. It will let you select the top 50 apps from each genre on the Play Store.

Make sure to save the list of selected apps in a CSV file. Each line of the CSV file should contain one app ID.
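A small sketch of producing such a file, assuming the selected apps are objects with an `appId` field as returned by the scraper. The helper name `toAppIdCsv`, the sample IDs, and the output file name in the comment are hypothetical.

```javascript
// Turn a list of apps into CSV text with one unique app ID per line.
// Hypothetical helper; duplicates are dropped, original order is kept.
function toAppIdCsv(apps) {
  const ids = [...new Set(apps.map((app) => app.appId))];
  return ids.map((id) => id + '\n').join('');
}

const csvText = toAppIdCsv([
  { appId: 'com.example.one' },
  { appId: 'com.example.two' },
  { appId: 'com.example.one' },
]);
console.log(JSON.stringify(csvText)); // "com.example.one\ncom.example.two\n"

// To save it (not executed here):
//   require('fs').writeFileSync('apps.csv', csvText);
```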

Step 2: Install Necessary Tools

Before you can start downloading apps, you need to install some necessary tools. In this case, you will need to install apkeep and potentially other tools like Raccoon or Token Dispenser. Here are the steps to install apkeep:

Open a terminal and run

cargo install apkeep 

to install apkeep.

If you decide to use other tools like Raccoon or Token Dispenser, you will need to follow the installation instructions on their respective GitHub pages. (I am still trying to figure out how to use TokenDispenser)

Step 3: Download the Apps

Once you have the list of apps and have installed the necessary tools, you can start downloading the apps. Here is how you can use apkeep to download apps:

apkeep -a <app-id> -d google-play -u '<your-email>@gmail.com' -p <your-password> .

Replace <app-id> with the app ID from your CSV file, and <your-email>@gmail.com and <your-password> with your Google Play Store email and password.

Step 4: Handle Split APKs

If an app has a split APK, download all the APK files for that app. When you run the command to download an app with apkeep, it will tell you if there are additional APK files to download.

apkeep -a <app-id>:arm64-v8a -d google-play -u '<your-email>@gmail.com' -p <your-password>
wesley-tan commented 1 year ago

Using google-play-scraper, I created CSV files for the top 50 apps in the following 20 categories (totalling 1000 apps)

ART_AND_DESIGN AUTO_AND_VEHICLES BOOKS_AND_REFERENCE BUSINESS EDUCATION ENTERTAINMENT FINANCE HEALTH_AND_FITNESS MAPS_AND_NAVIGATION MUSIC_AND_AUDIO NEWS_AND_MAGAZINES PRODUCTIVITY SOCIAL TRAVEL_AND_LOCAL GAME_ACTION GAME_EDUCATIONAL GAME_SIMULATION DATING EVENTS FAMILY

This is the script I used, changing the path (e.g. dating-apps.csv) and the category (e.g. gplay.category.DATING) each time

const gplay = require('google-play-scraper');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;  // you need to install csv-writer package

const csvWriter = createCsvWriter({
    path: 'dating-apps.csv', // change path each time
    header: [
        {id: 'appId', title: 'APP_ID'},
        {id: 'title', title: 'TITLE'},
        {id: 'developer', title: 'DEVELOPER'},
        {id: 'score', title: 'SCORE'},
    ],
});

gplay.list({
    category: gplay.category.DATING,
    collection: gplay.collection.TOP_FREE,
    num: 50
})
.then((apps) => {
    let data = apps.map(app => {
        return {
            appId: app.appId,
            title: app.title,
            developer: app.developer,
            score: app.score
        };
    });

    csvWriter.writeRecords(data)       // returns a promise
    .then(() => {
        console.log('...Done');
    });
});

I uploaded all the files into the apps-csv folder. Apologies if it was bad practice committing directly to main

wesley-tan commented 1 year ago

With regards to automating the process, I am currently facing some problems which I will be working on in the coming week.

Using the Rust and apkeep method, I repeatedly get the following error

error: linking with `cc` failed: exit status: 1
  |
  = note: LC_ALL="C" PATH="/Users/wesleysimeontan/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustl         _$LT$openssl..x509..X509VerifyResult$u20$as$u20$core..fmt..Display$GT$::fmt::hdfec200ca6d9b228 in libopenssl-2b7d39ae4fd85c04.rlib(openssl-2b7d39ae4fd85c04.openssl.ec35ac18-cgu.1.rcgu.o)
          ld: symbol(s) not found for architecture arm64
          clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: could not compile `apkeep` (bin "apkeep") due to previous error
error: failed to compile `apkeep v0.15.0`, intermediate artifacts can be found at `/var/folders/34/xy0sgrqd32qb38yhrd81b1280000gn/T/cargo-installisWUSj

Thus far, I've tried changing the SSL settings (e.g. brew install openssl) and changing the path (unset LC_ALL unset PATH)

Using the Token Dispenser, I get the following error

Traceback (most recent call last):
  File "/Users/wesleysimeontan/platformcontrol-token-dispenser/token_dispenser.py", line 37, in <module>
    with socketserver.TCPServer((HOST, PORT), TokenDispenser) as server:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/socketserver.py", line 456, in __init__
    self.server_bind()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/socketserver.py", line 472, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 48] Address already in use

Thus far, I've tried updating the google-protobuf package and regenerating the googleplay_pb2.py file. I will try different methods but the worst case is simply downloading the apps in the csv files manually through Raccoon, which I have tried.

kasnder commented 1 year ago

Weird that apkeep isn't working. Looks like an error with the openssl library? Maybe raise an issue on the apkeep page with the details of the error?

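As for the token dispenser, the "Address already in use" error means another process, probably an earlier instance, is still bound to the port, so you'd need to kill python: killall python / killall python3. But then, I'm not sure if gplaycli still works. I think that had some problems in the past.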

kasnder commented 1 year ago

Todo: It would be good to iterate over ALL the categories in the selection script, and to make sure to use a fake gmail.

wesley-tan commented 1 year ago

Also, I will document what exactly TOP FREE means (vs. TOP GROSSING or any other category)

wesley-tan commented 1 year ago

Todo: It would be good to iterate over ALL the categories in the selection script, and to make sure to use a fake gmail.

Done, and I have uploaded the updated files. Also, for better accuracy/standardization, I used the top 40 TOP_FREE apps from each category. More info here: https://docs.google.com/document/d/1uWub2g9tzrqejRSY8wpdKuBFjoVzU4aUdosUih5LgH8/edit?usp=sharing

This includes the following categories

const categories = [
  'GAME_ACTION',
  'GAME_ADVENTURE',
  'GAME_ARCADE',
  'GAME_BOARD',
  'GAME_CARD',
  'GAME_CASINO',
  'GAME_CASUAL',
  'GAME_EDUCATIONAL',
  'GAME_MUSIC',
  'GAME_PUZZLE',
  'GAME_RACING',
  'GAME_ROLE_PLAYING',
  'GAME_SIMULATION',
  'GAME_SPORTS',
  'GAME_STRATEGY',
  'GAME_TRIVIA',
  'GAME_WORD',
  'ART_AND_DESIGN',
  'AUTO_AND_VEHICLES',
  'BEAUTY',
  'BOOKS_AND_REFERENCE',
  'BUSINESS',
  'COMICS',
  'COMMUNICATION',
  'DATING',
  'EDUCATION',
  'ENTERTAINMENT',
  'EVENTS',
  'FINANCE',
  'FOOD_AND_DRINK',
  'HEALTH_AND_FITNESS',
  'HOUSE_AND_HOME',
  'LIBRARIES_AND_DEMO',
  'LIFESTYLE',
  'MAPS_AND_NAVIGATION',
  'MEDICAL',
  'MUSIC_AND_AUDIO',
  'NEWS_AND_MAGAZINES',
  'PARENTING',
  'PERSONALIZATION',
  'PHOTOGRAPHY',
  'PRODUCTIVITY',
  'SHOPPING',
  'SOCIAL',
  'SPORTS',
  'TOOLS',
  'TRAVEL_AND_LOCAL',
  'VIDEO_PLAYERS',
  'WEATHER'
];
wesley-tan commented 1 year ago

In the Top Free category, the apps with the most downloads that have a price of $0 make the list.

In the Top Paid category, the apps with the most downloads that have a price greater than $0 make the list.

The Top Grossing category lists the apps with the highest total revenue (that is, price * quantity sold + revenue from in-app purchases).

Given that the Google Play Store lists up to the top 45 TOP_FREE per category, I have compiled the files of the top 40 TOP_FREE from each category

SebastianZimmeck commented 1 year ago

In the Top Paid category, the apps with the most downloads that have a price greater than $0 make the list.

I do not think that it is necessary to create a list for paid apps since we can't get the APKs without paying (at least, through the official channels; to my knowledge there is a return option within a certain time period after buying an app, but that seems a bit dicey to me to do at scale; let me know if you have other information, @wesley-tan).

Given that the Google Play Store lists up to the top 45 TOP_FREE per category, I have compiled the files of the top 40 TOP_FREE from each category

Why not the top 45 from each category then (or whatever number Google has per category; they do not need to be all the same; "whatever size Google's list is" is a legitimate criterion)?

wesley-tan commented 1 year ago

I believe "whatever Google's list is" is actually >45, but only 45 appear on their visual interface when you go to Google Play etc. Thus, I think top 45 works as well! Looking at the past reports, some used numbers like 20, 30 etc. so I don't think it is a huge issue (the number always seems to be somewhat arbitrarily chosen)

And yes, since we are downloading from the Play Store itself (and not just .apk files found online), we should stick to TOP_FREE, because some of the TOP_PAID and TOP_GROSSING apps are paid. My comment above was just meant to clarify the difference between TOP_FREE and TOP_GROSSING.

wesley-tan commented 1 year ago

As a personal note, following the use of apkeep, we could use an .sh file (something like the one below) to iterate through the various CSV files that have been collated. In this case, the script reads the apps-ART_AND_DESIGN.csv file, so we just need to re-run it for each CSV file. (Of course, this could be automated further to go through all the CSV files, each of which represents one category and which I've placed in the apps_csv folder, but I think there may be merit in checking that everything is OK after each CSV file.)

#!/bin/bash
EMAIL='<your-email>@gmail.com'  # replace with your email
PASSWORD='<your-password>'  # replace with your password

# Read the CSV file line by line, splitting on commas
# (a title containing a comma shifts the later fields, but APP_ID stays correct)
while IFS=',' read -r APP_ID TITLE DEVELOPER SCORE
do
    # Skip the header
    if [ "$APP_ID" != "APP_ID" ]
    then
        echo "Downloading $TITLE..."
        # Quote the variables so special characters are not split by the shell
        apkeep -a "$APP_ID" -d google-play -u "$EMAIL" -p "$PASSWORD" .
    fi
done < apps-ART_AND_DESIGN.csv # replace the csv file
kasnder commented 1 year ago

Looks good to me. Looking forward to discussing!

wesley-tan commented 1 year ago

Just had a call with Nishant and the apkeep problem is fixed now! (It turned out to be a problem with Rust having been installed twice, and likely some problem with the path and SSL.)

I've also got the .sh to run and we can indeed automate the process of downloading the .apk! I tried with a few apps and categories. Namely, I used the ART_AND_DESIGN apps and ran this particular script

#!/bin/bash
EMAIL='<email>@gmail.com'  # replace with your email
PASSWORD='<password>'  # replace with your password

# Read the CSV file line by line, splitting on commas
# (a title containing a comma shifts the later fields, but APP_ID stays correct)
while IFS=',' read -r APP_ID TITLE DEVELOPER SCORE
do
    # Skip the header
    if [ "$APP_ID" != "APP_ID" ]
    then
        echo "Downloading App with ID $APP_ID, titled: $TITLE..."
        # Quote the variables so special characters are not split by the shell
        apkeep -a "$APP_ID" -d google-play -u "$EMAIL" -p "$PASSWORD" .
    fi
done < apps-ART_AND_DESIGN.csv # replace the csv file

This downloads in the same folder as your .sh file. I will place the .sh file in our apps_csv folder
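The per-category run above could be wrapped so that every CSV in the apps_csv folder is processed in turn, with each category's downloads going into its own folder. The following is a minimal sketch under stated assumptions, not the script actually used here: the function names `download_category` and `download_all_categories`, the `apps-<CATEGORY>.csv` naming, and the output layout are hypothetical, and the apkeep flags are simply the ones already shown in this thread.

```bash
#!/bin/bash
# Sketch only: function names and folder layout are hypothetical.
# EMAIL and PASSWORD are expected to be set by the caller.

# Download every app listed in one CSV (first column: APP_ID) into out_dir.
download_category() {
    local csv="$1" out_dir="$2" app_id rest
    mkdir -p "$out_dir"
    # Read the CSV on fd 3 so that a child process reading stdin cannot consume it
    while IFS=',' read -r app_id rest <&3 || [ -n "$app_id" ]; do
        # Skip the header row and blank lines
        if [ "$app_id" = "APP_ID" ] || [ -z "$app_id" ]; then continue; fi
        echo "Downloading $app_id into $out_dir..."
        apkeep -a "$app_id" -d google-play -u "$EMAIL" -p "$PASSWORD" "$out_dir" \
            || echo "FAILED: $app_id" >&2
    done 3< "$csv"
}

# Process every apps-<CATEGORY>.csv in csv_dir, one output folder per category.
download_all_categories() {
    local csv_dir="$1" out_root="$2" csv category
    for csv in "$csv_dir"/apps-*.csv; do
        [ -e "$csv" ] || continue           # no matching files
        category=$(basename "$csv" .csv)    # e.g. apps-ART_AND_DESIGN
        category=${category#apps-}          # e.g. ART_AND_DESIGN
        download_category "$csv" "$out_root/$category"
    done
}
```

Calling `download_all_categories apps_csv downloads` would then walk all categories. Failed downloads are only logged to stderr, so the output of each category can still be checked by hand before moving on.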

Screenshot 2023-06-26 at 10 22 34 PM

(Note: I terminated the process after the top 2 apps were downloaded, but the same principle should work with 40 or whatever number we choose.)
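Limiting each category to the top N apps could be done while reading the CSV instead of interrupting the run. This is a minimal sketch, assuming the rows are already ordered by rank and the first column holds the app ID; `top_app_ids` is a hypothetical helper name.

```bash
#!/bin/bash
# Sketch only: top_app_ids is a hypothetical helper, not part of apkeep.

# Print the app IDs from the first N data rows of a CSV (header skipped).
top_app_ids() {
    local csv="$1" limit="$2" count=0 app_id rest
    while IFS=',' read -r app_id rest || [ -n "$app_id" ]; do
        # Skip the header row and blank lines
        if [ "$app_id" = "APP_ID" ] || [ -z "$app_id" ]; then continue; fi
        # Stop once the requested number of IDs has been printed
        [ "$count" -lt "$limit" ] || break
        echo "$app_id"
        count=$((count + 1))
    done < "$csv"
}
```

For example, `top_app_ids apps-ART_AND_DESIGN.csv 40` would print the 40 app IDs to feed into the download loop.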

Essentially, we can start easily downloading the apks we want :) We can discuss further tomorrow but this issue can be closed soon

n-aggarwal commented 1 year ago

@wesley-tan unfortunately the apks you provided me are not working. Most of them are crashing on the Pixel 6A. I have attached a few screenshots and screen recordings below showing the issue:

Screen Recordings: https://github.com/privacy-tech-lab/gpc-android/assets/121606501/cd3c9df6-4d74-47bb-bcaa-901c5d001568 https://github.com/privacy-tech-lab/gpc-android/assets/121606501/2518fba4-b8c4-4cb4-9b87-441e6bb25f83

wesley-tan commented 1 year ago

Oh no, I will try to figure out why this is the case asap. The apks seem to work on the Android Studio emulator. I will try with my phone, but could it be an issue with the rooted phone or its settings? I will try the same thing with the xapks as well.

https://github.com/privacy-tech-lab/gpc-android/assets/98197696/7d93821a-745c-42a0-abd7-6de09c4098ae

SebastianZimmeck commented 1 year ago

@wesley-tan, please make sure that the APKs work for a Pixel 6a, which can be easily tested on an emulator.

@n-aggarwal, what is the Android version you have installed?

n-aggarwal commented 1 year ago

I have Android 13 installed on the Pixel 6a.

kasnder commented 1 year ago

Make sure we download complete split apps, including architecture for arm64.

kasnder commented 1 year ago

It seems that apkeep has a split_apk option: when set to 1 or true, this attempts to download a split apk.

This can be used like so:

apkeep -a hk.easyvan.app.client -d google-play -o split_apk=true -u 'someone@gmail.com' -p somepass .

https://github.com/EFForg/apkeep/blob/master/USAGE-google-play.md

SebastianZimmeck commented 1 year ago

Any news @wesley-tan?

wesley-tan commented 1 year ago

I tried using the method above

# Read the CSV file
while IFS=$',' read -r APP_ID TITLE DEVELOPER SCORE
do
    # Skip the header
    if [ "$APP_ID" != "APP_ID" ]
    then
        echo "Downloading App with ID $APP_ID, titled: $TITLE..."
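        # Note: the Google Play options are commented out here, so this run
        # uses apkeep's default download source rather than Google Play
        apkeep -a "$APP_ID" . #-d google-play -o split_apk=true -u "$EMAIL" -p "$PASSWORD" .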
    fi
done < apps-EDUCATION.csv # replace the csv file

This downloads a mixture of .apk and .xapk

Screenshot 2023-07-06 at 9 24 58 PM

Strangely, when I open the xapk files, (after renaming them to .zip), there is no .obb file

Screenshot 2023-07-06 at 9 25 26 PM
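An XAPK is a ZIP archive, so its contents can be listed without renaming it. The sketch below summarizes a listing of entry names (for example from `unzip -Z1`) by counting APK and OBB entries; `summarize_xapk_entries` is a hypothetical helper name. Not every app ships expansion (.obb) files, so a count of zero is not necessarily an error.

```bash
#!/bin/bash
# Sketch only: summarize_xapk_entries is a hypothetical helper.

# Read archive entry names on stdin (one per line) and count .apk and .obb entries.
summarize_xapk_entries() {
    local apks=0 obbs=0 entry
    while IFS= read -r entry; do
        case "$entry" in
            *.apk) apks=$((apks + 1)) ;;
            *.obb) obbs=$((obbs + 1)) ;;
        esac
    done
    echo "apk=$apks obb=$obbs"
}

# Example (the file name is a placeholder):
#   unzip -Z1 some-app.xapk | summarize_xapk_entries
```

An archive with several .apk entries and no .obb is probably a bundle of split APKs, which would need `adb install-multiple` rather than a plain install.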

I am able to open the .apk files fine with the API 33 Google Pixel 6a. I have uploaded 10+ apps that worked to the Google Drive folder (apps-that-work-6jul).

I'm still figuring out why this is the case and how to fix this, but the apps that are downloaded are evidently under arm architecture

Screenshot 2023-07-06 at 9 29 03 PM
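The architecture check could also be scripted instead of read off a screenshot. Native libraries sit under `lib/<abi>/` inside an APK, so listing the entries shows which ABIs a file supports. This is a minimal sketch; `list_native_abis` and `supports_arm64` are hypothetical helper names, and treating an APK without native libraries as compatible is an assumption of the sketch.

```bash
#!/bin/bash
# Sketch only: helper names are hypothetical.

# Read ZIP entry names on stdin and print the ABIs that have native libraries.
list_native_abis() {
    # Entries look like lib/arm64-v8a/libfoo.so; field 2 is the ABI name
    grep '^lib/[^/]*/.*\.so$' | cut -d/ -f2 | sort -u
}

# Succeed if the listing on stdin contains arm64-v8a libraries, or no native
# libraries at all (an APK without native code is not tied to an ABI).
supports_arm64() {
    local abis
    abis=$(list_native_abis)
    [ -z "$abis" ] || printf '%s\n' "$abis" | grep -qx 'arm64-v8a'
}

# Example (the file name is a placeholder):
#   unzip -Z1 some-app.apk | supports_arm64 && echo "arm64 OK"
```

For split APKs the native libraries typically live in a separate configuration split, so the check would need to run over every APK of the app, not only the base APK. The ABI of the target device can be read with `adb shell getprop ro.product.cpu.abi`.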

I'm going to try the APKPure methods and the other methods documented in the google-play-scraper which download apps from google-play

I tried using other methods like the Split App Installer (SAI) too (https://play.google.com/store/apps/details?id=com.aefyr.sai&hl=en&gl=US)

SebastianZimmeck commented 1 year ago

OK, I do not know the details of the downloaders we use, but with Raccoon you can specify a particular device and all APKs, XAPKs, etc. are compatible with it (because the Google Play Store thinks that this particular device is performing the download). So, maybe try Raccoon (unless there is a reason speaking against it).

Raccoon may even work out of the box.

Android devices can only download apps they are compatible with. Play filters server side. Usually this is not a problem, since Raccoon mimics a high-end smartphone that almost every developer wants to support.

wesley-tan commented 1 year ago

So, there are two solutions. I'm still checking through whether it works for most apps but so far it works for the few apps (about 10 per method) that I tried.

Method 1: Raccoon

This involves searching for the application in Raccoon itself. Using the app ID found in the .csv files, we simply search for it in Raccoon and pick the appropriate app. This is slow because the process is manual: I have yet to find a way to automate the Raccoon process, and it can be confusing to pick out the app you want, since Raccoon doesn't really have a way to search by the app ID itself (it does a more generic search, similar to a search engine).

Screenshot 2023-07-09 at 5 10 18 PM

Then, I used the following code to process the split apks in the Raccoon folder:

#!/bin/bash

# Directory containing folders with APK files
APK_PARENT_DIR="/Users/wesleysimeontan/Raccoon/content/apps"

# Device ID
DEVICE_ID="emulator-5554"

# Function to install split APKs
install_split_apks() {
    local apk_dir="$1"
    echo "Installing APKs from $apk_dir..."

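    # Find all split APK files within the directory and install them with a
    # single adb call; -exec ... {} + keeps paths with spaces intact and
    # skips the call entirely when no APK is found
    find "$APK_PARENT_DIR/$apk_dir" -name "*.apk" \
        -exec adb -s "$DEVICE_ID" install-multiple {} +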
}

# Ensure ADB is connected and can see device
adb devices

# Main script execution

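# Install each set of split APKs, one subdirectory of the parent directory per app
# (globbing instead of parsing ls output keeps paths with spaces intact)
for apk_dir in "$APK_PARENT_DIR"/*/; do
    # Skip the unexpanded pattern when there are no subdirectories
    [ -d "$apk_dir" ] || continue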
    # Extract the directory name from the path
    apk_dir=$(basename "$apk_dir")

    # Install the APKs
    install_split_apks "$apk_dir"
done

Basically, this installs the apks into the phone. As of typing this, I tried about 10 apps including Duolinguo and this seems to work.
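Whether the installs actually landed on the device could be checked against the package list that `adb shell pm list packages` prints (one `package:<id>` line per app). The following is a minimal sketch; `missing_packages` is a hypothetical helper name and `com.example.app` is a placeholder ID.

```bash
#!/bin/bash
# Sketch only: missing_packages is a hypothetical helper.

# Read `pm list packages` output on stdin and print each expected package ID
# (given as arguments) that does not appear in it.
missing_packages() {
    local installed pkg
    # Strip carriage returns that adb may emit, then the "package:" prefix
    installed=$(tr -d '\r' | sed 's/^package://')
    for pkg in "$@"; do
        # -x matches whole lines only, -F treats dots in the ID literally
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || echo "$pkg"
    done
}

# Example:
#   adb -s emulator-5554 shell pm list packages | missing_packages com.example.app
```

An empty result means every expected package is installed; it says nothing about whether the app starts without crashing.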

Method 2: Working with the xapk files


#!/bin/bash

# Directory containing XAPK files
XAPK_DIR="/Users/wesleysimeontan/Desktop/ptl-google-play-scraper/install-apps-here/apps-9-jul"

# Function to install an XAPK
install_xapk() {
    local xapk="$1"
    local device_id="emulator-5554"
    echo "Installing $xapk on device $device_id..."

    # Unpack the archive into a temporary directory
    local temp_dir
    temp_dir=$(mktemp -d -t xapk-XXXXXX)
    unzip "$XAPK_DIR/$xapk" -d "$temp_dir"

    # Install every APK from the archive on the target device in a single adb call
    find "$temp_dir" -name "*.apk" -exec adb -s "$device_id" install-multiple {} +
    rm -rf "$temp_dir"
}

# Main script execution

# Ensure ADB is connected and can see device
adb devices

# Find all XAPK files in the XAPK directory
for xapk in "$XAPK_DIR"/*.xapk; do
    # Skip the unexpanded pattern when there are no XAPK files
    [ -e "$xapk" ] || continue
    # Extract the filename from the path
    xapk=$(basename "$xapk")

    # Install the XAPK
    install_xapk "$xapk"
done

This seems to work for the few apps I downloaded, but I think Method 1 may be more 'reliable' because it guarantees that the download is for the arm architecture. I think the previous downloads may not have worked because of a difference in architecture.

Will continue to try more of the applications
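Since the downloads can be a mixture of single APKs, XAPK archives, and folders of split APKs, a single installer could dispatch on the kind of each entry. This is a minimal sketch that combines the two methods above; the helper names `classify_app_path` and `install_app_path` are hypothetical, and the default device ID is an assumption.

```bash
#!/bin/bash
# Sketch only: helper names are hypothetical; the default device ID is an assumption.
DEVICE_ID="${DEVICE_ID:-emulator-5554}"

# Classify a path as "split" (folder of APKs), "xapk", "single" or "skip".
classify_app_path() {
    local path="$1"
    if [ -d "$path" ]; then
        echo split
    else
        case "$path" in
            *.xapk) echo xapk ;;
            *.apk)  echo single ;;
            *)      echo skip ;;
        esac
    fi
}

# Install one entry according to its kind.
install_app_path() {
    local path="$1" tmp
    case "$(classify_app_path "$path")" in
        single)
            adb -s "$DEVICE_ID" install "$path" ;;
        split)
            find "$path" -name '*.apk' -exec adb -s "$DEVICE_ID" install-multiple {} + ;;
        xapk)
            # Unpack to a temporary folder, install all contained APKs, clean up
            tmp=$(mktemp -d) || return 1
            unzip -q "$path" -d "$tmp" &&
                find "$tmp" -name '*.apk' -exec adb -s "$DEVICE_ID" install-multiple {} +
            rm -rf "$tmp" ;;
        *)
            echo "Skipping $path" >&2 ;;
    esac
}
```

A loop such as `for entry in downloads/*; do install_app_path "$entry"; done` would then handle a folder containing all three kinds.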

wesley-tan commented 1 year ago

Have found a solution:

First, I used this as my play-store-downloader.sh

#!/bin/bash
EMAIL='<email>'  # replace with your email
PASSWORD='<password>'  # replace with your password

# Read the CSV file
while IFS=$',' read -r APP_ID TITLE DEVELOPER SCORE
do
    # Skip the header
    if [ "$APP_ID" != "APP_ID" ]
    then
        echo "Downloading App with ID $APP_ID, titled: $TITLE..."
        apkeep -a "$APP_ID" -d google-play -o split_apk=1 -u "$EMAIL" -p "$PASSWORD" .
    fi
done < apps-AUTO_AND_VEHICLES.csv # replace the csv file

This downloaded a mixture of split apk and apk files:

https://github.com/privacy-tech-lab/gpc-android/assets/98197696/a2f534f3-fae4-454f-a46e-3fb1c56e45b0

(basically, some apps are downloaded as a single apk and some as split apks)

Then, I use install-split-apk.sh

#!/bin/bash

# Directory containing folders with APK files
APK_PARENT_DIR="/Users/wesleysimeontan/Desktop/ptl-google-play-scraper/install-apps-here/test-folder-workflow"

# Device ID
DEVICE_ID="emulator-5554"

# Function to install split APKs
install_split_apks() {
    local apk_dir="$1"
    echo "Installing APKs from $apk_dir..."

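    # Find all split APK files within the directory and install them with a
    # single adb call; -exec ... {} + keeps paths with spaces intact and
    # skips the call entirely when no APK is found
    find "$APK_PARENT_DIR/$apk_dir" -name "*.apk" \
        -exec adb -s "$DEVICE_ID" install-multiple {} +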
}

# Ensure ADB is connected and can see your device
adb devices

# Main script execution

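# Install each set of split APKs, one subdirectory of the parent directory per app
# (globbing instead of parsing ls output keeps paths with spaces intact)
for apk_dir in "$APK_PARENT_DIR"/*/; do
    # Skip the unexpanded pattern when there are no subdirectories
    [ -d "$apk_dir" ] || continue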
    # Extract the directory name from the path
    apk_dir=$(basename "$apk_dir")

    # Install the APKs
    install_split_apks "$apk_dir"
done

It will then show something like this, and adb installs the apk files on my Android virtual device.

Screenshot 2023-07-11 at 12 14 34 AM

(both the apps downloaded as split apks and those downloaded as a single apk were installed)

All could be successfully opened (tried with another category as well)
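Opening each app by hand could be approximated with a scripted smoke test: start the launcher activity with the `monkey` tool, wait a few seconds, and check that a process for the package is still alive. This is a minimal sketch, assuming `DEVICE_ID` is set and that the main process is named after the package; `smoke_test_app` is a hypothetical helper name, and a live process is only a rough proxy for "opens without crashing".

```bash
#!/bin/bash
# Sketch only: smoke_test_app is a hypothetical helper; DEVICE_ID must be set.

# Launch the app's launcher activity, wait, then report whether a process
# with the package name is still running.
smoke_test_app() {
    local pkg="$1" wait_secs="${2:-5}"
    # Send a single launch event restricted to the package's LAUNCHER category
    adb -s "$DEVICE_ID" shell monkey -p "$pkg" \
        -c android.intent.category.LAUNCHER 1 > /dev/null 2>&1
    sleep "$wait_secs"
    if [ -n "$(adb -s "$DEVICE_ID" shell pidof "$pkg" | tr -d '\r')" ]; then
        echo "OK: $pkg"
    else
        echo "NOT RUNNING: $pkg"
        return 1
    fi
}

# Example:
#   DEVICE_ID=emulator-5554 smoke_test_app com.example.app 10
```

Apps reported as not running would still need a manual look, since some apps exit or move to the background for reasons other than a crash.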

SebastianZimmeck commented 1 year ago

OK, @wesley-tan, can you pass 15 or so apps to @n-aggarwal for testing on a real device?

wesley-tan commented 1 year ago

https://drive.google.com/drive/folders/1uYn7AnMs2SRPdcuxGV6oWIhhqcKLi7Vi?usp=drive_link

@n-aggarwal the files can be found here, tell me if there are any issues

SebastianZimmeck commented 1 year ago

@wesley-tan will close this issue once @n-aggarwal is able to run the test set apps.

n-aggarwal commented 1 year ago

All the apps in the "apps-that-work" folder work! I have also integrated the install_split_apks function in my script to automate the installation!

wesley-tan commented 1 year ago

> All the apps in the "apps-that-work" folder work! I have also integrated the install_split_apks function in my script to automate the installation!

Great :)) I will start downloading the apps and placing them in the Google Drive. I'll organize them by category and document when and how I downloaded them.

kasnder commented 1 year ago

great news. well done!!

n-aggarwal commented 1 year ago

@SebastianZimmeck do we want to revise the app list as you mentioned in #78? If so, then perhaps @wesley-tan you should wait before downloading the apps!

SebastianZimmeck commented 1 year ago

Good point, @n-aggarwal! I commented there.