privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Set up systematic way to decode GPP strings for the long term #89

Closed · Mattm27 closed 5 months ago

Mattm27 commented 5 months ago

We should begin looking into new ways to decode GPP strings after a crawl. Currently, we have translated the JavaScript code from this IAB GPP Decoder to Python, but every time something changes on their end we must update our Python scripts accordingly.

Also, the encoded and decoded GPP strings should both be visible in the Google sheet containing crawl data.

katehausladen commented 5 months ago

The code from this JS library (more specifically, the node modules after the library is installed) has been converted to Python for decoding in our Colab because (1) the library is intended for use in websites (i.e., we can't directly use their JS library without being connected to a web server) and (2) we need it to be Python to integrate it into Colab notebooks. (Note that I stripped down the files to only contain what is necessary for us to decode GPP strings.)

We would need to update this Python code whenever a new section is added to GPP. This would require looking at the new node modules for the library and adding any code that pertains to the new section to our Python version. Specifically, this would require the following four steps (a simplified sketch of the overall pattern follows the list):

  1. In the JS library, look in node_modules/encoder/field for the field file for the new section. Then, create a file in the Python version that translates that file. This is really simple, as this file just lists the fields for that section. For example:

    [Screenshot: example field file translation]
  2. In the JS library, look in node_modules/encoder/section for the section file for the new section. Then, create a file in the Python version that translates that file. Only the functions found in the other Python section files need to be translated. This is the most involved step, but there are 9 example translations to help with the syntax change.

  3. In the Python version, look for fake_node_mods/encoder/GppModel.py. The new section will need to be imported and added to the decode function.

    [Screenshot: GppModel.py decode function]
  4. In the Python version, look for fake_node_mods/encoder/section/Sections.py. Add the new section.

    [Screenshot: Sections.py]
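
To make the four steps more concrete, here is a simplified, self-contained sketch of the overall pattern. It is not the repo's actual code: every name below (NEW_SECTION_FIELDS, NewSection, decode_gpp, the section ID) is made up, and the real library bit-unpacks each base64url segment rather than returning placeholders.

```python
# Simplified, self-contained illustration of the four-step pattern above.
# None of these names match the actual files in cmp_api_python / fake_node_mods.

# Step 1 analogue: a field module just enumerates the section's fields.
NEW_SECTION_FIELDS = ["Version", "SaleOptOut", "SharingOptOut"]  # made-up names

# Step 2 analogue: a section class knows its ID and how to decode its segment.
class NewSection:
    ID = 99              # hypothetical GPP section ID
    NAME = "newsection"

    def __init__(self, encoded_segment: str):
        self.encoded_segment = encoded_segment

    def decode(self) -> dict:
        # The real section classes bit-unpack the base64url segment; this stub
        # just returns placeholders so the example runs end to end.
        return {field: None for field in NEW_SECTION_FIELDS}

# Step 3/4 analogue: register the class so a model-level decode can find it.
SECTIONS = {NewSection.ID: NewSection}

def decode_gpp(gpp_string: str) -> dict:
    # A GPP string is a "~"-delimited header segment followed by one segment
    # per section. The real GppModel reads the section IDs out of the header;
    # here we simply pair segments with registered sections in order.
    header, *segments = gpp_string.split("~")
    decoded = {"header": header}
    for section_cls, segment in zip(SECTIONS.values(), segments):
        decoded[section_cls.NAME] = section_cls(segment).decode()
    return decoded

print(decode_gpp("DBABLA~BVVqAAEABA"))  # placeholder GPP string
```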

It may be useful to just search for the new section name in the entire cmp_api_python folder to be sure it doesn't occur in any other files (one of the EU sections appears in cmpapi_test.py); a quick way to do this is sketched below. Instructions for how to test cmp_api_python locally on your computer for debugging purposes are already in a readme in the cmp_api_python folder in Google Drive.
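
One way to do that folder-wide check (the section name below is a placeholder, and a plain grep works just as well):

```python
# Report every Python file in the port that already mentions the new section.
# Run from the directory containing cmp_api_python.
import pathlib

needle = "usxxv1"  # placeholder for the new section's name
for path in pathlib.Path("cmp_api_python").rglob("*.py"):
    if needle in path.read_text(errors="ignore").lower():
        print(path)
```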

katehausladen commented 5 months ago

The decoded GPP strings are now in the Crawl Data Google Sheet.

If we don't want to use the Python library, the only other option I could think of is maintaining our own GPP decoding website using the JS library, which would essentially function as an API. This way, we could ping it with the GPP string and get a JSON response of the decoded string. However, this may require more effort than maintaining the Python decoding code.

I did try pinging https://iabgpp.com/# with a GPP string using Python's requests library. There was no JSON response, and the text attribute also didn't contain anything useful to us.
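
For reference, this is roughly what such a check would look like, assuming a hypothetical self-hosted decoding endpoint (the URL, query parameter, and GPP string below are made up; as noted, https://iabgpp.com/# itself does not answer this way):

```python
import requests

gpp_string = "DBABLA~BVVqAAEABA"  # placeholder GPP string

# Hypothetical endpoint for a self-hosted decoder built on the IAB JS library.
resp = requests.get(
    "https://example.org/decode", params={"gpp": gpp_string}, timeout=10
)
try:
    print(resp.json())  # decoded sections, if the service returns JSON
except ValueError:      # .json() raises a ValueError subclass on non-JSON bodies
    print("No JSON in response; body starts with:", resp.text[:200])
```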

SebastianZimmeck commented 5 months ago

> If we don't want to use the Python library, the only other option I could think of is maintaining our own GPP decoding website using the JS library, which would essentially function as an API. This way, we could ping it with the GPP string and get a JSON response of the decoded string. However, this may require more effort than maintaining the Python decoding code.

Good idea! I agree, though, for the time being, we can keep the Python approach.

katehausladen commented 5 months ago

I added a paragraph to the wiki explaining that we use the Python library for decoding. I linked the various IAB resources (the GPP repo, the JS library for encoding/decoding, the decoding/encoding website) and the instructions for how to update the Python library. I think for now this is sufficient, and an alternative decoding method can be discussed if keeping the Python library up to date becomes overwhelming.