privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Perform June crawl #108

Closed SebastianZimmeck closed 3 months ago

SebastianZimmeck commented 5 months ago

@franciscawijaya will perform the crawl (with possible help from @katehausladen).

franciscawijaya commented 3 months ago

The crawl is in progress right now. Things for me to take note of:

franciscawijaya commented 3 months ago

As per our conversation on the call, I compared the results of some sites between the June crawl pt. 1 and the April crawl pt. 1, and they look similar; the crawl is looking great so far! To track progress: I'm currently on the third set, and in the meantime I'm changing the format of all the JSON files to be readable and reading through the Google Colab.
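
For reference, the reformatting step amounts to re-serializing each file with indentation; here is a minimal sketch, in which the directory name and file layout are assumptions rather than the crawler's actual paths:

```python
# A minimal sketch of making crawl output human-readable.
# "crawl_data" is a hypothetical output directory, not the repo's actual layout.
import json
from pathlib import Path

for path in Path("crawl_data").glob("*.json"):
    data = json.loads(path.read_text())
    # Rewrite with indentation and sorted keys so the files (and diffs
    # between crawls) are easy to read.
    path.write_text(json.dumps(data, indent=2, sort_keys=True))
```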

SebastianZimmeck commented 3 months ago

Nice, @franciscawijaya!

SebastianZimmeck commented 3 months ago

@franciscawijaya, can you add the following URL to the end of the crawl list and include it in your crawl going forward?

https://www.washingtonpost.com/

(cc'ing @AramZS)
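
For example, appending the site to the final batch could look like the sketch below; the file name and single-column layout are assumptions about how the crawl-list batches are stored:

```python
# A hedged sketch of appending one site to the last crawl batch.
# "crawl-lists/batch8.csv" and the one-URL-per-row format are assumptions,
# not the repo's confirmed layout.
import csv

with open("crawl-lists/batch8.csv", "a", newline="") as f:
    csv.writer(f).writerow(["https://www.washingtonpost.com/"])
```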

franciscawijaya commented 3 months ago

Added to the 8th batch!

franciscawijaya commented 3 months ago

The last batch of the crawl is now done and everything looks great so far!

Next step: I will now begin to parse and analyze the crawl data, which should be finished by our Thursday meeting at the latest.
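
For context, loading the per-batch output for analysis could look like this minimal sketch, assuming each batch was exported as a JSON array of per-site records; the directory, file names, and fields are placeholders:

```python
# A minimal sketch of collecting all batch output into one DataFrame,
# under the assumption that each batch is a JSON array of per-site dicts.
import json
from pathlib import Path
import pandas as pd

records = []
for path in sorted(Path("crawl_data").glob("batch*.json")):
    records.extend(json.loads(path.read_text()))  # one dict per crawled site

df = pd.DataFrame(records)
print(df.shape)             # total sites crawled across all batches
print(df.columns.tolist())  # fields available for the analysis figures
```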

SebastianZimmeck commented 3 months ago

Excellent! Great news!

franciscawijaya commented 3 months ago

Update: I have transferred all of the crawl data to Google Drive and, now that we have all the data, am starting to collate redo_sites.csv using the Google Colab. However, I'm currently facing an error when running one of the lines of code and have been struggling to figure out where it went wrong. I have reached out to labmates for input and will continue debugging.

franciscawijaya commented 3 months ago

Solved! I'm now running the redo sites (the Google Colab collated 720 sites, excluding subdomains, to be crawled).
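
Roughly, the collation step looks like the sketch below; the error-log file name, its columns, and the naive subdomain check are all assumptions (a robust version would use a public-suffix library such as tldextract rather than counting labels):

```python
# A sketch of collating redo_sites.csv from a crawl error log, assuming the
# log is a CSV with full URLs in a "url" column; the subdomain check here is
# a naive label count, which a robust version would replace with tldextract.
import csv
from urllib.parse import urlparse

redo = set()
with open("error-logging.csv", newline="") as f:  # hypothetical input file
    for row in csv.DictReader(f):
        host = urlparse(row["url"]).hostname or ""
        labels = host.removeprefix("www.").split(".")
        if len(labels) == 2:  # keep example.com, drop sub.example.com
            redo.add(row["url"])

with open("redo_sites.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for url in sorted(redo):  # deduplicated list of sites to re-crawl
        writer.writerow([url])
```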

franciscawijaya commented 3 months ago

An update: I have crawled the redo sites and also tried to run the well-known script. However, I ran into a problem where, once the script finished, only well-known-data.csv was updated and not well-known-errors.csv. I think I found the cause and am now re-running the script. Hopefully I will have both well-known-data.csv and well-known-errors.csv by tomorrow morning and can then start parsing and analyzing to produce all the figures.
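
For reference, here is a hedged sketch of what the well-known collection step does, splitting results between the two CSVs; the input list name and CSV columns are assumptions, though /.well-known/gpc.json itself is the resource the GPC spec defines:

```python
# A hedged sketch of the well-known collection step: request each site's
# /.well-known/gpc.json (the resource defined by the GPC spec) and split
# results into a data CSV and an errors CSV. The input file name and the
# CSV columns are assumptions, not the script's actual interface.
import csv
import requests

sites = [line.strip() for line in open("sites.csv") if line.strip()]

# Opening both writers in one `with` block guarantees well-known-errors.csv
# is created and flushed even if no request fails (or no request succeeds).
with open("well-known-data.csv", "w", newline="") as data_f, \
     open("well-known-errors.csv", "w", newline="") as err_f:
    data_w, err_w = csv.writer(data_f), csv.writer(err_f)
    data_w.writerow(["site", "status", "body"])
    err_w.writerow(["site", "error"])
    for site in sites:
        try:
            r = requests.get(f"{site.rstrip('/')}/.well-known/gpc.json", timeout=10)
            data_w.writerow([site, r.status_code, r.text[:200]])
        except requests.RequestException as e:
            err_w.writerow([site, repr(e)])
```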

SebastianZimmeck commented 3 months ago

As discussed, @franciscawijaya, if you can do the following:

Then, feel free to close this issue.

franciscawijaya commented 3 months ago

As mentioned in the meeting, while I am done with the crawl, I am still working out the parsing and analysis of the data and the creation of the figures, which will be my task for this week. I am closing this issue now since the crawl is done, and I will open a new issue in gpc-web-crawler-paper to post the figures and data once I finish the analysis and figures.