privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Fix duplicate and empty entries when creating Google Sheet #119

Closed franciscawijaya closed 1 week ago

franciscawijaya commented 3 weeks ago

The June crawl collected the following entry for yelp.com, which suggests that it did not successfully collect any data for the site:

    "id": 1, "site_id": 0, "domain": "yelp.com", "sent_gpc": 1,
    "uspapi_before_gpc": null, "uspapi_after_gpc": null,
    "usp_cookies_before_gpc": null, "usp_cookies_after_gpc": null,
    "OptanonConsent_before_gpc": null, "OptanonConsent_after_gpc": null,
    "gpp_before_gpc": null, "gpp_after_gpc": null, "gpp_version": null,
    "urlClassification": "{\"firstParty\":{},\"thirdParty\":{}}",
    "OneTrustWPCCPAGoogleOptOut_before_gpc": null, "OneTrustWPCCPAGoogleOptOut_after_gpc": null,
    "OTGPPConsent_before_gpc": null, "OTGPPConsent_after_gpc": null

For the other sites, the data seems similar or identical to the previous crawl, but we need to recheck in case a similar problem occurred there too.
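As a quick way to flag such cases programmatically, here is a minimal sketch that marks sites where every consent-related field came back null (assuming the crawl results are exported as a JSON array of records like the one above; the file name and helper are hypothetical, not part of our pipeline):

```python
import json

# Fields that should contain data when a crawl of a site succeeds
# (names taken from the record shown above).
CONSENT_FIELDS = [
    "uspapi_before_gpc", "uspapi_after_gpc",
    "usp_cookies_before_gpc", "usp_cookies_after_gpc",
    "OptanonConsent_before_gpc", "OptanonConsent_after_gpc",
    "gpp_before_gpc", "gpp_after_gpc",
]

def no_data_sites(records):
    """Return domains for which every consent-related field is null."""
    return [
        r["domain"] for r in records
        if all(r.get(field) is None for field in CONSENT_FIELDS)
    ]

with open("june_crawl.json") as f:  # hypothetical export file name
    print(no_data_sites(json.load(f)))
```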

franciscawijaya commented 3 weeks ago

Action plan on my end:

  1. Try a crawl exclusively for Yelp.com.
  2. Manually crawl Yelp.com.
  3. Check whether it collects any data for Yelp.com or whether the site is recorded in the error-logging document.
  4. Check whether the crawl collects no data in the same format for any other sites.
  5. Spot-check the crawl for sites with data (check whether their cookie values and other information are similar to the previous crawl).

franciscawijaya commented 3 weeks ago

Update: I believe that I found where the error is coming from.

I tried the crawl on Yelp.com alone and got the same result. Carefully observing the crawl while it ran, I realized the site could not even be accessed in the first place.

Screenshot 2024-06-23 at 4 58 21 PM

This prompted me to manually load our extension .xpi, following the wiki, to see if there is anything wrong with our code. I got the same result:

Screenshot 2024-06-23 at 5 06 10 PM

My initial hypothesis was that the site was currently down. So I checked the site without running the extension or the crawl, and it was still blank. I then checked downdetector-style services, and they all reported that the site was up and running.

Finally, I turned off the VPN, and sure enough, the site loaded. My conclusion is that the site blocked me because it detected the VPN; there seems to be no problem with the code or our crawl in general.

Next: I'm going to double-check other sites for similar cases to see if this is indeed the cause of the problem.

SebastianZimmeck commented 3 weeks ago

Ah! Excellent detective work, @franciscawijaya!

Next: I'm going to double-check other sites for similar cases to see if this is indeed the cause of the problem.

Yes, indeed, that is a good idea. I believe we do not have a dedicated VPN error category. Maybe we should add one for next time if it is possible to pinpoint this error to the VPN. If not, maybe we can flag sites that do not yield any data as "no data" or something similar. That way, we can consider what to do with those in our analysis. It would strike me as rare that there is any site with no data. We could add those to the redo list, and if the second attempt fails, let it be.

For the concrete situation right now, it would be helpful to compare how many sites had no data in the previous crawl(s) compared to the most recent one and which sites (e.g., Yelp worked before but no longer works; site xxx works now but failed previously). If there is only a gradual increase/decrease/change, I would imagine that to be a natural evolution. But especially if the latest crawl has many more of such failures, it would point to some change on the VPN end, on our end, ...
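One way to do that comparison could be a small pandas sketch along the following lines (assuming both crawls are exported as CSVs with the columns shown in the first comment; the file names are placeholders):

```python
import pandas as pd

CONSENT_FIELDS = [
    "uspapi_before_gpc", "uspapi_after_gpc",
    "usp_cookies_before_gpc", "usp_cookies_after_gpc",
    "OptanonConsent_before_gpc", "OptanonConsent_after_gpc",
    "gpp_before_gpc", "gpp_after_gpc",
]

def no_data_domains(path):
    """Set of domains for which every consent-related field is null in a crawl export."""
    df = pd.read_csv(path)
    mask = df[CONSENT_FIELDS].isna().all(axis=1)
    return set(df.loc[mask, "domain"])

april = no_data_domains("april_crawl.csv")  # placeholder file names
june = no_data_domains("june_crawl.csv")

print("no data in June but had data in April:", sorted(june - april))
print("no data in April but has data in June:", sorted(april - june))
```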

It may also be worthwhile to check different Mullvad VPN LA locations. They have several. Maybe the result is different for a different VPN server? If so, does that mean that in our next crawl we should use another VPN location, rotate locations, or do some spot checks before every crawl to find out which LA VPN server to use?

Just some ideas for further exploration, not meant to be done comprehensively. Maybe you have better ones, @franciscawijaya, especially once you learn more about this issue. Generally, it would be helpful to (1) better understand the current data and adjust our analysis accordingly and (2) see if we can change something for future crawls.

franciscawijaya commented 3 weeks ago

It may also be worthwhile to check different Mullvad VPN LA locations

Yes, I checked a different Mullvad VPN LA location, and the crawler finally got a result identical to the April crawl. This seems to confirm that the problem stems from the use of the VPN.

For the concrete situation right now, it would be helpful to compare how many sites had no data in the previous crawl(s) compared to the most recent one and which sites

So far I have manually checked 60 sites that have no data in the June crawl and compared them to our April crawl. Out of those 60 sites, only one has the same predicament as Yelp (i.e., it has valuable data in the April crawl, e.g., GPP strings, but no data in the June crawl); this site is findagrave.com.

Furthermore, while carefully going through these 60 sites, I also realized that there are a number of sites that have consistently had no data collected in the previous crawls, but not due to errors like HumanCheckError, TimeoutError, etc. I particularly suspect this for big-name sites like meta.com, mozilla.com, and GroupMe.com, which probably have strong security measures to detect and fend off users with VPNs.

However, if the error manifests as a blank page, as shown in my previous comment, I'm not sure we can immediately attribute every blank page the crawler encounters to a VPN block. A blank page does not strongly indicate that the error comes from the use of a VPN (unlike errors like HumanCheckError, where the page contains phrases like 'verify that you are a human').
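If we do want a heuristic, a rough sketch could look like the following (assuming a Selenium WebDriver instance as used by the crawler; the phrases and the blank-page threshold are illustrative guesses, not the crawler's actual error categories):

```python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

HUMAN_CHECK_PHRASES = [
    "verify that you are a human",
    "are you a robot",
    "unusual traffic",
]

def classify_loaded_page(driver):
    """Rough heuristic: human-check page, (near-)blank page, or normal page."""
    try:
        body_text = driver.find_element(By.TAG_NAME, "body").text.strip().lower()
    except NoSuchElementException:
        return "blank page (no body element)"
    if any(phrase in body_text for phrase in HUMAN_CHECK_PHRASES):
        return "human check"
    if len(body_text) < 50:  # illustrative threshold for "blank"
        return "blank page (possible VPN block)"
    return "normal"
```

Even with something like this, a blank page would only be a weak signal, so it would probably still need manual confirmation.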

franciscawijaya commented 3 weeks ago

To-do

franciscawijaya commented 2 weeks ago

Update on the repetition: I have been debugging for the past few days to find out where the repetition/loop is coming from, but unfortunately I have yet to find it. Nevertheless, these are some findings that I got from my debugging and printing:

When making the full_df list, we have the correct data (11709 rows, one more than the April crawl, as we added a new site for the June crawl).

Screenshot 2024-06-28 at 7 53 34 PM

After correcting the redo data and making a dict of sites that were not crawlable, we still have the right counts.

Screenshot 2024-06-28 at 8 49 08 PM

So, my suspicion is that the problem is coming from this last column as that is where the calculation ballooned. However, after debugging and printing, I have yet to find the line of code that is causing the error.

Screenshot 2024-06-28 at 8 49 57 PM

I still have one last thing to try: redoing the parsing of the April crawl data on a new tab, April 2024 (2) (to not mess up the already collected and parsed April data), and observing the process and outcome. I might have more things to look into after investigating the result of this experiment; if not, I have asked for Matt's help to get a second pair of eyes in case I missed anything. I have also contacted Kate to ask whether she has encountered this kind of problem before.
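To narrow down where the duplication enters the pipeline, one option is to assert the row count and per-site duplicate count after each transformation step. A sketch (full_df and site_id follow the names used above; the step names in the usage comments are illustrative):

```python
import pandas as pd

EXPECTED_ROWS = 11709  # June crawl: one more site than the April crawl

def check(df: pd.DataFrame, step: str) -> pd.DataFrame:
    """Print row count and duplicated site_id count after a pipeline step."""
    dupes = df["site_id"].duplicated().sum()
    print(f"{step}: {len(df)} rows (expected {EXPECTED_ROWS}), {dupes} duplicated site_ids")
    return df

# Example usage between Colab cells (step names are illustrative):
# full_df = check(build_full_df(), "build full_df")
# full_df = check(apply_redo_corrections(full_df), "apply redo corrections")
# final_df = check(build_final_column(full_df), "build final column")
```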

SebastianZimmeck commented 2 weeks ago

Sounds good! Possibly, @n-aggarwal can help as well.

franciscawijaya commented 2 weeks ago

Update on the duplicates: The duplicates have been removed.

I tried redoing the parsing for the April data, and it did not give me duplicates.

Screenshot 2024-06-30 at 11 16 54 PM

Furthermore, since the duplicates were not present after line 58, it seems that it was not a problem with the data collection but rather something in the Colab itself. Also, given the findings,

these are some findings that I got from my debugging and printing:

I also believe it is coming from the last column of the Colab, since there were no duplicates when the data was being parsed; they only occur when we print and transfer it to the Google Sheet.

Solution: Special thanks and credit to @Mattm27 for suggesting removing the duplicates based on their site_id by adding to our Google Colab code. Our June crawl data is now in the same format as April's.
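For reference, that kind of deduplication step can be as small as a single pandas call; a minimal sketch, assuming the repeated rows are identical copies and the column is named site_id as above (the actual Colab code may differ):

```python
import pandas as pd

def dedupe_by_site_id(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per site_id before the data is written to the Google Sheet."""
    return df.drop_duplicates(subset="site_id", keep="first").reset_index(drop=True)
```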

Screenshot 2024-07-01 at 10 33 45 PM

Next step: I'm now figuring out the best way to collect and carefully compare the April and June results to determine the differences we want to see, because there are three different things we want to look out for.

franciscawijaya commented 1 week ago

An update on my debugging process: Over the past week I have tried to debug primarily using a minimum viable test. I started out by creating my own sample, but halfway through running the test I realized that it is hard to account for the other files that are mounted and used in the Colab (e.g., all the redo and well-known files must match the minimum sample).

With the same testing strategy, I used a different approach: comparing the results for just one batch (namely pt1). This approach worked better because I could compare the results from the two batches by only adding some exception-handling logic in the code (e.g., there was no UnexpectedAlertError, so I just added if statements, etc.).

During the process, I also took the time to read the code line by line. There are references to, and executions of, other files used in processing the analysis data at a deeper level, which I had never used manually before. After reading through these other files referenced and executed in the Colab, I believe the problem does not stem from any of them, as they are simply responsible for the GPP decoding library and the Python interpreter setup.

After this analysis, I decided to focus on printing and debugging in the Google Colab to see exactly where the error comes from. After debugging, I believe I have found where it actually stems from. Nevertheless, while I have identified which block of code is causing it, I'm still not sure which of its five lines causes the duplicates in the June crawl. I'm hoping to continue working on this this week, as I feel I'm getting close to the root cause. Most of the past week's debugging could be done locally on my own, but I am hoping to get a second opinion from the team on this particular part of the code that is causing the duplicates. If we end up not being able to completely fix it by the end of this week, I think we could still go ahead with the July crawl next week, since we at least know that the duplicates do not seem to come from the raw data; they only occur in the later part of the analysis.

Screenshot 2024-07-09 at 3 40 46 AM

Above is where the duplicates first start to appear: each site repeats uniformly three times.

Screenshot 2024-07-09 at 3 45 33 AM

Above is where the final duplicates are eventually printed: each site repeats with arbitrary frequency (i.e., one site could repeat 10 times while another repeats 4 times).
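A quick way to quantify this pattern is to count how many sites repeat how many times at each stage (a sketch; df and site_id follow the names used above):

```python
import pandas as pd

def repetition_profile(df: pd.DataFrame) -> pd.Series:
    """How many sites appear 1x, 2x, 3x, ... at this stage of the pipeline."""
    per_site = df["site_id"].value_counts()       # repetitions per site
    return per_site.value_counts().sort_index()   # number of sites per repetition count
```

A single spike at 3 points to one step being applied three times, while a spread of values points to something later in the pipeline.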

A side update: Matt and I have also divided the data analysis work. While I figure out the problem with parsing, Matt will create the figures by running the data analysis. Although the figures can be created just by running the code, it may be that they will only be created successfully once the duplicates are removed. Hence, should the duplicates remain after I have exhausted all possible methods to find the root cause, a pragmatic approach for creating the June figures would be to remove the duplicates brute-force with the previously written code that drops duplicates with the same site_id.

SebastianZimmeck commented 1 week ago

but I am hoping to get a second opinion from the team on this particular part of the code that is causing the duplicates

@zatchliu, can you communicate with @franciscawijaya and look into that?

franciscawijaya commented 1 week ago

The duplicate problem is resolved!

As discussed in today's meeting, I looked through the well-known data manually and made some important observations:

  1. Comparing the April and June well-known data side by side, they follow the same order from the first line until site number 8204; within that range the order is the same, and only some sites have different outputs for the request status (e.g., 404 or 403).

    Screenshot 2024-07-10 at 2 02 52 AM
  2. There are differences from line 8204 until line 23080, where these lines consist of a copy of the same list of sites in the same order (i.e., the ordered list from abercrombiekent until anchoragepress, and everything in between, appears twice in the June crawl).

    Screenshot 2024-07-10 at 2 03 07 AM Screenshot 2024-07-10 at 2 03 52 AM 1

So, after identifying this duplicated ordered list, we can conclude that it was the root cause of the duplicates. Once I manually removed these duplicate lines, we no longer had duplicates in our Google Sheet.
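For future crawls, this kind of restart could also be detected automatically instead of by eyeballing line numbers. A sketch, assuming the well-known output is a CSV with one row per site and a site_url column (the column and file names are guesses):

```python
import pandas as pd

def find_restarts(path: str, col: str = "site_url"):
    """Return the first few positions where a site reappears, i.e., where the list restarts."""
    df = pd.read_csv(path)
    seen = set()
    restarts = []
    for i, site in enumerate(df[col]):
        if site in seen:
            restarts.append((i, site))
        seen.add(site)
    return restarts[:5]

print(find_restarts("well-known-june.csv"))  # hypothetical file name
```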

Next: finding out what is causing this duplicate in the well-known data:

  1. My current hypothesis is that the duplicates occurred when I had trouble running the Python well-known script and it initially failed. While I eventually resolved the problem (and updated the README regarding how the script should be run), the successful output seems to have been mistakenly saved on top of the failed output, based on how the duplicates appear in the well-known data.

    Screenshot 2024-07-10 at 2 20 32 AM
  2. This is corroborated by the fact that, comparing the duplicate and the actual data, there are sites that initially gave 404 for the request status code (suggesting the server could not find the requested resource) and later gave 'None'. Thus, I removed the duplicates in accordance with this observation (e.g., autodiscover.americanpublicmedia.org).

    Screenshot 2024-07-10 at 2 20 32 AM
Screenshot 2024-07-10 at 2 03 07 AM
SebastianZimmeck commented 1 week ago

The duplicate problem is resolved!

Great!

Thus, I removed the duplicates in accordance with this observation.

@franciscawijaya, how do you know which of multiple entries is the valid one (assuming that they are not identical but, for example, have a different response code, say 200 instead of 404)?

franciscawijaya commented 1 week ago

The duplicate problem is resolved!

Great!

Thus, I removed the duplicates in accordance with this observation.

@franciscawijaya, how do you know which of multiple entries is the valid one (assuming that they are not identical but, for example, have a different response code, say 200 instead of 404)?

There were some considerations I took into account when choosing which entry is the valid one:

  1. While most of the sites appear twice, the range yelp.com - abc27.com appears three times, suggesting an attempt at running the script that broke partway through.

  2. Recognizing that the duplicates are ordered and that a new list starts in the middle (i.e., a new list starting with yelp before the previous list finishes), my intuition was to pick the latest one (the one appearing later); a sketch of this rule appears after this list.

    Screenshot 2024-07-10 at 11 12 24 AM
  3. After removing the extra yelp.com - abc27.com list, I was then left with two consecutive lists of the full length of our crawl (11709 sites). So I made a side-by-side comparison between the two. Looking through the lists and comparing the differences manually, I recognized that the notable differences are 404/403 status codes versus null status codes. I reconfirmed that a null response code implies a timeout error, which for our purposes is equivalent to a 404 (the server cannot find the requested resource) or a 403 (we did not have access to the site while the script was running).

    Screenshot 2024-07-10 at 11 52 33 AM
Screenshot 2024-07-10 at 11 34 11 AM Screenshot 2024-07-10 at 11 34 27 AM
  4. The tipping point for choosing the second/latest list was the finding that, toward the end of the first list, many of the status codes were 202, which suggests that the request was accepted for processing but the processing had not finished yet. The latest list, on the other hand, gives a different status code for these sites, namely a fixed and clear 404 or 403. Screenshot 2024-07-10 at 11 33 44 AM Screenshot 2024-07-10 at 11 33 39 AM Screenshot 2024-07-10 at 11 33 32 AM
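A sketch of the "pick the latest occurrence" rule from point 2, assuming the combined well-known CSV has one row per site attempt in crawl order and a site_url column (both names are guesses):

```python
import pandas as pd

def keep_latest_attempt(path: str, col: str = "site_url") -> pd.DataFrame:
    """Keep only the last (most recent) row for each site, preserving crawl order."""
    df = pd.read_csv(path)
    return df.drop_duplicates(subset=col, keep="last").reset_index(drop=True)

deduped = keep_latest_attempt("well-known-june.csv")  # hypothetical file name
assert deduped["site_url"].is_unique
```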
franciscawijaya commented 1 week ago

What I plan to do today is to rerun the python script to confirm and see:

  1. if the attempt fails
  2. if the attempt succeeds and then see if the list is repeating
  3. compare the results with the chosen entry discussed above (I recognize that they might change since we are running the script at a different time and some sites may have changed something in the interim, but I would still like to observe the general nature of the list).
SebastianZimmeck commented 1 week ago

rerun the python script to confirm and see

Sounds good!

Also, try reconstructing as much as possible of the well-known data from your June crawl. The most important point is whether a site has a .well-known and, if so, what the retrieved value is. If we are lucky all duplicates have the same value (or we can tell which is the valid well-known crawl).
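For reference, checking a site's GPC support under the .well-known convention comes down to requesting /.well-known/gpc.json and recording the status code and body; a minimal sketch (not the project's actual well-known script):

```python
import requests

def fetch_gpc_well_known(domain: str, timeout: int = 10):
    """Return (status_code, parsed JSON or None) for a site's /.well-known/gpc.json."""
    url = f"https://{domain}/.well-known/gpc.json"
    try:
        resp = requests.get(url, timeout=timeout)
    except requests.RequestException:
        return None, None  # e.g., a timeout would show up as a null status code
    try:
        return resp.status_code, resp.json()
    except ValueError:
        return resp.status_code, None

# A GPC-supporting site returns JSON along the lines of {'gpc': True, 'version': 1}.
print(fetch_gpc_well_known("yelp.com"))
```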

While most of the sites appear twice ...

I was then left with two consecutive lists of the full length of our crawl (11709 sites) ...

@franciscawijaya, do you mean the sites appear twice/you have two lists because one list is from the April crawl and the other from the June crawl?

franciscawijaya commented 1 week ago

@franciscawijaya, do you mean the sites appear twice/you have two lists because one list is from the April crawl and the other from the June crawl?

The two lists that I compared were the duplicates that appear in the June crawl. The first comparison I did was between April and June, which allowed me to identify that there are two lists in June, since the June list is double the size of the April list; a side-by-side comparison between the two helped me identify where the duplicates started appearing.

After that, I copied the second list that appears in the June crawl, pasted it into a new CSV file to compare it with the first list in the June crawl, and made the various aforementioned observations.

Also, try reconstructing as much as possible of the well-known data from your June crawl. The most important point is whether a site has a .well-known and, if so, what the retrieved value is. If we are lucky all duplicates have the same value (or we can tell which is the valid well-known crawl).

Got it! I will provide updates once I'm done with the new crawl and comparison.

franciscawijaya commented 1 week ago

I have just finished the rerun of the Python script and analyzed the differences between the June crawl (the latest updated one without the duplicates) and the recent rerun, and I think we are good to go!

Some important observations for that conclusion:

  1. Most of the differences between the two lists come from the request status code (i.e., 429, 440, 500, None), which is not the most important data we are trying to get.

  2. There were some notable differences in the well-known data from the recent rerun, but based on the analysis, all the different outputs suggest that they were just recently updated (i.e., previously there was no JSON data of {'gpc': True, 'version': 1}, meaning the value was added after our first June crawl), so it makes sense that our June crawl, done in early June, did not record the recent update.

Some examples are:

Screenshot 2024-07-10 at 11 53 28 PM Screenshot 2024-07-10 at 11 53 37 PM Screenshot 2024-07-10 at 11 54 33 PM Screenshot 2024-07-10 at 11 54 39 PM
  3. The only other difference in the well-known data is for yelp.com. Since we identified last time that this was a VPN problem (we had been using the same LA VPN IP address, and when we turned off the VPN or changed to another LA VPN IP address, it worked), in this rerun yelp.com has an output for well-known.

    Screenshot 2024-07-11 at 12 22 04 AM
  4. Other than that, I manually checked the two CSV files side by side from top to bottom and confirmed that all the well-known data is exactly the same as in the June crawl.

Hence, we can conclude that the duplicate problem really came from the fact that the crawl failed in the middle last time, and that the latest list in the June crawl (i.e., the second list in order) is the accurate one, based on the comparison with the rerun.
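The side-by-side comparison itself could also be scripted; a sketch, assuming both runs are CSVs with site_url, status_code, and json columns (all three column names are guesses):

```python
import pandas as pd

def compare_well_known(june_path: str, rerun_path: str, key: str = "site_url") -> pd.DataFrame:
    """Show sites whose .well-known JSON value differs between two runs."""
    june = pd.read_csv(june_path).set_index(key)
    rerun = pd.read_csv(rerun_path).set_index(key)
    merged = june.join(rerun, lsuffix="_june", rsuffix="_rerun", how="inner")
    changed = merged[merged["json_june"].fillna("") != merged["json_rerun"].fillna("")]
    return changed[["json_june", "status_code_june", "json_rerun", "status_code_rerun"]]

print(compare_well_known("well-known-june.csv", "well-known-rerun.csv"))
```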

SebastianZimmeck commented 1 week ago

Thanks, @franciscawijaya!

... all the different outputs suggest that they were just recently updated ...

If a site does not have a .well-known/gpc.json, we generally expect a 404 response code to be returned (except for the edge cases you already documented in the readme)? If so, can you add that to the readme?

... the latest list in the June crawl (i.e., the second list in order) is the accurate one, based on the comparison with the rerun.

We now have all the following?

franciscawijaya commented 1 week ago

I have updated the README.

Yes, I added the crawler protocol last time (especially for the well-known script, which failed once before eventually running successfully). Under this protocol, the rerun that I did yesterday ran without failing.

I have also confirmed that we have the correct well-known list for the June crawl after removing the duplicates that came from the failed attempt, and I have verified this data against the recent rerun (based on the observations made in the previous comments on this issue).

We no longer have duplicates in the well-known list and the Google Colab. However, after our team meeting today, I rechecked the Google Colab and realized that the newly added site https://www.washingtonpost.com/ (https://github.com/privacy-tech-lab/gpc-web-crawler/issues/108#issuecomment-2166541276) does not appear in the final crawl_data for June. I looked into this and found that it is because we do not have the necessary details for the site (in particular, its Tranco rank).

I.e., it appeared in the Google Colab before we merged the crawl data with the Tranco ranks (the crawl collected data for the site, as it is present in the raw data, but the site drops out when we merge with the Tranco rank details because we do not have Tranco details for it).
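If the merge is an inner merge, rows without a matching Tranco entry are silently dropped. A sketch of how such sites could be surfaced explicitly (assuming the crawl data and the Tranco file share a domain column; names are guesses):

```python
import pandas as pd

def sites_missing_tranco(crawl_df: pd.DataFrame, tranco_df: pd.DataFrame, key: str = "domain"):
    """List crawled sites that an inner merge with the Tranco ranks would drop."""
    merged = crawl_df.merge(tranco_df, on=key, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", key].tolist()

# For the June data this should surface washingtonpost.com (no Tranco details available).
```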

Screenshot 2024-07-11 at 12 50 40 PM Screenshot 2024-07-11 at 12 50 46 PM

I double-checked our master file of Tranco details, and sure enough, the Washington Post is not there. All of the sites that we used for our crawl are present up to theaquaponicsource.com, which has been the last site in our past crawls (i.e., sites number 1-11707 in Web-crawl-domains are the ones we have been using for the past crawls). So we are only missing the data for the new site that we added to the crawl.

Screenshot 2024-07-11 at 12 53 11 PM

We have data for 14614 sites sorted by Tranco rank in this web-crawl-domains file, and unfortunately washingtonpost isn't in the list (which I would assume means that it is not in the top 14614 ranks).

I have also tried to find these details in the full CSV files (see the attachment for which files I am referring to), which contain the full list of sites that we scraped, sorted by technology and location. However, washingtonpost is not present in these original files either. I would appreciate any input on this :)

Screenshot 2024-07-11 at 1 07 29 PM
SebastianZimmeck commented 1 week ago

However, washingtonpost is not present in these original files either. I would appreciate any input on this :)

OK, no worries! For the June crawl data and going forward, let's take washingtonpost.com out of the list of sites to crawl and out of the analysis results.

We can crawl washingtonpost.com separately and report the results directly to @AramZS.

franciscawijaya commented 1 week ago

Noted! Closing this issue for now as the duplicate problem has been resolved.