privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Deal with null entries in post-crawl data analysis #121

Open franciscawijaya opened 2 months ago

franciscawijaya commented 2 months ago

As we have identified from the April and June crawls, there have always been sites with empty entries (No Data). For these two months, there are 900+ empty entries.

I have been brainstorming about what to do with these entries, which have also been present in past crawls. I was initially thinking of including them in the error page, but I'm not sure that would be the best move, given that the null entries for one site could have a different cause than those of another site (i.e., it is not a definite, known error such as HumanCheckError, InsecureCertificateError, or WebDriverError).

[Screenshot 2024-07-11 at 8.23.42 PM]

Right now, I am thinking of just reporting these empty results in our data analysis, maybe adding a column so that we can create figures showing the percentage of sites in our data set that give empty entries for the month's crawl?

SebastianZimmeck commented 2 months ago

> Right now, I am thinking of just reporting these empty results in our data analysis, maybe adding a column so that we can create figures showing the percentage of sites in our data set that give empty entries for the month's crawl?

Yes, that works! We can deal with the null entries at the data analysis step. We do not need to create figures for the null failures, but we should be able to say x% of a given crawl had all null values, just like we are able to say that y% of sites had a certain error.

For both null values and any errors, we should not include these sites in any analysis statistics or figures we present, as those do not have meaning for the analysis.

Do you have a thought on what causes the null entries? If we do a second attempt, would that lead to fewer null entries (i.e., applying the same approach as for errors)?

franciscawijaya commented 2 months ago

> Yes, that works! We can deal with the null entries at the data analysis step. We do not need to create figures for the null failures, but we should be able to say x% of a given crawl had all null values, just like we are able to say that y% of sites had a certain error.

Got it! I'll be working on the code to calculate the percentage of null entries in the data analysis.
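For reference, here is a minimal sketch of how that percentage could be computed, assuming the crawl results are loaded into a pandas DataFrame with the columns we use in the analysis (the file name and the split into bookkeeping vs. data columns are assumptions, not the actual Colab code):

```python
import pandas as pd

# Hypothetical file name; in practice this would be the month's crawl export used in the Colab.
df = pd.read_csv("june_crawl_results.csv")

# Split off bookkeeping columns so that only crawl data columns are checked for nulls.
# This split is an assumption and would need to match the actual analysis columns.
bookkeeping = {"Site URL", "site_id", "status", "error", "Tranco", "Well-known"}
data_columns = [c for c in df.columns if c not in bookkeeping]

# A site counts as a null entry when every data column is empty
# (assuming empty cells are read in as NaN).
all_null = df[data_columns].isna().all(axis=1)

pct_null = 100 * all_null.mean()
print(f"{all_null.sum()} of {len(df)} sites ({pct_null:.2f}%) have all null values")
```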

> Do you have a thought on what causes the null entries? If we do a second attempt, would that lead to fewer null entries (i.e., applying the same approach as for errors)?

Based on my manual review of the list of sites with null entries, I don't think there is one specific cause that generalizes to all of them. However, as discussed in the previous issue, we concluded that yelp.com was a VPN issue (potentially because we have been accessing the site with the same LA VPN IP address): when I did another crawl after the June crawl with the same VPN, it still failed, but it succeeded when I changed to another LA VPN with a different IP address.

While this is the case for yelp.com, we can't say for sure that the cause of the null entries is the same for all the other null sites. Nevertheless, looking at the null entries that have been consistently present in previous crawls, especially for some big names like meta.com and apple.com, I suspect that the cause is along the same lines, i.e., they recognize or block the VPN IP address. Although it is curious that they just serve a blank page instead of an explicit 'Access Denied' page, as discussed in issue 51.

Since I have successfully collected the sites that gave empty entries in June (a slightly shorter list than the April one), I will also try another crawl on just this list of null-entry sites; last time, I only tried yelp.com to troubleshoot what was causing the null entries. Since I tried yelp.com with a different LA VPN IP address last time to check whether my hypothesis was true, I will follow that same methodology for the re-crawl today.
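For illustration, and continuing the sketch above, the re-crawl list could be pulled from the June results roughly like this (the output file name is hypothetical):

```python
# Continuing the sketch above: collect the URLs of the all-null sites so they can be
# re-crawled with a different LA VPN IP address. The output file name is hypothetical.
recrawl_sites = df.loc[all_null, "Site URL"]
recrawl_sites.to_csv("june_null_entry_sites.csv", index=False)
```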

SebastianZimmeck commented 2 months ago

Sounds good!

We do not necessarily need to figure out the reason for the null entries for sure. But it is nice for the paper to say that the VPN IP address blocking is the reason for at least some.

As we discussed, maybe you will find that a different LA VPN address for the next crawl results in fewer null entries. We may also update the crawl protocol slightly by doing a second crawl for all null entries (just like we do for other errors) with a different LA VPN.

franciscawijaya commented 2 months ago

I finished the re-crawl specifically for the sites that returned all null values and manually looked through the results.

Some important observations:

I double-checked this conclusion by looking through our results from previous crawls for sites with specific identified errors like HumanCheckError: all of their entries also have null values; only the error column is filled with the identified error.

We do not necessarily need to figure out the reason for the null entries for sure. But it is nice for the paper to say that the VPN IP address blocking is the reason for at least some.

I think the null entries come from an amalgamation of possible causes. For example, yelp.com seems to be an explicit VPN issue because of the experiment that I did a few weeks ago and reconfirmed with this crawl (it went from a blank page to being accessible after I changed the IP address). There are other forms of such VPN errors. For instance, an "Access Denied" page could also be the result of the VPN IP address being blocked.

[Screenshot 2024-07-12 at 5.35.06 PM]

However, there are also other causes, like "WebDriverError: Reached Error Page", "HumanCheckError", "InsecureCertificateError", and "TimeoutError", that were identified for these sites in the re-crawl, as mentioned in my third observation above.

franciscawijaya commented 2 months ago

Outcome: We wanted to explicitly identify the sites with only null entries. We noticed that these null entries have been present in the previous crawls in roughly similar numbers of sites.

We found that these null entries are similar to an error. After re-crawling the sites with previously null entries, we found that the null entries indicate a precursor to an error; in the re-crawl, our crawler identified and flagged some of these sites with "WebDriverError: Reached Error Page", "HumanCheckError", "InsecureCertificateError", and "TimeoutError", which may have prevented us from accessing these sites' data and thus resulted in null entries.

We also found that, for some of the other sites, it could be a VPN error. For instance, after doing the re-crawl with a different VPN IP address, we managed to get data for yelp.com, which previously had empty entries. At the same time, we also noticed sites that still blocked access, potentially because they recognized our VPN IP address.

SebastianZimmeck commented 2 months ago

Well said, @franciscawijaya!

For the future we will:

(If any of these warrant more discussion, feel free to open a new issue. But it is also OK to address these points here if the answers are straightforward.)

franciscawijaya commented 2 months ago

Update: I have added the code in the Colab to calculate the percentage of sites with all null values and have also made sure that the figures for the monthly data analysis do not use any of the null-value and/or error sites in their calculations. (For June, these null-value and/or error sites made up 8.47% of our crawl list of 11,708 sites.)

Misc. notes: In my calculation of the percentage of sites with all null values, I identified and included both (1) the sites with empty entries that we have been discussing (all null values, but GPC was sent and the status was 'added') and (2) sites whose null values are due to an explicit error (GPC was not sent from the start and the status was 'not added').

Examples of these two different kinds of null values, for reference:

| Site URL | site_id | status | domain | sent_gpc | uspapi_before_gpc | uspapi_after_gpc | usp_cookies_before_gpc | usp_cookies_after_gpc | OptanonConsent_before_gpc | OptanonConsent_after_gpc | gpp_before_gpc | gpp_after_gpc | urlClassification | OneTrustWPCCPAGoogleOptOut_before_gpc_x | OneTrustWPCCPAGoogleOptOut_after_gpc_x | OTGPPConsent_before_gpc_x | OTGPPConsent_after_gpc_x | usps_before_gpc | usps_after_gpc | decoded_gpp_before_gpc | decoded_gpp_after_gpc | USPS implementation | error | Well-known | Tranco | OneTrustWPCCPAGoogleOptOut_before_gpc_y | OneTrustWPCCPAGoogleOptOut_after_gpc_y | OTGPPConsent_before_gpc_y | OTGPPConsent_after_gpc_y | third_party_count | third_party_urls | unique_ad_networks | num_unique_ad_networks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| https://apple.com (1) | 18 | added | apple.com | 1 | null | null | null | null | null | null | null | null | {"firstParty":{},"thirdParty":{}} | null | null | null | null | null | null | None | None | neither | null | None | 42 | null | null | null | null | 0 | {} | [] | 0 |
| https://sprint.com (2) | 84 | not added | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | null | None | None | null | WebDriverError: Reached Error Page, singleTimeoutError | None | 152 | null | null | null | null | null | nan | nan | null |

I counted these two types of null-value/error sites in my percentage calculation accordingly and excluded them from the figures.

Next: I'll be working on updating the code for Crawl_Data_Over_Time (though this might take more time, as I'm still working on fully understanding this Colab file).
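For reference, here is a minimal sketch of how these two cases could be flagged and excluded before the figures are generated. It assumes the crawl results are in a pandas DataFrame with the columns shown above; the file name, the set of signal columns checked, and the classification logic are illustrative assumptions, not the actual Colab code:

```python
import pandas as pd

# Hypothetical file name for the month's merged crawl data.
df = pd.read_csv("june_crawl_results.csv")

# GPC-related signal columns to check for nulls (an assumed subset of the full schema).
signal_columns = [
    "uspapi_before_gpc", "uspapi_after_gpc",
    "usp_cookies_before_gpc", "usp_cookies_after_gpc",
    "OptanonConsent_before_gpc", "OptanonConsent_after_gpc",
    "gpp_before_gpc", "gpp_after_gpc",
]

# True where none of the signal columns came back (assuming empty cells load as NaN).
all_null = df[signal_columns].isna().all(axis=1)

# (1) Empty entries: GPC was sent and the status was added, but no signal data came back.
empty_entries = all_null & (df["status"] == "added") & (df["sent_gpc"] == 1)

# (2) Explicit errors: GPC was never sent and the status stayed "not added".
error_entries = all_null & (df["status"] == "not added")

excluded = empty_entries | error_entries
print(f"Excluding {excluded.sum()} of {len(df)} sites ({100 * excluded.mean():.2f}%) from the figures")

# Only the remaining rows feed into the monthly figures.
clean_df = df[~excluded]
```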
SebastianZimmeck commented 3 days ago

As we discussed today, at this point this issue is purely one for the data processing after the crawl. @franciscawijaya mentioned that the Crawl_Data_Over_Time still needs to be updated. Both @franciscawijaya and @natelevinson10 will work on adapting this and the other scripts as necessary.