In a 1 million site crawl, our most recent Firefox 60 support has a relatively high command crash rate (13% of get commands). Compare this to 3.5% for the current Firefox 52-based master branch. A (partial) log file from a recent 1M site crawl is available here: https://www.dropbox.com/s/320d4my3b2prnc8/2019_01_29_alexa1m_logs.bz2?dl=0
How is this number calculated? Do we store information about crashes in the resulting crawl datasets?
It seems like the right thing to do here is to set up Sentry or a similar solution so that patterns in these errors can be properly detected and troubleshot.
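For context, a minimal sketch of what such a setup could look like with the Python sentry-sdk package; the DSN value and the visit_site() stand-in are placeholders, not OpenWPM code:

```python
import sentry_sdk

# The DSN is a placeholder; a real project DSN comes from the Sentry UI.
sentry_sdk.init(dsn="https://examplePublicKey@sentry.example.com/0")

def visit_site(url):
    # Hypothetical stand-in for the crawler's get command.
    raise RuntimeError("simulated command failure for %s" % url)

try:
    visit_site("http://example.com")
except Exception as exc:
    # Report the exception so recurring error patterns can be
    # aggregated and triaged in the Sentry UI.
    sentry_sdk.capture_exception(exc)
```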
> How is this number calculated? Do we store information about crashes in the resulting crawl datasets?
Yes, see https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/71922/command/71923
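For readers without notebook access, a rough sketch of the same kind of calculation against a local crawl database, assuming the crawl_history table and the bool_success encoding suggested by the triage later in this thread (1 = success, 0 = failure, -1 = timeout), and that get commands are recorded as 'GET':

```python
import sqlite3

import pandas as pd

# Path is a placeholder for a local OpenWPM crawl database.
con = sqlite3.connect("crawl-data.sqlite")
history = pd.read_sql_query(
    "SELECT command, bool_success FROM crawl_history", con
)

# Restrict to get commands (assumed to be stored as 'GET').
gets = history[history["command"] == "GET"]
print("Percentage of command failures %.2f%%" % (100 * (gets["bool_success"] == 0).mean()))
print("Percentage of command timeouts %.2f%%" % (100 * (gets["bool_success"] == -1).mean()))
print("Percentage of successful commands %.2f%%" % (100 * (gets["bool_success"] == 1).mean()))
```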
/cc @gunesacar, who also has some data on this comparing the 1M site crawls we used to run on FF52 and a crawl with the new webextensions instrumentation.
> /cc @gunesacar, who also has some data on this comparing the 1M site crawls we used to run on FF52 and a crawl with the new webextensions instrumentation.
Yes, we got 4x more GET command failures with the webextensions instrumentation compared to the previous (November 2018) crawl. See the 7th figure from the top in this notebook: https://github.com/citp/openwpm-data-release/blob/master/Crawl-Data-Metrics.ipynb
Happy to share the crawl logs and the database. You can also download them as part of the Princeton Web Census Dataset.
> Yes, we got 4x more GET command failures with the webextensions instrumentation compared to the previous (November 2018) crawl. See the 7th figure from the top in this notebook: https://github.com/citp/openwpm-data-release/blob/master/Crawl-Data-Metrics.ipynb
This is such a great notebook. Thanks for sharing this! I am looking forward to getting Sentry set up properly (#406) so that we can systematically address this.
> How is this number calculated? Do we store information about crashes in the resulting crawl datasets?
> Yes, see https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/71922/command/71923
The notebook summarizes the following failure rates, with 100% being the 893453 crawl_history records:
```
Percentage of command failures 12.26%
Percentage of command timeouts 5.61%
```
An expansion of this notebook (here) states:
```
.:: In relation to the total number of commands submitted (n = 893453)
Percentage of command failures 12.26%
Percentage of command timeouts 5.61%
Percentage of successful commands 82.13%
Percentage of non-successful commands 17.87%
```
Corresponding data for a recent pair of crawls that use the current master branch:

crawl-a:

```
.:: In relation to the total number of commands submitted (n = 107067)
Percentage of command failures 5.96%
Percentage of command timeouts 7.62%
Percentage of successful commands 86.42%
Percentage of non-successful commands 13.58%
```

crawl-b:

```
.:: In relation to the total number of commands submitted (n = 107125)
Percentage of command failures 6.13%
Percentage of command timeouts 7.53%
Percentage of successful commands 86.34%
Percentage of non-successful commands 13.66%
```
In summary, this issue is about investigating "Percentage of command failures", and it seems that it has dropped from 12.3% to about 6% since January. That is good news!
The fact that out of the 1M crawl list only 89% ended up with a crawl_history record may be related to data loss in the S3 Aggregator, which is filed as a separate issue here: https://github.com/mozilla/OpenWPM/issues/450
Recently ran a Tranco top 500 crawl, resulting in:
```
.:: In relation to the total number of commands submitted (n = 500)
Percentage of command failures 14.40%
Percentage of command timeouts 4.20%
Percentage of successful commands 81.40%
Percentage of non-successful commands 18.60%

.:: In relation to the original seed list (n = 500)
Percentage that did not result in a crawl_history record: 0.00%
Percentage that failed to result in a successful command: 18.60%
```
Source: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/156111
The 18.6% corresponds to 93 sites that were not successfully crawled.
I found 82 errors in Sentry:

* 34 errors - InvalidSessionIdException (reported as #463)
* 33 errors - about:neterror - dnsNotFound
* 10 errors - about:neterror - netTimeout
* 2 errors - Failed to decode response from marionette
* 1 error - about:neterror - connectionFailure
* 1 error - about:neterror - nssFailure
* 1 error - NoSuchWindowException
@englehardt, recent improvements in error reporting and logging have made this kind of analysis possible (awesome!), but what fate did the roughly 11 sites (2.2%) meet that got no successful command in crawl_history and no error reported in Sentry?
It would be great if crawl_history could contain information about reported Sentry errors so that this can be easily answered. It would also help with filtering out about:net errors (9%) which are probably not related to bugs in the platform.
For some manual triage:
Bool_success -1 urls:
Bool_success 0 urls:
> I found 82 errors in Sentry:
>
> * 34 errors - InvalidSessionIdException (reported as #463)
> * 33 errors - about:neterror - dnsNotFound
> * 10 errors - about:neterror - netTimeout
> * 2 errors - Failed to decode response from marionette
> * 1 error - about:neterror - connectionFailure
> * 1 error - about:neterror - nssFailure
> * 1 error - NoSuchWindowException
These InvalidSessionIdExceptions have the same cause as the Failed to decode response from marionette errors we were getting in earlier crawls. The reason we're now getting these is that I started ignoring webdriver exceptions in the alert handling code, to see if the alert handling was the cause or if it was a general issue -- the existence of these errors shows it's a general webdriver issue. The python side of webdriver is somehow losing its connection to the browser after the get() command, leading the next command to fail.
These errors have been elusive. If you know of sites that reliably cause these errors, please do share! I've sampled sites from the Sentry logs and have had trouble reproducing them locally, even when I purposefully run crawls in a memory- or cpu-constrained VM. I have a WIP PR that makes log levels configurable and bumps console output logging up to DEBUG by default. This will allow us to see the FirefoxExtension logs in GCP (which contain native logs from webdriver). Maybe we'll be able to track down the cause through that.
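For illustration, the kind of console logging configuration described above, sketched with Python's standard logging module (the logger name is illustrative, not the PR's actual code):

```python
import logging

# Bump console output up to DEBUG so that verbose extension/webdriver
# messages become visible in captured container logs.
logger = logging.getLogger("openwpm")  # illustrative logger name
logger.setLevel(logging.DEBUG)

console = logging.StreamHandler()
console.setLevel(logging.DEBUG)
console.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(console)

logger.debug("extension log lines would now reach the console output")
```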
> @englehardt, recent improvements in error reporting and logging have made this kind of analysis possible (awesome!), but what fate did the roughly 11 sites (2.2%) meet that got no successful command in crawl_history and no error reported in Sentry?
This is concerning. One possibility I see is that the records were dropped in SocketInterface.py, perhaps due to a serialization problem with something that's present in all records (e.g., the top_level_url?). This is something I'm trying to fix in #441, but progress has been slow because the cross-process error logging also uses SocketInterface.py, and thus I think we'd fall into an infinite loop of decoding errors. Would you mind letting me know the URLs of the 11 missing sites?

> It would be great if crawl_history could contain information about reported Sentry errors so that this can be easily answered. It would also help with filtering out about:net errors (9%) which are probably not related to bugs in the platform.
Agreed. I'm planning to save the serialized exceptions of command failures in a new column in crawl_history. Will that work for you? It won't contain all of the errors that lead to a crash, but will contain all of the browser errors that are currently handled by the catch-all try-except in the browser manager process.
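Something along these lines, perhaps (a sketch only; the helper name and JSON layout are illustrative, not the actual implementation):

```python
import json
import traceback

def serialize_command_error(exc):
    # Illustrative: package the failure so it could be stored in a new
    # crawl_history column and matched against Sentry reports.
    return json.dumps({
        "type": type(exc).__name__,
        "message": str(exc),
        "traceback": traceback.format_exc(),
    })

# Mirrors the catch-all try-except around command execution:
try:
    raise TimeoutError("simulated get() failure")
except Exception as exc:
    error_blob = serialize_command_error(exc)
    print(error_blob)  # would be written alongside bool_success
```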
> These InvalidSessionIdExceptions have the same cause as the Failed to decode response from marionette errors we were getting in earlier crawls.
Ah, yes, thanks. Now I remember that you explained this in a recent PR.
> These errors have been elusive. If you know of sites that reliably cause these errors, please do share!
The sites are listed above - about a third ought to reproduce the error :)
> Would you mind letting me know the URLs of the 11 missing sites?
They are listed above. I realize my previous statement was confusing. All sites have records in crawl_history, so no data loss was encountered, but since there were 93 sites with bool_success != 1 and only 82 errors in Sentry, there ought to be 11 sites with bool_success != 1 that did not get any error reported in Sentry - I just don't know which particular sites/URLs, and it's weird that they got bool_success != 1 despite not reporting an error.
> I'm planning to save the serialized exceptions of command failures in a new column in crawl_history. Will that work for you?
Sounds like a good start! Anything that helps track down the root cause of bool_success != 1 is good :)
> They are listed above. I realize my previous statement was confusing. All sites have records in crawl_history, so no data loss was encountered, but since there were 93 sites with bool_success != 1 and only 82 errors in Sentry, there ought to be 11 sites with bool_success != 1 that did not get any error reported in Sentry - I just don't know which particular sites/URLs, and it's weird that they got bool_success != 1 despite not reporting an error.
Got it. That's less concerning :) I'll add in the saving of command errors and we can investigate from there.
> It would be great if crawl_history could contain information about reported Sentry errors so that this can be easily answered. It would also help with filtering out about:net errors (9%) which are probably not related to bugs in the platform.
This is done in #473.
I think I found the cause of the high rate of WebDriverExceptions, which manifest as errors #404 and #463. I believe my intuition in https://github.com/mozilla/OpenWPM/issues/255#issuecomment-522088140 was correct. The errors are caused by Firefox crashing during the get() command. That explains why we don't see any errors until marionette attempts to issue a new command to Firefox, only to find that the connection is broken in some way.
I only observe the high crash rate when running in a Docker container, which is why all of my earlier investigations didn't turn up anything useful (I was just inspecting a few sites locally). I believe it's caused by the very low default /dev/shm (shared memory) size in Docker. See https://github.com/elgalu/docker-selenium/issues/20. This comment lists a helpful collection of issues as well (https://github.com/karma-runner/karma-firefox-launcher/issues/104#issuecomment-522371253). The fix should be easy: just bump the shared memory size of the container to 2G. That's easy to do when running Docker locally, but we may need to add it at container build time or in the kubernetes config. See also: https://stackoverflow.com/a/46434614.
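For example, when running locally (--shm-size is a standard docker run flag; the image name is a placeholder):

```sh
# Give the container a 2G /dev/shm instead of Docker's small default.
docker run --shm-size=2g openwpm-crawler

# Kubernetes has no --shm-size equivalent; the usual workaround (per the
# Stack Overflow answer linked above) is mounting a memory-backed
# emptyDir volume at /dev/shm in the pod spec.
```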
This should be fixed in https://github.com/mozilla/openwpm-crawler/pull/28
I also created https://github.com/mozilla/OpenWPM/pull/475 to throw better error messages in the event of a Firefox process crash.
This was fixed in https://github.com/mozilla/openwpm-crawler/pull/28, https://github.com/mozilla/openwpm-crawler/pull/30, and #477. In particular, https://github.com/mozilla/openwpm-crawler/pull/30#issue-311206732 gives a detailed description of why the new config parameters improve the stability of the crawler. In a recent test crawl (https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/166927/command/167004) we saw just 5 errors of this type across 100k sites.