samvera / questioning_authority

Question your authorities
Other
54 stars 30 forks source link

Periodically Failing Tests resulting in failing master build #260

Closed tpendragon closed 4 years ago

tpendragon commented 5 years ago

https://circleci.com/gh/samvera/questioning_authority/766

botimer commented 5 years ago

I've done a pretty thorough investigation today. The webmock response takes effect at the Net::HTTP level by way of the RDF gem's graph loading. The Faraday stuff for FAST is not implicated here. The failures only happened on one occasion that I see, and they all appear to be from mismatched expectations vs. mocked responses, not real HTTP requests leaking out.

Specifically, multiple were from getting a Qa::ServiceError (500) instead of Qa::ServiceUnavailable (503), but other mocked responses hit other examples, too.

My only suspicion at this point is that there was an unusual timing/parallel issue. I've run the suite with the random seed of the failed run (44328), and it passes. There is a bit of code in the RDF gem that uses Net::HTTP::start in an until loop, and then there is some class-level interception in webmock. This is likely a narrow isolation issue well upstream.

Since this has only failed once and we have confirmed that the mocks are at least being applied consistently, I'll close this for now, pending recurrent problems.

botimer commented 5 years ago

This is failing more often now. There must be some sticky/leaking state. Looking into it...

jrgriffiniii commented 4 years ago

Exploring these failures on CircleCI, I can replicate these over SSH and use that environment for debugging. From what I’m seeing on CI, https://github.com/samvera/questioning_authority/blob/master/app/services/qa/linked_data/graph_service.rb#L93 parses the error and looks for the HTTP status code when RDF::Graph.load(url) fails to load from a URL in https://github.com/samvera/questioning_authority/blob/master/app/services/qa/linked_data/graph_service.rb#L11 https://github.com/samvera/questioning_authority/blob/master/app/services/qa/linked_data/graph_service.rb#L104 is what generates the Internal Server Error - Fetch term... which fail the expectations

Further, https://github.com/ruby-rdf/rdf/blob/master/lib/rdf/reader.rb#L199 returns nil, and the failures seem to start on https://github.com/ruby-rdf/rdf/blob/master/lib/rdf/reader.rb#L209. I was able to trigger https://github.com/ruby-rdf/rdf/blob/master/lib/rdf/reader.rb#L234 to be raised.

jrgriffiniii commented 4 years ago

elrayle further diagnosed this issue on CircleCI, finding that there was a disparity in error messages which are parsed for the HTTP status code:

• local msg=http://experimental.worldcat.org/fast/search?maximumRecords=3&query=cql.any%20all%20%22cornell%22&sortKeys=usage: (404) • CCI msg=http://experimental.worldcat.org/fast/search?maximumRecords=3&query=cql.any%20all%20%22cornell%22&sortKeys=usage: 404

Please note the missing parenthesis in the error message raised on CircleCI.

botimer commented 4 years ago

This was caused by an older version of the Coveralls gem being included in the bundle on CircleCI. How?

  1. QA depends on the coveralls gem without a version specified
  2. The cached bundle for CI builds used version 0.7.2, which depends on rest-client
  3. When rest-client is loaded, RDF::Util::File uses it, rather than Net::HTTP
  4. The IOError raised only includes a formatted message, which is parsed to re-specify the error for downstream use within QA
  5. The adapters format their error messages differently (without and with parentheses, respectively), so the environmental difference between an old bundle and a new one (coveralls 0.8.23) results in different QA errors being raised by way of a missed case condition

As far as the underlying issues:

  1. The rdf gem should return structured info it already has (subclass IOError to include url and status code attributes)
  2. The rdf gem uses implicit config that it does not log or reveal in a meaningful way (i.e., RDF::Graph.load relies on a class variable of RDF::Util::File that is lazily set based on which gems are loaded, so even the existence of the adapter, let alone its identity, is not obvious)
  3. Implicit dependency versions combined with caching conspire to create a CI environment that is sensitive to when it was initialized, in a way that is challenging to detect or reproduce locally (even using circleci local execute on a workstation)
botimer commented 4 years ago

PR #264 is merged and works around the issue. We will follow up with ruby-rdf upstream and, pending changes, move to a more robust solution (e.g., explicit status codes rather than parsing).

elrayle commented 4 years ago

The only additional info I would add to this is that the occurrence of error message with parentheses and without was random causing random failures of the tests. The patch in PR #264 uses regex to extract the error code and allows for both formats of the error message. This is only slightly less fragile than before. The recommendation to move to a structured error message for RDF::Graph.load would allow the error code processing to be more robust.