scylladb / gemini

Test data integrity by comparing against an Oracle running in parallel
Apache License 2.0

Log details required for debugging failed tests #409

Closed. ptrsmrn closed this issue 6 months ago.

ptrsmrn commented 6 months ago

Given https://github.com/scylladb/scylla-enterprise/issues/3642#issuecomment-1849642826:

> the tool is not trying to explain what happened, or what sequence of operations led to the bad result

https://github.com/scylladb/scylla-enterprise/issues/3642#issuecomment-1838697317:

> We don't know anything about the statements that gemini executed — we only know the result.

https://github.com/scylladb/scylla-enterprise/issues/3642#issuecomment-1838820807:

> I don't think I will be able to figure anything out without reproducing this issue with some actual logs.

Can we improve logging to the point that the cited issue and similar issues are (easily) debuggable? That would include:

  1. logging executed statements (see the sketch below),
  2. the ability to re-execute these statements using the gemini framework.
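
For point 1, something along these lines would already help. This is only a rough sketch, assuming gocql is the driver and zap the logger; `execAndLog` is a made-up name, not existing gemini code:

```go
package sketch

import (
	"context"

	"github.com/gocql/gocql"
	"go.uber.org/zap"
)

// execAndLog is a hypothetical single choke point for statement execution:
// every CQL statement gemini sends would pass through it, so a failed run
// leaves a trace of exactly what was executed and with which values.
func execAndLog(ctx context.Context, s *gocql.Session, log *zap.Logger,
	stmt string, values ...interface{}) error {
	err := s.Query(stmt, values...).WithContext(ctx).Exec()
	log.Info("cql executed",
		zap.String("stmt", stmt),
		zap.Any("values", values),
		zap.Error(err))
	return err
}
```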

cc @mykaul @nuivall

bhalevy commented 6 months ago

For the purpose of reproducing issues, it would also be good to be able to replay the queries from the log. For that, I believe that if the log consisted of CQL commands (with USING TIMESTAMP), we could just use cqlsh to replay them.
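
Roughly, each executed statement would be written out with its bound values inlined so the file stands on its own. A sketch only; `appendReplayable` and the literal rendering below are illustrative, not existing code:

```go
package sketch

import (
	"fmt"
	"io"
	"strings"
)

// appendReplayable writes one executed statement to w as a self-contained CQL
// line: '?' placeholders are replaced left-to-right with literal values, so
// the resulting file can later be fed straight to cqlsh. The statement is
// assumed to already carry its USING TIMESTAMP clause, which preserves the
// original write order on replay.
func appendReplayable(w io.Writer, stmt string, values ...interface{}) error {
	for _, v := range values {
		var lit string
		switch x := v.(type) {
		case string:
			lit = "'" + strings.ReplaceAll(x, "'", "''") + "'" // escape quotes
		case []byte:
			lit = fmt.Sprintf("0x%x", x) // blobs as hex literals
		default:
			lit = fmt.Sprintf("%v", x) // numbers, booleans, etc.
		}
		stmt = strings.Replace(stmt, "?", lit, 1)
	}
	_, err := fmt.Fprintln(w, stmt+";") // cqlsh expects ';' terminators
	return err
}
```

The resulting file could then be replayed with something like `cqlsh -f replay.cql`, assuming the statements use fully qualified table names.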

dkropachev commented 6 months ago

@ptrsmrn, @bhalevy, I have a different perspective on the issue; let's discuss it:

  1. The issue is 100% valid, and I have been thinking about how to solve it for quite a long time.
  2. I don't think logging SQL statements is the best solution, since it is going to produce tons of data, most of which is not useful. Say it runs at a modest 500 rq/s; that generates 1,800,000 records over an hour. The smallest SCT test is a 4h test, so you will get 7,200,000 records, which in the very best case translates to about 196 MB of data, which is hard to store, retrieve and analyze.

Instead of logging (or in parallel to it) I would suggest the following:

  1. Every query is remembered by gemini (keeping them in memory is not an option; we can go with sqlite), in such a manner that it can be found by pk/ck.
  2. Test-wide queries, like schema changes, are stored separately.
  3. When an error happens, gemini pulls all the queries that were executed for the given pk/ck, together with the test-wide queries, and logs/prints them (see the sketch after this list).
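
A rough sketch of the idea; the table layout, the driver choice and the names are all illustrative, not a concrete design:

```go
package sketch

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // sqlite driver; any driver would do
)

// queryStore remembers every executed statement keyed by its pk/ck so that,
// on a validation error, only the history of the offending row is dumped.
type queryStore struct{ db *sql.DB }

func openStore(path string) (*queryStore, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS queries (
		ts INTEGER, pk TEXT, ck TEXT, stmt TEXT, vals TEXT)`)
	return &queryStore{db: db}, err
}

// record stores one executed statement together with the pk/ck it touched.
func (s *queryStore) record(ts int64, pk, ck, stmt, vals string) error {
	_, err := s.db.Exec(`INSERT INTO queries VALUES (?, ?, ?, ?, ?)`,
		ts, pk, ck, stmt, vals)
	return err
}

// historyFor returns, in execution order, every statement that touched the
// given pk/ck; test-wide statements (DDL) would be stored and fetched separately.
func (s *queryStore) historyFor(pk, ck string) ([]string, error) {
	rows, err := s.db.Query(
		`SELECT stmt FROM queries WHERE pk = ? AND ck = ? ORDER BY ts`, pk, ck)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var out []string
	for rows.Next() {
		var stmt string
		if err := rows.Scan(&stmt); err != nil {
			return nil, err
		}
		out = append(out, stmt)
	}
	return out, rows.Err()
}
```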

The reproducibility problem should be addressed by setting seeds. Unfortunately, that does not work properly yet (https://github.com/scylladb/gemini/issues/370), but once that issue is fixed you should be able to reproduce the exact CQL flow by just setting --schema-seed and --seed.
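
For context, the reason seeds are enough is that the whole statement stream is derived from the PRNG, so the same seed yields the same statements in the same order. A toy illustration, not gemini's actual generator:

```go
package sketch

import (
	"fmt"
	"math/rand"
)

// statementStream is a toy generator: with a fixed seed the PRNG yields the
// identical sequence every run, so the same seed reproduces the same
// statements in the same order. Gemini's real generators are far richer, but
// the reproducibility argument is the same.
func statementStream(seed int64, n int) []string {
	r := rand.New(rand.NewSource(seed))
	stmts := make([]string, 0, n)
	for i := 0; i < n; i++ {
		pk := r.Intn(1000)
		stmts = append(stmts,
			fmt.Sprintf("INSERT INTO ks.t (pk, v) VALUES (%d, %d)", pk, r.Int63()))
	}
	return stmts
}
```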

bhalevy commented 6 months ago

> @ptrsmrn, @bhalevy, I have a different perspective on the issue; let's discuss it:
>
>   1. The issue is 100% valid, and I have been thinking about how to solve it for quite a long time.
>   2. I don't think logging SQL statements is the best solution, since it is going to produce tons of data, most of which is not useful. Say it runs at a modest 500 rq/s; that generates 1,800,000 records over an hour. The smallest SCT test is a 4h test, so you will get 7,200,000 records, which in the very best case translates to about 196 MB of data, which is hard to store, retrieve and analyze.

196 MB are not hard to store.

As for analyzing the log, leave it to the engineer who's working on the issue. One can use other tools like grep, or write scripts to automate parts of the analysis. But we need the raw data to start with. Hiding stuff based on what we know today doesn't help, since we will need that data for analyzing future problems that are unknown yet, so I would suggest we "expect the unexpected".

> Instead of logging (or in parallel to it) I would suggest the following:
>
>   1. Every query is remembered by gemini (keeping them in memory is not an option; we can go with sqlite), in such a manner that it can be found by pk/ck.
>   2. Test-wide queries, like schema changes, are stored separately.
>   3. When an error happens, gemini pulls all the queries that were executed for the given pk/ck, together with the test-wide queries, and logs/prints them.
>
> The reproducibility problem should be addressed by setting seeds. Unfortunately, that does not work properly yet (#370), but once that issue is fixed you should be able to reproduce the exact CQL flow by just setting --schema-seed and --seed.

ptrsmrn commented 6 months ago

@dkropachev Saving logs only for failed tests makes sense, given their size. How the gemini framework stores them internally, in a file or in sqlite, is not a concern for devs as long as they don't need to query sqlite themselves, so having gemini log them to the screen/a file or anything alike, as you suggested, is IMO good. Essentially I'd just make these extra logs easily available to devs, so we don't hear further complaints ;) A note regarding file size: it's best not to produce files that are too big to open comfortably or that are bloated with "spam", because that will make the debugging process harder. But that may be wishful thinking, as I understand each test produces tons of queries.