Closed · tatsuya6502 closed this 1 year ago
Details of the issue: https://github.com/MaterializeInc/database-issues/issues/5879
It seems easy to reproduce if we can run an end-to-end integration test called "Long Zippy w/ user tables".
This happened twice in a row in the "Long Zippy w/ user tables" test: https://buildkite.com/materialize/release-qualification/builds/197#018883bf-5086-4a43-a8b4-ede1ef192de5
```yaml
- id: zippy-user-tables-large
  label: "Long Zippy w/ user tables"
  timeout_in_minutes: 2880
  agents:
    queue: linux-x86_64-large
  artifact_paths: junit_*.xml
  plugins:
    - ./ci/plugins/mzcompose:
        composition: zippy
        args: [--scenario=UserTablesLarge, --actions=200000]
```
I ran a short version of the "Zippy w/ user tables" test on Linux x86_64. I used the same scenario but a smaller number of actions: 2,000 instead of 200,000.
I could not reproduce the issue. I will try a longer run.
Environment:
Terminal Log:
```
$ git clone git@github.com:MaterializeInc/materialize.git
$ cd materialize
$ git branch moka-segfault 07386e427f
$ git checkout moka-segfault
$ rg moka
Cargo.lock
2903: "moka",
3241:name = "moka"
4380: "moka",
8629: "moka",
src/persist-client/Cargo.toml
41:moka = { version = "0.9.6", default-features = false, features = ["sync"] }
src/persist-client/src/internal/cache.rs
16:use moka::notification::RemovalCause;
17:use moka::sync::Cache;
100: // This could maybe use moka's async cache to unify any concurrent
src/workspace-hack/Cargo.toml
63:moka = { version = "0.9.6", default-features = false, features = ["atomic64", "sync"] }
164:moka = { version = "0.9.6", default-features = false, features = ["atomic64", "sync"] }
$ ./bin/mzcompose --find zippy run default --scenario UserTablesLarge --actions 2000
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-WBP7JFLWFQYUIHLXR5JSOAJ74VVKI6O7
materialize/clusterd:mzbuild-Q77JA6BIOATFUZAWANSIJFOCUXV3GOGX
materialize/materialized:mzbuild-IIUOBQEI7JVK3YSGYYV6FP3EP3JGFF5I
materialize/test-certs:mzbuild-6JWZ2MGLSYOCL3MLHIEIJMMLQLEOLCNY
materialize/postgres:mzbuild-ENTVWBFMO762DEEKAMURDMVMQ52GIO63
materialize/testdrive:mzbuild-ANAR5HEUWWDMTQUJ3ZA7XJWEY64H2ZTB
...
==> Running test case workflow-default
==> Running workflow default
...
[+] Running 3/3
✔ Container zippy-cockroach-1 Healthy
✔ Container zippy-minio-1 Healthy
✔ Container zippy-materialized-1 Healthy
> ALTER SYSTEM SET enable_unmanaged_cluster_replicas = true;
> CREATE CLUSTER storaged REPLICAS (r2 (
STORAGECTL ADDRESSES ['storaged:2100'],
STORAGE ADDRESSES ['storaged:2103'],
COMPUTECTL ADDRESSES ['storaged:2101'],
COMPUTE ADDRESSES ['storaged:2102'],
WORKERS 4
))
[+] Stopping 1/1
✔ Container zippy-materialized-1 Stopped
Going to remove zippy-materialized-1
[+] Removing 1/0
✔ Container zippy-materialized-1 Removed
[+] Building 0.0s (0/0)
[+] Running 1/1
✔ Container zippy-testdrive-1 Healthy
Generating test...
Running test...
--- #1: KafkaStart
[+] Building 0.0s (0/0)
[+] Running 2/2
✔ Container zippy-zookeeper-1 Healthy
✔ Container zippy-kafka-1 Healthy
--- #2: CockroachStart
[+] Building 0.0s (0/0)
[+] Running 1/1
✔ Container zippy-cockroach-1 Healthy
--- #3: MinioStart
[+] Building 0.0s (0/0)
[+] Running 1/1
✔ Container zippy-minio-1 Healthy
--- #4: MzStart
[+] Building 0.0s (0/0)
[+] Running 3/3
✔ Container zippy-cockroach-1 Healthy
✔ Container zippy-minio-1 Healthy
✔ Container zippy-materialized-1 Healthy
--- #5: StoragedStart
[+] Building 0.0s (0/0)
[+] Running 1/1
✔ Container zippy-storaged-1 Healthy
--- #6: CreateTable
> CREATE TABLE table_0 (f1 INTEGER);
rows match; continuing at ts 1687305573.2167296
> INSERT INTO table_0 VALUES (0);
rows match; continuing at ts 1687305573.2269592
...
> INSERT INTO table_0 SELECT * FROM generate_series(568924, 572569);
rows match; continuing at ts 1687306589.1175394
==> mzcompose: test case workflow-default succeeded
```
There are almost 300 commits between when moka was added and when it was removed: https://github.com/MaterializeInc/materialize/compare/1bcd1c4b89...07386e427f
I need to know which commit last failed.
The following commits were mentioned in https://github.com/MaterializeInc/database-issues/issues/5879.
- https://github.com/MaterializeInc/materialize/commit/9ab5f833b919f8a5200b3918eca0847f40cc3337 (Jun 4, 2023)
- "Long Zippy w/ user tables" failed.
Ran the test (with 5,000 actions) several times on my home PC, but I have not reproduced the issue.
Environment:
- Docker (x86_64)
- Intel Core i7-12700F
- 8 × performance cores × 2 hyper threads
- 4 × efficient cores
My PC (Windows 11 + WSL2) does not have enough RAM to run a longer Zippy test. It has only 16GB of RAM, and Docker containers like Kafka sometimes get killed while the test is running. I will expand the RAM to 32GB.
I hope that running a longer test will help to reproduce the issue, but another aspect could be the number of available cores/hyperthreads. My PC has 12 cores and 20 threads, but their EC2 instance has only 8 vCPUs.
I will probably adjust the number of hyper threads available for Docker:
> In WSL 2 mode, you can configure limits on the memory, CPU, and swap size allocated to the WSL 2 utility VM.
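For reference, these limits live in `%UserProfile%\.wslconfig` on the Windows side. A minimal sketch follows; the values are assumptions that match the allocation I describe in a later comment (8 logical processors, 24GiB of memory), and the settings take effect only after restarting WSL with `wsl --shutdown`:

```ini
[wsl2]
# Cap the RAM of the WSL2 utility VM so Docker containers cannot
# exhaust the host's memory.
memory=24GB
# Limit the number of logical processors exposed to WSL2/Docker.
processors=8
```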
Also, I will add some tracing logs to both Moka and Materialize to get a better understanding of the access pattern.
@tatsuya6502 I have 32 threads / 128GB RAM and can leave the test running overnight if that would help.
@0xdeafbeef Thanks for the offer. Wow, you have a great machine!
I think I am going to be okay. I already ordered the RAM modules and am waiting for them to arrive. I think 32GB should be enough for now.
FYI, I got the RAM modules for my PC, and confirmed it can run the test with at least 50,000 actions. I allocated 24GiB of memory and only 8 logical CPU cores to WSL2.
The reason I did not try 200,000 actions is that it would take a long time: 50,000 actions took 7 hours and 14 minutes, so 200,000 should take roughly four times that (estimated ~30 hours). The memory utilization was fine, so I had a feeling it could run 200,000 actions without issue.
Also, before I got the RAM modules, I was able to run 80,000 actions. I found that reducing the number of logical CPU cores for WSL2/Docker reduces the memory utilization of the test, so I allocated only 8 cores (instead of 20) and 10GiB of memory (instead of 8GiB) to WSL2, and it was able to run 80,000 actions.
I talked to a Materialize engineer. They will try to reproduce the issue using an EC2 instance and give me the full set of instructions.
https://github.com/MaterializeInc/materialize/discussions/20136#discussioncomment-6274272
> Hopefully someone from the QA team has time to attempt the repro early next week and give you the full set of instructions.
With some help from Materialize engineers, I made some progress: I reproduced the segmentation fault in Moka v0.9.6 inside Materialize and got a core dump to analyze.
NOTES: I had to use an Amazon EC2 c6a.12xlarge instance, which has 48 vCPUs, because using my home PC (20 hyperthreads) could take a few days to reproduce with the current test (details). Also, it was a bit tricky to get a core dump in a Docker container, so it took me some time to figure out. (A modified Materialize test harness for getting core dumps)
Here is the backtrace of the crashed thread:
It crashed in the `Deque::move_to_back` method (src/common/deque.rs#L202), which was called by the `base_cache::Inner::handle_upsert` method (src/sync_base/base_cache.rs#L1414).
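For context, here is a simplified sketch of the shape of a deque node in this scenario. This is an assumption for illustration, not moka's actual definition (which lives in src/common/deque.rs):

```rust
use std::ptr::NonNull;

// Assumed, simplified shape of a doubly-linked deque node. The links are
// raw, non-owning pointers, so unlinking a neighboring node mutates this
// node's `prev`/`next` out from under any stale copy of those pointers.
#[allow(dead_code)]
struct DeqNode<T> {
    prev: Option<NonNull<DeqNode<T>>>,
    next: Option<NonNull<DeqNode<T>>>,
    element: T,
}
```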
Materialize uses Moka v0.9.6, and this problem may have been fixed by #259 in v0.11.0.
I will back-port the fix to v0.9.x, and run the same test to see if the problem goes away.
If this is related to #259, the reproducing steps could be the following. However, the end result depends largely on how the compiler optimizes reading an immutable reference `&` to an LRU double-ended queue (deque). (A sketch of this sequence against the cache API follows the code snippet below.)

1. Create a cache with weighted capacity 6.
2. Stop the periodic `sync`.
3. Get non-existing key `X` once. (popularity = 1)
4. Get non-existing key `Y` once. (popularity = 1)
5. Insert key `X` with weight 2.
6. Insert key `Y` with weight 2.
7. Call `sync`:
   - `X` and `Y` are admitted. Deque order: `X` → `Y` (back)
8. Get non-existing key `V` three times. (popularity = 3)
9. Insert `V` with weight 3.
10. Invalidate `Y`.
    - This removes `Y` from the concurrent hash table (cht), but removing its node from the deque is delayed to a future `sync`.
11. Call `sync`. The following will happen, and the last step can cause a segmentation fault:
    1. `X` and `Y` are selected as the victims (entries to evict), and `V` is marked as admitted.
       - weights: `X` + `Y` = 4, `V` = 3
       - popularity: `X` + `Y` = 2, `V` = 3
    2. `X` is removed from the cht, and its deque node is unlinked and dropped.
       - Unlinking the `X` node changes the `Y` node's `prev` pointer from `Some(NonNull(X))` to `None`.
    3. Because of step 10, `Y` is no longer in the cht, so `Y`'s deque node is added to the `skipped_nodes` vector.
    4. `V`'s deque node is created and pushed to the back of the deque.
    5. The `skipped_nodes` are processed (there is only the `Y` node): the `Y` node will be moved to the back of the deque.
       - The `move_to_back` method checks `Y`'s `prev` pointer, which is `None` now.
       - However, the compiler may have cached the old `prev` pointer in a CPU register (pointing to `X`), instead of reloading the actual value `None` from memory.
       - `X` was already dropped, so trying to access `X` (in the following code: `*prev.as_ptr()`) will cause the segmentation fault.

```rust
188: pub(crate) unsafe fn move_to_back(&mut self, mut node: NonNull<DeqNode<T>>) {
...
201:     match node.prev {
202:         Some(prev) if node.next.is_some() => (*prev.as_ptr()).next = node.next,
```
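Here is a minimal sketch of that operation sequence against moka v0.9.x's public `moka::sync::Cache` API. The key names, values, and the value-as-weight weigher are illustrative assumptions, and step 2 (stopping the periodic sync) has no public API, so it is only noted in a comment:

```rust
use moka::sync::{Cache, ConcurrentCacheExt};

fn main() {
    // Weighted capacity 6; each value doubles as its own weight here.
    let cache: Cache<String, u32> = Cache::builder()
        .max_capacity(6)
        .weigher(|_key, value| *value)
        .build();

    // Step 2 (assumed): the periodic sync would have to be disabled via a
    // test-only hook, so that only the explicit `sync()` calls below run.

    let _ = cache.get("x"); // popularity(X) = 1
    let _ = cache.get("y"); // popularity(Y) = 1
    cache.insert("x".into(), 2); // weight 2
    cache.insert("y".into(), 2); // weight 2
    cache.sync(); // X and Y are admitted; deque order: X -> Y (back)

    for _ in 0..3 {
        let _ = cache.get("v"); // popularity(V) = 3
    }
    cache.insert("v".into(), 3); // weight 3
    cache.invalidate("y"); // Y leaves the cht; deque-node removal deferred

    // Evicts X and Y (combined weight 4, popularity 2) in favor of V
    // (weight 3, popularity 3); this is the call that could dereference
    // the stale `prev` pointer.
    cache.sync();
}
```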
I reviewed the code and realized that the above will not reproduce the problem. The invalidation (step 10, invalidate `Y`) needs to happen in a smaller time window: between step 11-1 (`X` and `Y` are selected as ...) and step 11-3 (`Y` is no longer in the cht, ...).

> Materialize uses Moka v0.9.6, and this problem may have been fixed by #259 in v0.11.0.
> I will back-port the fix to v0.9.x, and run the same test to see if the problem goes away.
I backported the fix to v0.9.x:
I am running "Long Zippy w/ user tables" on an EC2 c6a.12xlarge instance (48 vCPUs). It has been running for 14 hours without reproducing the problem. This looks promising.
I will release Moka v0.9.8 with #283 after I finish running the pre-release tests. And then I will let the Materialize engineers know that the problem may have been fixed.
I will backport #259 to Moka v0.10.x as well. I will keep this issue open until I am done with it.
Closing as completed.
- Moka v0.9.x
- Moka v0.10.x
Split from MaterializeInc/database-issues#93.
- Reproducing steps are yet unknown.
- Environment => Linux x86_64
- `moka::sync::Cache` instances are used in async contexts.
- Multiple `Cache` instances. (An instance per blob URL. Do not know how many.)

I am trying to reproduce the issue using mokabench, but no luck so far: https://github.com/moka-rs/moka/issues/279#issuecomment-1596820321
CC: @ParkMyCar