superfly / litefs

FUSE-based file system for replicating SQLite databases across a cluster of machines
Apache License 2.0
4.09k stars 98 forks source link

Replication Not Working #424

Open scefali opened 10 months ago

scefali commented 10 months ago

I have an app deployed on 2 machines on Fly.io. Replication between the nodes isn't working properly. When SSH in to each one and run SQL queries, I get different results. Here's my Dockerfile.

scefali commented 10 months ago

@kentcdodds This is forked version of EpicStack, have you ever seen anything like this before?

kentcdodds commented 10 months ago

I've not seen this before. Very odd. I would suggest scaling down to 1, then cloning the machine again.

scefali commented 10 months ago

@kentcdodds good idea, I removed the machine and re-added it and it looks ok now. Hopefully this split brain issue doesn't come back!

kentcdodds commented 10 months ago

🤞

scefali commented 10 months ago

@kentcdodds and it came back after a few deployments :(

benbjohnson commented 10 months ago

@scefali Can you provide some logs for the two instances when they start up? Are they showing errors connecting to one another?

scefali commented 10 months ago

@benbjohnson Here you go:

2024-01-15T21:57:14.036 app[56833295f19048] sjc [info] 2024/01/15 21:57:14 listening on [fdaa:0:5d71:a7b:24b:b2ec:af:2]:22 (DNS: [fdaa::3]:53)

2024-01-15T21:57:14.059 app[56833295f19048] sjc [info] config file read from /etc/litefs.yml

2024-01-15T21:57:14.059 app[56833295f19048] sjc [info] LiteFS v0.5.10, commit=1dc7fcfa7360313fe573c2ff7ad01ef58c05c146

2024-01-15T21:57:14.059 app[56833295f19048] sjc [info] level=INFO msg="host environment detected" type=fly.io

2024-01-15T21:57:14.059 app[56833295f19048] sjc [info] level=INFO msg="no backup client configured, skipping"

2024-01-15T21:57:14.060 app[56833295f19048] sjc [info] level=INFO msg="Using Consul to determine primary"

2024-01-15T21:57:14.283 app[56833295f19048] sjc [info] level=INFO msg="initializing consul: key=litefs/contribution-analyzer url=https://:b6de754c-5495-c209-472f-34643e3e19a9@consul-iad.fly-shared.net/contribution-analyzer-pxovqy2evgy1j2k6/ hostname=56833295f19048 advertise-url=http://56833295f19048.vm.contribution-analyzer.internal:20202"

2024-01-15T21:57:14.284 app[56833295f19048] sjc [info] level=INFO msg="using existing cluster id: \"LFSC48A9E18A2737D2E0\""

2024-01-15T21:57:14.287 app[56833295f19048] sjc [info] level=INFO msg="LiteFS mounted to: /litefs/data"

2024-01-15T21:57:14.287 app[56833295f19048] sjc [info] level=INFO msg="http server listening on: http://localhost:20202"

2024-01-15T21:57:14.287 app[56833295f19048] sjc [info] level=INFO msg="waiting to connect to cluster"

2024-01-15T21:57:14.580 app[56833295f19048] sjc [info] level=INFO msg="D3D9F5BC92A0DC5D: existing primary found (5683d266b62008), connecting as replica to \"http://5683d266b62008.vm.contribution-analyzer.internal:20202\""

2024-01-15T21:57:14.586 app[56833295f19048] sjc [info] level=INFO msg="connected to cluster, ready"

2024-01-15T21:57:14.586 app[56833295f19048] sjc [info] level=INFO msg="node is a candidate, automatically promoting to primary"

2024-01-15T21:57:14.585 app[5683d266b62008] sjc [info] level=INFO msg="6EC604529F5D5E05: stream connected ([fdaa:0:5d71:a7b:24b:b2ec:af:2]:55468)"

2024-01-15T21:57:14.667 app[56833295f19048] sjc [info] level=INFO msg="D3D9F5BC92A0DC5D: disconnected from primary, retrying"

2024-01-15T21:57:14.665 app[5683d266b62008] sjc [info] level=INFO msg="6EC604529F5D5E05: exiting primary, preserving lease for handoff"

2024-01-15T21:57:14.665 app[5683d266b62008] sjc [info] level=INFO msg="6EC604529F5D5E05: stream disconnected ([fdaa:0:5d71:a7b:24b:b2ec:af:2]:55468)"

2024-01-15T21:57:14.741 app[56833295f19048] sjc [info] level=INFO msg="D3D9F5BC92A0DC5D: acquiring existing lease from handoff"

2024-01-15T21:57:14.814 app[5683d266b62008] sjc [info] level=INFO msg="6EC604529F5D5E05: existing primary found (5683d266b62008), connecting as replica to \"http://5683d266b62008.vm.contribution-analyzer.internal:20202\""

2024-01-15T21:57:14.815 app[5683d266b62008] sjc [info] level=INFO msg="http: POST /stream: error: cannot connect to self"

2024-01-15T21:57:14.815 app[5683d266b62008] sjc [info] level=INFO msg="6EC604529F5D5E05: disconnected from primary with error, retrying: connect to primary: invalid response: code=400 ('http://5683d266b62008.vm.contribution-analyzer.internal:20202')"

2024-01-15T21:57:14.892 app[56833295f19048] sjc [info] level=INFO msg="D3D9F5BC92A0DC5D: primary lease acquired, advertising as http://56833295f19048.vm.contribution-analyzer.internal:20202"

2024-01-15T21:57:14.967 app[56833295f19048] sjc [info] level=INFO msg="proxy server listening on: http://localhost:8080"

2024-01-15T21:57:14.967 app[56833295f19048] sjc [info] level=INFO msg="executing command: node [./other/setup-swap.js]"

2024-01-15T21:57:15.159 app[56833295f19048] sjc [info] setting up swapfile...

2024-01-15T21:57:15.192 app[56833295f19048] sjc [info] swapfile setup complete

2024-01-15T21:57:15.197 app[56833295f19048] sjc [info] level=INFO msg="executing command: npx [prisma migrate deploy]"

2024-01-15T21:57:15.974 app[5683d266b62008] sjc [info] level=INFO msg="6EC604529F5D5E05: existing primary found (56833295f19048), connecting as replica to \"http://56833295f19048.vm.contribution-analyzer.internal:20202\""

2024-01-15T21:57:15.976 app[5683d266b62008] sjc [info] level=INFO msg="6EC604529F5D5E05: disconnected from primary with error, retrying: connect to primary: Post \"http://56833295f19048.vm.contribution-analyzer.internal:20202/stream\": read tcp [fdaa:0:5d71:a7b:138:6263:1e5f:2]:53966->[fdaa:0:5d71:a7b:24b:b2ec:af:2]:20202: read: connection reset by peer ('http://56833295f19048.vm.contribution-analyzer.internal:20202')"

2024-01-15T21:57:16.307 app[56833295f19048] sjc [info] Prisma schema loaded from prisma/schema.prisma

2024-01-15T21:57:16.315 app[56833295f19048] sjc [info] Datasource "db": SQLite database "data.db" at "file:./data.db?connection_limit=1"

2024-01-15T21:57:16.347 app[56833295f19048] sjc [info] SQLite database data.db created at file:./data.db?connection_limit=1

2024-01-15T21:57:16.356 app[56833295f19048] sjc [info] 1 migration found in prisma/migrations

2024-01-15T21:57:16.362 app[56833295f19048] sjc [info] Applying migration `20240106184210_init`

2024-01-15T21:57:16.369 app[56833295f19048] sjc [info] The following migration have been applied:

2024-01-15T21:57:16.369 app[56833295f19048] sjc [info] migrations/

2024-01-15T21:57:16.369 app[56833295f19048] sjc [info] └─ 20240106184210_init/

2024-01-15T21:57:16.369 app[56833295f19048] sjc [info] └─ migration.sql

2024-01-15T21:57:16.369 app[56833295f19048] sjc [info] All migrations have been successfully applied.

2024-01-15T21:57:16.424 app[56833295f19048] sjc [info] level=INFO msg="starting background subprocess: npm [start]"

2024-01-15T21:57:16.424 app[56833295f19048] sjc [info] waiting for signal or subprocess to exit

2024-01-15T21:57:16.749 app[56833295f19048] sjc [info] > start

2024-01-15T21:57:16.749 app[56833295f19048] sjc [info] > cross-env NODE_ENV=production node .

2024-01-15T21:57:17.126 app[5683d266b62008] sjc [info] level=INFO msg="6EC604529F5D5E05: existing primary found (56833295f19048), connecting as replica to \"http://56833295f19048.vm.contribution-analyzer.internal:20202\""

2024-01-15T21:57:17.130 app[56833295f19048] sjc [info] level=INFO msg="D3D9F5BC92A0DC5D: stream connected ([fdaa:0:5d71:a7b:138:6263:1e5f:2]:46138)"

2024-01-15T21:57:18.305 app[56833295f19048] sjc [info] 🚀 We have liftoff!

2024-01-15T21:57:18.305 app[56833295f19048] sjc [info] Local: http://localhost:8081

2024-01-15T21:57:18.305 app[56833295f19048] sjc [info] On Your Network: http://172.19.7.106:8081

2024-01-15T21:57:18.305 app[56833295f19048] sjc [info] Press Ctrl+C to stop

2024-01-15T21:57:25.500 runner[5683d266b62008] sjc [info] Pulling container image registry.fly.io/contribution-analyzer:deployment-01HM7HYG149DE9CXN71V5EKH3X
scefali commented 10 months ago

@benbjohnson Any ideas?