typesense / typesense

Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences
https://typesense.org

On daily bulk import queued writes halt writing for an hour or more #1992

Closed: kdevan closed this issue 3 weeks ago

kdevan commented 3 weeks ago

Description

2024-10-06T11:33:07.625 app[90801e41c6dd28] sea [info] E20241006 11:33:07.625293 986 raft_server.cpp:785] 8375 queued writes > healthy read lag of 1000
2024-10-06T11:33:16.627 app[90801e41c6dd28] sea [info] E20241006 11:33:16.626950 986 raft_server.cpp:785] 8375 queued writes > healthy read lag of 1000
2024-10-06T11:33:25.628 app[90801e41c6dd28] sea [info] E20241006 11:33:25.628088 986 raft_server.cpp:785] 8375 queued writes > healthy read lag of 1000
2024-10-06T11:33:34.630 app[90801e41c6dd28] sea [info] E20241006 11:33:34.629874 986 raft_server.cpp:785] 8375 queued writes > healthy read lag of 1000
2024-10-06T11:33:43.631 app[90801e41c6dd28] sea [info] E20241006 11:33:43.631630 986 raft_server.cpp:785] 8375 queued writes > healthy read lag of 1000
2024-10-06T11:33:52.633 app[90801e41c6dd28] sea [info] E20241006 11:33:52.633008 986 raft_server.cpp:785] 8375 queued writes > healthy read lag of 1000
2024-10-06T11:34:01.634 app[90801e41c6dd28] sea [info] E20241006 11:34:01.634512 986 raft_server.cpp:785] 8375 queued writes > healthy read lag of 1000

I'm running into an issue where, at a certain point during a daily bulk import, the queued writes seem to just stop. The lines above are only a small sample of the logs; they go back much further with the count sitting at 8375, and the number it stalls at has been different over the last few days. While this is happening, the health check returns false and no other operations are able to run. After about an hour, processing continues.

Resource-wise, the server has plenty of headroom:

40 GB free memory
0 swap used
30 GB free disk space
16 cores

Stats:

{
  "delete_latency_ms": 0,
  "delete_requests_per_second": 0,
  "import_latency_ms": 0,
  "import_requests_per_second": 0,
  "latency_ms": {
    "GET /health": 0.0
  },
  "overloaded_requests_per_second": 0,
  "pending_write_batches": 8375,
  "requests_per_second": {
    "GET /health": 0.1
  },
  "search_latency_ms": 0,
  "search_requests_per_second": 0,
  "total_requests_per_second": 0.1,
  "write_latency_ms": 0,
  "write_requests_per_second": 0
}
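For reference, the stats above come from the /stats.json endpoint. This is roughly how I watch the queue while the import runs; a minimal sketch in Python with requests, assuming the default port 8108 and the API key in a TYPESENSE_API_KEY environment variable:

import os
import time

import requests

BASE_URL = "http://localhost:8108"  # assumption: default Typesense port
HEADERS = {"X-TYPESENSE-API-KEY": os.environ["TYPESENSE_API_KEY"]}

while True:
    # /health is the check that starts reporting not-ok while the node is lagging
    health = requests.get(f"{BASE_URL}/health", headers=HEADERS).json()
    # /stats.json includes pending_write_batches, the number stuck at 8375 above
    stats = requests.get(f"{BASE_URL}/stats.json", headers=HEADERS).json()
    print(health.get("ok"), stats.get("pending_write_batches"))
    time.sleep(10)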

Any idea what this might be? Any help or direction for troubleshooting this is much appreciated.

Steps to reproduce

This happens after using /documents/import?action=upsert to upsert a large number of JSONL documents.
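The import call itself is roughly the sketch below (Python with requests; the collection name, file name, and port are placeholders standing in for whatever the actual pipeline uses):

import os

import requests

BASE_URL = "http://localhost:8108"  # assumption: default Typesense port
HEADERS = {"X-TYPESENSE-API-KEY": os.environ["TYPESENSE_API_KEY"]}

# Placeholder collection and file; the real pipeline streams many JSONL documents like this.
with open("documents.jsonl", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/collections/my_collection/documents/import",
        params={"action": "upsert"},
        headers=HEADERS,
        data=f,  # JSONL body: one document per line
    )

# The response body is also JSONL: one result object per imported document.
print(resp.text.splitlines()[:3])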

Expected Behavior

Queued writes continue to be processed until the queue is drained.

Actual Behavior

Queued writes stall for an hour or more before resuming.

Metadata

Typesense Version: 27.1

OS: Debian bookworm

kishorenc commented 3 weeks ago

After about an hour it will continue.

Are you saying that the write proceeds after some time automatically? Can you tell me if your collection has a large number of nested unique fields?

kdevan commented 3 weeks ago

Are you saying that the write proceeds after some time automatically?

Yeah I was surprised to see that.

Can you tell me if your collection has a large number of nested unique fields?

It's a flat collection of 27 fields, mostly strings with a few ints.

It starts out with a little over 21,000 queued writes, which take about four hours to process, assuming there's no halt; the halt adds roughly an extra hour on top of that.

EDIT: The pipeline was running just now and it halted at exactly the same number of queued writes. It's currently stuck at 8375.

This may be caused by an exception for some document being imported at that point. I'll know after today's run, and if so I'll go ahead and close this issue.

Second update: I do believe this happens because of exceptional documents being imported at the time! For anyone else who sees something like this: even if the log looks repetitive and writes appear to be halted or frozen, Typesense is still doing its thing!

kishorenc commented 3 weeks ago

I'm curious to hear what was problematic/exceptional about these particular documents. Were they super large?

kdevan commented 3 weeks ago

Yeah, I'm curious about that too. It could affect some other parts of the system as well! I'm going to do some digging and see if I can figure out which documents were part of that batch. I'll update here if I figure anything out.
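One way to narrow that down is to keep the per-line results the import endpoint returns and log the failures; a minimal sketch, assuming the response text from an import call (like the one sketched earlier) is available as resp_text:

import json

def log_failed_documents(resp_text: str) -> None:
    # The import response has one JSON object per input document,
    # e.g. {"success": true} or {"success": false, "error": "..."}.
    for line_no, line in enumerate(resp_text.splitlines(), start=1):
        if not line.strip():
            continue
        result = json.loads(line)
        if not result.get("success", False):
            print(f"document on line {line_no} failed: {result.get('error')}")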