typesense / typesense

Open Source alternative to Algolia + Pinecone and an Easier-to-Use alternative to ElasticSearch ⚡ 🔍 ✨ Fast, typo tolerant, in-memory fuzzy Search Engine for building delightful search experiences
https://typesense.org
GNU General Public License v3.0
21.06k stars 651 forks

OpenAI downtime renders an entire database inoperable. #1314

Closed jsalts closed 2 months ago

jsalts commented 1 year ago

Description

OpenAI downtime can render a database inoperable. I haven't found any logs confirming this, but the OpenAI API is currently down and the very small collection I have that uses an OpenAI API key is the only one not loading. I have no idea how I could even delete that collection to recover the rest of the collections, since the database is in a non-ready state and API calls are being refused.

Steps to reproduce

  1. Break OpenAI ??
  2. Start Typesense

Expected Behavior

  1. Non-OpenAI-dependent collections should be queryable and fully functional regardless of the status of OpenAI.
  2. OpenAI-dependent collections should be read-only and should not affect the overall status of the database. Ideally, queries against these collections should still work, but inserts/updates should fail.
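The degradation policy described above could be sketched as follows. This is a hypothetical illustration only, not Typesense's actual loader; all names (`Collection`, `init_remote_model`, `load_collections`) are invented for this sketch:

```python
# Hypothetical sketch of the expected behavior: a failed remote-model
# initialization degrades only that collection to read-only instead of
# blocking overall server readiness. Names do not come from the
# Typesense codebase.

class Collection:
    def __init__(self, name, uses_remote_embeddings=False):
        self.name = name
        self.uses_remote_embeddings = uses_remote_embeddings
        self.read_only = False
        self.loaded = False

def init_remote_model(collection):
    """Stand-in for validating a remote model such as
    openai/text-embedding-ada-002 while the provider is down."""
    raise ConnectionError("embedding provider unreachable")

def load_collections(collections):
    for c in collections:
        if c.uses_remote_embeddings:
            try:
                init_remote_model(c)
            except ConnectionError:
                c.read_only = True  # degrade this collection, don't abort startup
        c.loaded = True
    return True  # server reaches the ready state regardless

collections = [
    Collection("games_metadata"),
    Collection("strings"),
    Collection("embeddings", uses_remote_embeddings=True),
]
ready = load_collections(collections)
```

Under this policy the two plain collections stay fully functional and only the embeddings-backed one loses writes.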

Actual Behavior

  1. If OpenAI is down, Typesense fully indexes the non-OpenAI-dependent collections but never enters a "ready state."
  2. The Typesense API is fully inaccessible (e.g. lagging or not ready).

Metadata

Typesense Version: 0.25.2.rc6

OS: Windows / WSL / Ubuntu


jasonbosco commented 1 year ago

Hmm, we did run into this scenario in earlier RC builds of 0.25.0 and we addressed several issues related to OpenAI availability... Maybe we missed a spot.

Could you share all the logs from the time when you restarted the Typesense process?

jsalts commented 1 year ago

This is all I saw. It gets stuck in the 'Running GC for aborted requests' loop for a few hours, then restarts overnight by itself. It finally worked when OpenAI came back online. I suppose there's also no direct evidence OpenAI is involved, other than the fact that the one collection that wasn't loading is based on OpenAI embeddings.


```
2023-10-20T01:18:07.408771791Z I20231020 01:18:07.408672     1 typesense_server_utils.cpp:331] Starting Typesense 0.25.2.rc6
2023-10-20T01:18:07.408793921Z I20231020 01:18:07.408702     1 typesense_server_utils.cpp:334] Typesense is using jemalloc.
2023-10-20T01:18:07.409318756Z I20231020 01:18:07.409235     1 typesense_server_utils.cpp:384] Thread pool size: 192
2023-10-20T01:18:07.433301422Z I20231020 01:18:07.433161     1 store.h:64] Initializing DB by opening state dir: /data/db
2023-10-20T01:18:07.602221315Z I20231020 01:18:07.602072     1 store.h:64] Initializing DB by opening state dir: /data/meta
2023-10-20T01:18:07.671296627Z I20231020 01:18:07.671178     1 ratelimit_manager.cpp:546] Loaded 0 rate limit rules.
2023-10-20T01:18:07.671338206Z I20231020 01:18:07.671205     1 ratelimit_manager.cpp:547] Loaded 0 rate limit bans.
2023-10-20T01:18:07.672011956Z I20231020 01:18:07.671921     1 typesense_server_utils.cpp:495] Starting API service...
2023-10-20T01:18:07.672160668Z I20231020 01:18:07.672037   648 batched_indexer.cpp:124] Starting batch indexer with 192 threads.
2023-10-20T01:18:07.672190968Z I20231020 01:18:07.672041   647 typesense_server_utils.cpp:232] Since no --nodes argument is provided, starting a single node Typesense cluster.
2023-10-20T01:18:07.672194488Z I20231020 01:18:07.672101     1 http_server.cpp:178] Typesense has started listening on port 8108
2023-10-20T01:18:07.681730219Z I20231020 01:18:07.681609   647 server.cpp:1107] Server[braft::RaftStatImpl+braft::FileServiceImpl+braft::RaftServiceImpl+braft::CliServiceImpl] is serving on port=8107.
2023-10-20T01:18:07.681925971Z I20231020 01:18:07.681646   647 server.cpp:1110] Check out http://4cbdfcdb1cae:8107 in web browser.
2023-10-20T01:18:07.681968810Z I20231020 01:18:07.681914   647 raft_server.cpp:68] Nodes configuration: 172.18.0.2:8107:8108
2023-10-20T01:18:07.684540480Z I20231020 01:18:07.684453   648 batched_indexer.cpp:129] BatchedIndexer skip_index: -9999
2023-10-20T01:18:07.686767047Z I20231020 01:18:07.686689   647 log.cpp:690] Use murmurhash32 as the checksum type of appending entries
2023-10-20T01:18:07.688518889Z I20231020 01:18:07.688452   647 log.cpp:1172] log load_meta /data/state/log/log_meta first_log_index: 190577 time: 1736
2023-10-20T01:18:07.689813276Z I20231020 01:18:07.689745   647 log.cpp:1112] load open segment, path: /data/state/log first_index: 190425
2023-10-20T01:18:07.708946319Z I20231020 01:18:07.708814   666 raft_server.cpp:529] on_snapshot_load
2023-10-20T01:18:07.867927608Z I20231020 01:18:07.867751   666 store.h:299] rm /data/db success
2023-10-20T01:18:08.108788722Z I20231020 01:18:08.108649   666 store.h:309] copy snapshot /data/state/snapshot/snapshot_00000000000000190577/db_snapshot to /data/db success
2023-10-20T01:18:08.109511794Z I20231020 01:18:08.109411   666 store.h:64] Initializing DB by opening state dir: /data/db
2023-10-20T01:18:08.233064712Z I20231020 01:18:08.232915   666 store.h:323] DB open success!
2023-10-20T01:18:08.233112962Z I20231020 01:18:08.232950   666 raft_server.cpp:508] Loading collections from disk...
2023-10-20T01:18:08.233116722Z I20231020 01:18:08.232960   666 collection_manager.cpp:187] CollectionManager::load()
2023-10-20T01:18:08.235454608Z I20231020 01:18:08.235312   666 auth_manager.cpp:34] Indexing 0 API key(s) found on disk.
2023-10-20T01:18:08.235510218Z I20231020 01:18:08.235356   666 collection_manager.cpp:207] Loading upto 96 collections in parallel, 1000 documents at a time.
2023-10-20T01:18:08.235515368Z I20231020 01:18:08.235385   666 collection_manager.cpp:216] Found 3 collection(s) on disk.
2023-10-20T01:18:08.240631736Z I20231020 01:18:08.240458   884 collection_manager.cpp:137] Found collection strings with 4 memory shards.
2023-10-20T01:18:08.240677986Z I20231020 01:18:08.240481   883 collection_manager.cpp:137] Found collection games_metadata with 4 memory shards.
2023-10-20T01:18:08.240692826Z I20231020 01:18:08.240471   885 text_embedder_manager.cpp:13] Validating and initializing remote model: openai/text-embedding-ada-002
2023-10-20T01:18:08.240695176Z E20231020 01:18:08.240545   885 raft_server.cpp:973] Could not get leader url as node is not initialized!
2023-10-20T01:18:08.241996510Z I20231020 01:18:08.241889   884 collection_manager.cpp:1341] Loading collection strings
2023-10-20T01:18:08.244446857Z I20231020 01:18:08.244328   883 collection_manager.cpp:1341] Loading collection games_metadata
2023-10-20T01:18:08.660588844Z E20231020 01:18:08.660450   885 raft_server.cpp:973] Could not get leader url as node is not initialized!
2023-10-20T01:18:09.223153134Z E20231020 01:18:09.222995   885 http_proxy.cpp:75] Proxy call failed, status_code: 502, timeout_ms:  60000, try: 1, num_try: 2
2023-10-20T01:18:09.520472171Z E20231020 01:18:09.520327   885 http_proxy.cpp:75] Proxy call failed, status_code: 502, timeout_ms:  60000, try: 2, num_try: 2
2023-10-20T01:19:00.775669743Z I20231020 01:19:00.775506   884 collection_manager.cpp:1448] Loaded 32768 documents from strings so far.
2023-10-20T01:19:02.481384856Z I20231020 01:19:02.481215   883 collection_manager.cpp:1459] Indexed 8871/8871 documents into collection games_metadata
2023-10-20T01:19:02.481434895Z I20231020 01:19:02.481279   883 collection_manager.cpp:255] Loaded 1 collection(s) so far
2023-10-20T01:19:08.693822979Z I20231020 01:19:08.693536   648 batched_indexer.cpp:285] Running GC for aborted requests, req map size: 0
2023-10-20T01:19:52.651272947Z I20231020 01:19:52.651120   884 collection_manager.cpp:1448] Loaded 65536 documents from strings so far.
2023-10-20T01:20:08.353378793Z I20231020 01:20:08.353202   884 collection_manager.cpp:1459] Indexed 74387/74387 documents into collection strings
2023-10-20T01:20:08.353427313Z I20231020 01:20:08.353276   884 collection_manager.cpp:255] Loaded 2 collection(s) so far
2023-10-20T01:20:09.699325325Z I20231020 01:20:09.699157   648 batched_indexer.cpp:285] Running GC for aborted requests, req map size: 0
2023-10-20T01:21:10.705222422Z I20231020 01:21:10.705026   648 batched_indexer.cpp:285] Running GC for aborted requests, req map size: 0
2023-10-20T01:22:11.710983556Z I20231020 01:22:11.710799   648 batched_indexer.cpp:285] Running GC for aborted requests, req map size: 0
2023-10-20T01:23:12.717189999Z I20231020 01:23:12.717017   648 batched_indexer.cpp:285] Running GC for aborted requests, req map size: 0
2023-10-20T01:24:13.723272485Z I20231020 01:24:13.723096   648 batched_indexer.cpp:285] Running GC for aborted requests, req map size: 0
2023-10-20T01:25:14.729501129Z I20231020 01:25:14.729338   648 batched_indexer.cpp:285] Running GC for aborted requests, req map size: 0
2023-10-20T01:26:15.736289880Z I20231020 01:26:15.736106   648 batched_indexer.cpp:285] Running GC for aborted requests, req map size: 0
```

rwojsznis commented 4 months ago

> ideally, queries against these collections should still work but inserts/updates should fail

💯

Not sure if this is the same issue, but you can reproduce the search-related problem on 26.0 by setting a custom url in model_config, indexing some data through an OpenAI-compatible API/wrapper, and then killing that API and running a search that uses query_by: <your embedding column>:
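For reference, the collection schema in this repro looks roughly like the following. This is a sketch: the collection name, field names, API key, and wrapper URL are placeholders, and the exact model_config keys may vary between Typesense versions.

```json
{
  "name": "docs",
  "fields": [
    {"name": "title", "type": "string"},
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": ["title"],
        "model_config": {
          "model_name": "openai/text-embedding-ada-002",
          "api_key": "sk-placeholder",
          "url": "http://host.docker.internal:8082"
        }
      }
    }
  ]
}
```

Killing whatever serves the custom url and then issuing a hybrid search against `embedding` triggers the hang described below.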

Logs from Docker Compose-based experiments (with remote_embedding_timeout_ms set to 10):

```
typesense-1  | E20240707 09:49:03.589577   134 http_client.cpp:194] CURL timeout. Time taken: 0.031921, method: POST, url: http://host.docker.internal:8082/v1/embeddings
typesense-1  | E20240707 09:49:03.589903   134 http_proxy.cpp:85] Proxy call failed, status_code: 408, timeout_ms:  10, try: 1, num_try: 2
typesense-1  | E20240707 09:49:03.597003   134 http_client.cpp:197] CURL failed. Code: 7, strerror: Couldn't connect to server, method: POST, url: http://host.docker.internal:8082/v1/embeddings
typesense-1  | E20240707 09:49:03.597097   134 http_proxy.cpp:85] Proxy call failed, status_code: 500, timeout_ms:  10, try: 2, num_try: 2
```

Meanwhile, a curl request to Typesense's multi_search hangs forever in another terminal tab. I would expect it to fail, or ideally to fall back to text search when performing a hybrid search.
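Until there's a server-side fix, that fallback can be approximated on the client. A minimal sketch, assuming the caller supplies the two search callables (`vector_search` and `text_search` are placeholders, not Typesense client methods):

```python
# Hypothetical client-side fallback: try the hybrid/vector query first,
# and degrade to a keyword-only search if the embedding path fails or
# times out. The callables are stand-ins for real Typesense requests.

def search_with_fallback(vector_search, text_search):
    try:
        return vector_search()   # e.g. multi_search with query_by: embedding
    except (TimeoutError, ConnectionError):
        return text_search()     # degrade to plain keyword search

# Usage with stand-in callables simulating a dead embedding provider:
def failing_vector_search():
    raise TimeoutError("remote embedding call timed out")

def keyword_search():
    return {"hits": [{"document": {"title": "fallback result"}}]}

results = search_with_fallback(failing_vector_search, keyword_search)
```

Note this only helps once the request actually fails; it does nothing about the hang itself, which is why a server-side timeout still matters.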

Looking at the logs alone, I suspect there is some problem with distinguishing a connection timeout from some other higher-level networking problem? 🤔

kishorenc commented 3 months ago

This is fixed in `27.0.rc26`.