
OOM for collection export #1067

Closed: Bug-Reaper closed this issue 11 months ago

Bug-Reaper commented 1 year ago

Description

We can't export a large collection: the export maxes out memory usage, hangs, and then crashes the server. Our database is primarily one large collection (approximately 4.5 million documents); the typesense-server binary sits at around 26 GiB of memory usage while idle, and total system memory is 64 GiB.

We hoped that upgrading from 0.24.0 to 0.24.1 would resolve this, per the patch note:

[New in v0.24.1] Improved memory footprint of export API.

However, in practice the export failed the same way on both versions. When we previously had only ~3 million records, we were still able to export without issue.

Steps to reproduce

  1. Export a very large collection via the collections/<CollectionName>/documents/export endpoint.
  2. Memory usage maxes out and typesense-server crashes.

Expected Behavior

  1. Make the export request.
  2. Memory usage increases somewhat.
  3. As the request is processed and data is sent, already-sent data is periodically freed from memory.
  4. The exported JSONL file is successfully downloaded.

We'd also accept a slower export if it meant larger collections could be exported gracefully. Ideally the database would not crash when an OOM occurs, but we understand this may not be possible.

Actual Behavior

  1. wget 'http://<ServerIP>:80/collections/<LargeCollection>/documents/export?x-typesense-api-key=...'
  2. Memory usage increased linearly.
  3. Memory usage maxed out the full 64 GiB system capacity.
  4. In total, wget received around 34 GiB of collection data. The observed download rate plummeted from ~250 MB/s to ~200 KB/s once memory maxed out at the end.
  5. The collection download was incomplete, and typesense-server crashed about a minute after memory usage was maxed out.
  6. Downtime of over an hour while typesense-server initializes and loads the database back into memory.
(Screenshots attached.)

Typesense Version: 0.24.1

OS: Arch Linux (amd64 executable)

kishorenc commented 1 year ago

Will investigate and update shortly.

kishorenc commented 1 year ago

@Bug-Reaper

Thank you for reporting this issue. I have a patch. How do you deploy Typesense, Docker or binary? I can share the build.

Bug-Reaper commented 1 year ago

Cool @kishorenc, happy to try it out.

We use the ARM-64 binary executable, e.g.:

curl -O https://dl.typesense.org/releases/0.24.1/typesense-server-0.24.1-linux-amd64.tar.gz

Curious, does the patch include any other updates or just the upgraded export feature?

kishorenc commented 1 year ago

Please use this build.

Curious, does the patch include any other updates or just the upgraded export feature?

It contains other changes from the upcoming 0.25 release as well. It's fully backward compatible with 0.24.1, and many people are already running these RC builds.

kishorenc commented 1 year ago

You might, however, still want to test it on staging or locally before proceeding, just to be safe. Also, once 0.25 loads the data and writes something to the data directory, 0.24.1 won't be able to read that data directory again. When we do the GA release we will also release a 0.24.2 that can be used for rollbacks if needed.

Bug-Reaper commented 1 year ago

Sweet! Hoping to get to test it sometime over the weekend, when downtime is less disruptive.

We really only have the prod instance, so I think I'm going to increase the memory on the machine and try to get a clean export before we proceed with an irreversible upgrade. In general, though, your upgrades are high quality and easy transitions. We also saw speed improvements to many queries from 0.24.0 => 0.24.1, so thanks for all the excellent improvements.


I have fairly minimal experience with snapshots, though I understand that API differs from export. Do you think it'd be possible to generate a snapshot with 0.24.1 without the OOM errors we see when we attempt an export? If so, maybe I'll take a snapshot and upgrade directly to the pre-release version without increasing system memory above 64 GiB. If not (or if you're unsure), I'll probably increase memory first to be safe and get a clean, recent export of the data.

kishorenc commented 1 year ago

If the goal is backups, then the snapshot API is certainly the best way to back up Typesense. It's also very fast and involves no memory overhead. Export is meant for smaller collections. Another way of backing up is simply stopping the server and copying the data directory. The advantage of the snapshot API, though, is that it allows you to back up without stopping Typesense.
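For reference, a minimal sketch of triggering a snapshot over HTTP from Node.js (this assumes the server listens on localhost:8108, an admin key is in TYPESENSE_API_KEY, and the snapshot path is just a placeholder):

// Trigger an on-disk snapshot of the data directory without stopping the server.
// (Node 18+ for the global fetch; run inside an async context.)
const res = await fetch(
  'http://localhost:8108/operations/snapshot?snapshot_path=/tmp/typesense-snapshot',
  { method: 'POST', headers: { 'X-TYPESENSE-API-KEY': process.env.TYPESENSE_API_KEY } }
);
console.log(await res.json()); // { success: true } on success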

Bug-Reaper commented 1 year ago

Neat! That's great to know. Since I'll have to reboot either way (to increase memory and/or upgrade to the pre-release 0.25.0 version), I'll just zip the data directory as a backup <3

Bug-Reaper commented 1 year ago

Please use this build.

Curious, does the patch include any other updates or just the upgraded export feature?

It contains other changes from the upcoming 0.25 release as well. It's fully backward compatible with 0.24.1, and many people are already running these RC builds.

Hey there @kishorenc, sorry, I misspoke. We use the AMD-64 version, not ARM. I think I was able to work out the correct link: https://dl.typesense.org/releases/0.25.0.rc38/typesense-server-0.25.0.rc38-linux-amd64.tar.gz.

Hope that's okay.

We backed up the DB via a tar of its --data-dir after it had been shut down for the version upgrade. We wanted a backup just in case we have to roll back to 0.24.1 or suffer other data issues.

For reference, we also need a JSONL export of our large collection, as we use those in pipelines for batch updates with complex logic and/or miscellaneous computation.
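For illustration, a rough sketch of how such an export could be streamed straight to disk in Node so the downloading side stays flat on memory (Node 18+ assumed; the host, collection name, and output path are placeholders):

// Stream the export endpoint to a file instead of buffering the whole response.
const fs = require('fs');
const { Readable } = require('stream');
const { pipeline } = require('stream/promises');

async function downloadExport() {
  const res = await fetch(
    'http://localhost:8108/collections/LargeCollection/documents/export',
    { headers: { 'X-TYPESENSE-API-KEY': process.env.TYPESENSE_API_KEY } }
  );
  // res.body is a web ReadableStream in Node 18+; convert it for the fs pipeline.
  await pipeline(Readable.fromWeb(res.body), fs.createWriteStream('./export.jsonl'));
}

downloadExport();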

I plan to report back in ~1.5 hours, when the DB is up again, on whether we're able to export with this patch 👌

Bug-Reaper commented 1 year ago

@kishorenc Good news!

✅ The patch does allow us to export our very large collection. We exported all 46 GiB at 60-80 MB/s with hardly any memory increase 🎉
✅ The upgrade to 0.25.0.rc38 succeeded. It seems compatible with most of our usage as well.


Less good news

On the new 0.25.0.rc38, it appears our vector searches are slower than on 0.24.1, and significantly so: roughly 400 ms before versus 12,000 ms+ now. We need to check whether our requests need some config adjustment to get the speed back up. Any thoughts?

EDIT: For reference, our request looks like const vector_query = resnet18:(${stringVec}, k:${resultsCount}) and we don't appear to set any other parameters to configure the vector_query. The request results also appear to be different than before.
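For illustration, this is roughly the shape of the call through multi_search with the typesense JS client (the collection name and vector values below are placeholders, not our real production request):

// Send a vector query via multi_search, since a 512-dim vector is too long for a GET query string.
const Typesense = require('typesense');

const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
  apiKey: process.env.TYPESENSE_API_KEY,
});

async function vectorSearch(queryVec, resultsCount) {
  const vector_query = `resnet18:([${queryVec.join(',')}], k:${resultsCount})`;
  return client.multiSearch.perform({
    searches: [
      {
        collection: 'images', // hypothetical collection name
        q: '*',               // pure vector search
        vector_query,
      },
    ],
  });
}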

kishorenc commented 1 year ago

@Bug-Reaper

That's odd. Would you perhaps be able to extract 1M vectors randomly (just that field), measure performance against the old and new versions, and then share that dataset with me? We need to check where the regression happened. You can share it with contact@typesense.org.

Bug-Reaper commented 1 year ago

@kishorenc

It's difficult to perform tests with our current setup of one machine and roughly 1.5 hours of DB init between version switches, but I'm happy to send the full data-dir snapshot I took before the upgrade if you want to play around with it.

In the meantime I'm going to check whether there are tweaks to the request itself that would get things back to a more normal speed, and I'll report any findings here.

Bug-Reaper commented 1 year ago

UPDATE:

While debugging the increased latency, we also found that the page prop was being ignored on our vector queries: the same page of results was returned even though the page number was consistently incremented. Testing was cut short because the server became non-responsive:

Log directory is configured as: ./logs
E20230627 11:23:12.680370 1019828 raft_server.cpp:649] 539 lagging entries > healthy write lag of 500
E20230627 11:23:12.680444 1019864 raft_server.h:62] Peer refresh failed, error: Doing another configuration change
E20230627 12:36:57.224555 1019716 auth_manager.cpp:263] Scoped API keys can only be used for searches.
E20230627 12:37:11.733547 1019716 auth_manager.cpp:263] Scoped API keys can only be used for searches.
E20230627 12:37:52.563654 1019716 auth_manager.cpp:263] Scoped API keys can only be used for searches.
E20230627 12:38:21.172952 1019716 auth_manager.cpp:263] Scoped API keys can only be used for searches.
E20230627 15:47:11.202306 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:11.951409 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:12.195371 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:12.277573 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:14.916101 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:16.229625 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:16.513746 1019834 log.cpp:1203] Fail to close old open_segment or create new open_segment path: ./dat/state/log: Too many open files [24]
E20230627 15:47:16.513774 1019834 log_manager.cpp:464] Fail to append_entries, nappent=0, to_append=1
E20230627 15:47:16.513880 1019834 node.cpp:1980] node default_group:10.128.0.2:8107:8108 append [15772320, 15772320] failed
E20230627 15:47:16.513931 1019834 raft_server.h:279] Met peering error {type=LogError, error_code=5, error_text=`Fail to append entries'}
E20230627 15:47:16.637192 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:18.052093 1019716 raft_server.cpp:255] Rejecting write: could not find a leader.
E20230627 15:47:18.054280 1019716 raft_server.cpp:255] Rejecting write: could not find a leader.
E20230627 15:47:18.058246 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:18.095767 1019716 raft_server.cpp:255] Rejecting write: could not find a leader.
E20230627 15:47:18.095994 1019716 raft_server.cpp:255] Rejecting write: could not find a leader.
E20230627 15:47:31.542179 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:31.653540 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:31.696977 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:32.026829 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:32.048123 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:32.093588 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:32.111076 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:32.126507 1019848 collection.cpp:521] Update to disk failed. Will restore old document
E20230627 15:47:32.132615 1019848 collection.cpp:521] Update to disk failed. Will restore old document
... Continues on like this, with some write failures too. The disk has plenty of available space.

We'll probably revert to 0.24.1.

kishorenc commented 1 year ago

I see:

Too many open files

You need to increase your ulimit -n to prevent this error. This is probably not related to this version (maybe you have been hovering on the edge of the limit).

Would you be able to share maybe 10M records and a sample query that represents the query pattern that became slow? I don't need the full dataset.

Bug-Reaper commented 1 year ago

Oh yeah, you're absolutely right, thanks for pointing me in the right direction. I was able to modify the typesense-server process's nofile limit via prlimit, but even with the increased limit it couldn't recover from the snapshot failure that occurred after the failed disk updates. Eventually I had to hard kill -9 the typesense-server process.

E20230627 16:23:16.168610 1019849 raft_server.cpp:990] Timed snapshot failed, error: FSMCaller is in bad status=`Fail to append entries', code: 22
E20230627 17:23:16.886297 1019864 raft_server.cpp:990] Timed snapshot failed, error: FSMCaller is in bad status=`Fail to append entries', code: 22

I've increased the system default ulimit and am going to try running this version again.

Bug-Reaper commented 1 year ago

Well, I figured out the pagination thing. We use the multi-search endpoint (as 512-dim vectors are too large to pass in a regular query string). We had previously set the page property adjacent to the searches prop:

const multiSearch = {
    page: <PageNum>,
    searches: [
        {
            ...searchParams
        }
    ]
};

It works fine if we move page inside searchParams directly.
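For comparison, a sketch of the shape that does work (same placeholder names as above):

const multiSearch = {
    searches: [
        {
            ...searchParams,
            page: <PageNum>   // page is respected when set per-search
        }
    ]
};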


Unfortunately, I have no idea why our vector searches have ballooned in latency from under 1 second to 12-15 seconds. I tried a few things to see if I could figure out a workaround.

We'll likely revert to 0.24.1 to restore optimal performance. @kishorenc, if you send an email to llama@pulsr.ai I can drop you a link to our latest 0.24.1 data-dir backup as a tar.gz sometime tomorrow. I appreciate your assistance, advice, and investigation.

Bug-Reaper commented 1 year ago

We also discovered an interesting nuance that changed after we updated from 0.24.1 to 0.25.0.rc38.

Take this example request, where we specify the vector field resnet18 in the include_fields prop of our request:

(Screenshot of the request attached.)

In the response, the documents do not contain the resnet18 vector field. Example doc response:

{
  "document": {
    "aiLabels": [
      "cat",
      "art paint",
      "vertebrate",
      "felidae",
      "paint",
      "carnivore",
      "mammal",
      "creative arts",
      "art",
      "small to medium-sized cats"
    ],
    "asset": "tezos/KT1JTwRv4RNs1VK4AvacfNToyHotVWsRVPzT/2.png",
    "block": "2663354",
    "chain": "tezos",
    "collection": "Psychedelic Cats",
    "contract": "KT1JTwRv4RNs1VK4AvacfNToyHotVWsRVPzT",
    "created": 1661753744,
    "ext": "png",
    "id": "tezos~KT1JTwRv4RNs1VK4AvacfNToyHotVWsRVPzT~2",
    "isNsfw": false,
    "price": 2.37342,
    "size": 4454161,
    "title": "Omeng #3",
    "tokenId": "2",
    "type": "image"
  },
  "highlight": {},
  "highlights": [],
  "vector_distance": 0.22438204288482666
}

Previously, in 0.24.1, the vector field resnet18 would have been included in the response docs; however, in 0.25.0.rc38 it is mysteriously absent from the response docs.

We know the returned documents have this vector field populated, because this is a vector_query request on the resnet18 vector field.

kishorenc commented 1 year ago

@Bug-Reaper

Thank you for the investigation.

It works fine if we move page inside searchParams directly.

We've never supported search parameters outside the searches array in the multi_search body. I just tested the following on a 0.24.1 instance and it does not respect the per_page parameter mentioned outside: https://gist.github.com/kishorenc/769bd7606041d5c1e219ba87bdc2a7cd

Is it possible for you to provide an exact gist like that which showcases the difference between 0.24.1 and the 0.25 RC? You can just use a Docker container on your local machine to test.

In the response, the documents do not contain the resnet18 vector field. Example doc response:

I tested the following gist on a 0.25 RC build, and once again include_fields returned the vector field properly: https://gist.github.com/kishorenc/865511899735ab23d82b29d8592751c5

Unfortunately, I have no idea why our vector searches have ballooned in latency

I will investigate this and get back to you if I need a sample set from you.

Bug-Reaper commented 1 year ago

Sweet @kishorenc

With regard to the page thing, I'm not too concerned, as it's an easy change and page was clearly never intended to work the way we had set it.


With regard to the missing include_fields on 0.25.0.rc38: we encountered another strange edge case where id is specified in include_fields but is not returned on the result documents. So it seems not to be specific to vector fields; rather, include_fields entries are sometimes ignored.

Take a look at this live example request.

After some more experimentation, it looks like when more than one include field is specified, the last include field is dropped. Here are some simpler examples on our live deployment:

After even more experimentation, it looks like removing the spaces between include fields fixes things, e.g.:
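To illustrate the workaround (hypothetical search params; only include_fields changes):

// On 0.25.0.rc38: with a space after the comma, the last include field is dropped.
const broken = { q: '*', include_fields: 'id, resnet18' };

// Without the space, both fields come back as expected.
const working = { q: '*', include_fields: 'id,resnet18' };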


With regard to the very slow vector queries, sounds good; keep us posted 🙏

kishorenc commented 1 year ago

@Bug-Reaper

After even more experimentation, it looks like removing the spaces between include fields fixes things

Ah, good catch! We will fix that.

With regard to the very slow vector queries, sounds good; keep us posted 🙏

I've emailed you about sharing a sample dataset with us. Details in the email.

Bug-Reaper commented 1 year ago

1️⃣

With regard to the slow vector queries: we've sent an email that contains links to:

(Of note: we added a new facet field named md5 after the upgrade, to consolidate downtime. The field has been completely unused/unpopulated, as we've had other fish to fry. My current assumption is that this new field is unrelated to our post-upgrade performance issues, though this is untested.)


2️⃣

We've discovered another query that has slowed down after our upgrade from 0.24.1 to 0.25.0.rc38. It's a much simpler query, where we just update the value of a boolean property.

On 0.25.0.rc38 it clocks in at around 7,000-9,000 ms. I don't have an exact benchmark from 0.24.1, though I can say with confidence it was under 500 ms.

The query looks like:

tsClient.collections(<CollectionName>).documents(<AnyDocumentID>).update({isHidden:true});

On further testing, I've seen it take up to 45,000 ms or even exceed our 60,000 ms timeout window. It's interesting because I'm using the same few IDs, at a time when the database is under almost no load. It seems possible that setting the value to false is slower, but there's a lot of variance and it's difficult to tell.
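For reference, a minimal sketch of how this can be timed (tsClient is assumed to be a configured typesense-js client; the collection and document IDs are placeholders):

// Inside an async function: measure a single boolean-field update.
const started = Date.now();
await tsClient
  .collections('images')            // hypothetical collection name
  .documents('some-document-id')    // hypothetical document ID
  .update({ isHidden: true });
console.log(`update took ${Date.now() - started} ms`);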


3️⃣

Final thing (lower priority / minor): a small number of the documents in the JSONL file generated via the 0.25.0.rc38 export endpoint seem to be corrupted, in that we were unable to parse some lines as valid JSON via a standard Node.js JSON.parse(line).

This is not something we've experienced with the export endpoint on any prior version. It's also a relatively small number of documents affected (44 out of roughly 4.5 million).
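For reference, a minimal sketch of the kind of validation pass that surfaces these lines (the file path is a placeholder):

// Count lines in an exported JSONL file that fail JSON.parse (Node.js).
const fs = require('fs');
const readline = require('readline');

async function countBadLines(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity,
  });
  let bad = 0;
  for await (const line of rl) {
    if (!line.trim()) continue; // skip blank lines
    try {
      JSON.parse(line);
    } catch {
      bad += 1;
    }
  }
  return bad;
}

countBadLines('./export.jsonl').then((n) => console.log(`${n} unparseable lines`));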

In the email correspondence, I've included:


🙏

Hope this assists with debugging and with a banger future 0.25.0 release. Keep us posted, @kishorenc.

In the meantime, we plan to revert to 0.24.1 and increase the system memory to 128 GiB as a temporary workaround for our export needs, and to restore normal performance.

Bug-Reaper commented 1 year ago

Hey @kishorenc, wanted to share that we were able to revert to 0.24.1 successfully 💯

✅ We're able to export now on that version with 128 GiB of memory, and can confirm that the exported JSONL file contains no unparseable lines on the 0.24.1 export.
✅ The update call for a single boolean prop that took anywhere between 7,000-43,000 ms now takes around 1,000-2,000 ms (so not quite as fast as I remembered, but much better).

We appreciate the Typesense team and all the great stuff you guys do! Look forward to the official 0.25.0 release and best of luck!

mhasibi commented 1 year ago

Hi @kishorenc and @jasonbosco,

I have been trying to export the documents in a collection using the export function, as below:

indexed_pages = client.collections['pages'].documents.export()

However, the Typesense server (I am using Docker) keeps going down. The error log is as follows:

(Screenshots of the error logs attached.)

And these are the logs from my Typesense Docker container, from startup until it goes down:

(Screenshots of the container logs attached.)

And finally, this is a screenshot of resource usage by the Typesense Docker container during the process:

(Screenshot of resource usage attached.)

Do you think you could help figure out what the problem is and why the server keeps falling over?

P.S. I have used Typesense 0.24.1, 0.25.0.rc44, and 0.25.0.rc45, and the outcome has been the same.

kishorenc commented 1 year ago

@mhasibi

Can you please tell me at what point in the screenshots the export begins?

mhasibi commented 1 year ago

@kishorenc

These are the screenshots of the logs from when the program starts running until the Typesense server goes down:

(Screenshots of the logs attached.)

And these are screenshots of resource usage from when the program starts and runs for a bit. I have marked the starting point with red arrows:

(Screenshot of resource usage attached.)

I'm unable to take more screenshots of the resources, as this is the only thing it shows before the server goes down and everything disappears.

kishorenc commented 11 months ago

This should be comprehensively fixed in 0.25.0. Please reopen if it's not.