Crash during backfilling collection data

owocowyy commented 3 years ago

Description

Hi! I want to add full-text search to my project, which is based on Firebase. Right now I'm trying to get things rolling by backfilling my data into a Typesense collection, using the Firebase Typesense extension from this repo. Everything went smoothly except the backfill itself: the ext-firestore-typesense-search-backfillToTypesenseFromFirestore cloud function fails with the error "Memory limit exceeded". I usually fix that kind of problem by redeploying the function with more resources, but I'm not sure that is the proper solution here. The test collection is around 60MB and contains 53K documents. I was able to successfully export another collection that is a bit smaller (around 30K documents).

Steps to reproduce

Expected Behavior

The ext-firestore-typesense-search-backfillToTypesenseFromFirestore trigger function shouldn't crash.

Actual Behavior

If the collection is big enough, the ext-firestore-typesense-search-backfillToTypesenseFromFirestore trigger function fails with the error "Function invocation was interrupted. Error: memory limit exceeded."

Metadata

Typesense Version: 0.21.0

jasonbosco commented 3 years ago

@owocowyy Thank you for reporting this. I made an incorrect assumption that this method automatically paginates and returns documents in small batches from Firestore:

https://github.com/typesense/firestore-typesense-search/blob/3bbb665ed082ec77858c8968ec586c8ba58114a6/functions/src/backfillToTypesenseFromFirestore.js#L34

It looks like that might not be the case, so ALL documents are being loaded into memory, exhausting it.

Let me see if there's a batch retrieval mechanism...

jasonbosco commented 3 years ago

@owocowyy Skimming through the docs I wasn't able to find a quick way to paginate through all the docs in a collection, without a sort field.

So for now I've increased the function's memory to 4x what it used to be and published it as a new version: https://github.com/typesense/firestore-typesense-search/releases/tag/v0.2.5

owocowyy commented 3 years ago

That might resolve the issue, but I don't think it's the optimal way of doing it. I know that Firestore has its own limitations, and I haven't checked the docs for a better way, so for now I think we can close this issue or mark it as "needs future improvements".

jasonbosco commented 3 years ago

Yeah, I agree this is not an ideal solution.

I'll close this for now, but I'll keep an eye out for any potential alternate solutions.

CaptainCodeman commented 2 years ago

The solution to this is to use a Firestore query with cursors, which lets you iterate over a collection of any size in batches, effectively restarting from the last point each time. https://firebase.google.com/docs/firestore/query-data/query-cursors
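As a rough illustration of the cursor idea, here is a minimal sketch, not the extension's actual code. The names forEachBatch, orderField and handleBatch are hypothetical. It assumes the caller passes a Firestore collection reference and an ordering field; ordering by the document ID (for example admin.firestore.FieldPath.documentId() in the Admin SDK) should avoid needing a dedicated sort field in the documents.

```javascript
// Sketch only: walk a Firestore collection page by page with a query cursor,
// so that at most `batchSize` documents are held in memory at a time.
// `forEachBatch`, `orderField` and `handleBatch` are hypothetical names.
async function forEachBatch(collectionRef, orderField, batchSize, handleBatch) {
  let lastDoc = null; // last document snapshot of the previous page
  let total = 0;

  while (true) {
    let query = collectionRef.orderBy(orderField).limit(batchSize);
    if (lastDoc) {
      // Resume right after the last document we already processed.
      query = query.startAfter(lastDoc);
    }

    const snapshot = await query.get();
    if (snapshot.empty) break;

    // e.g. import this page of documents into Typesense
    await handleBatch(snapshot.docs);
    total += snapshot.docs.length;
    lastDoc = snapshot.docs[snapshot.docs.length - 1];

    // A short page means we have reached the end of the collection.
    if (snapshot.docs.length < batchSize) break;
  }
  return total;
}
```

With the Admin SDK this might be called as forEachBatch(admin.firestore().collection("books"), admin.firestore.FieldPath.documentId(), 500, importBatch), where "books" and importBatch are placeholders. A single invocation is still bounded by the function timeout, so this only addresses the memory problem.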

The typical approach is to schedule a task that runs a function, which itself schedules the next task to continue from where it left off, until the entire collection has been iterated over. https://cloud.google.com/tasks/docs/tutorial-gcf
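A sketch of that chaining pattern, under stated assumptions: each invocation handles a single page and then asks for a follow-up invocation with the new cursor. processOnePage, handleBatch and enqueueNext are hypothetical names. In particular, enqueueNext stands in for whatever actually creates the next task (for example a Cloud Tasks HTTP task carrying the cursor in its payload), and the sketch assumes the query is ordered by document ID so that a plain ID string can serve as the cursor.

```javascript
// Sketch only: one invocation processes one page, then schedules the next.
// `cursorId` is the ID of the last document handled by the previous
// invocation (null on the first run). `enqueueNext` is a hypothetical
// callback that schedules another invocation with the given cursor.
async function processOnePage(collectionRef, orderField, batchSize, cursorId, handleBatch, enqueueNext) {
  let query = collectionRef.orderBy(orderField).limit(batchSize);
  if (cursorId) {
    // Assumes ordering by document ID, so the cursor is a plain ID string.
    query = query.startAfter(cursorId);
  }

  const snapshot = await query.get();
  if (snapshot.empty) return null; // nothing left to do

  await handleBatch(snapshot.docs);

  // A short page means this was the last one; do not schedule more work.
  if (snapshot.docs.length < batchSize) return null;

  const nextCursorId = snapshot.docs[snapshot.docs.length - 1].id;
  await enqueueNext(nextCursorId);
  return nextCursorId;
}
```

Because each invocation only ever holds one page and finishes quickly, neither the memory limit nor the function timeout grows with the size of the collection.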

You can get even fancier by splitting up the key-space of the dataset so you can have multiple parallel processes running. Whether it's worthwhile really depends on how many records you have to process and how quickly you want it to happen. I used this approach with Datastore, which was the forerunner to Firestore, and it would work for tens of millions of records. https://github.com/CaptainCodeman/datastore-mapper
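One way the key-space split could look, purely as a sketch and not taken from datastore-mapper: pick a few boundary strings, turn them into half-open document ID ranges, and give each range to its own worker. splitKeySpace and queryForRange are hypothetical helpers, and the boundaries are an assumption that only balances well if document IDs are spread fairly evenly, as auto-generated IDs tend to be.

```javascript
// Sketch only: turn boundary strings into half-open ID ranges [start, end).
// A null start or end means "unbounded" on that side.
function splitKeySpace(boundaries) {
  // Default sort compares UTF-16 code units; for ASCII IDs this matches
  // the byte order Firestore uses for strings.
  const sorted = [...boundaries].sort();
  const ranges = [];
  let start = null;
  for (const end of sorted) {
    ranges.push({ start, end });
    start = end;
  }
  ranges.push({ start, end: null });
  return ranges;
}

// Build the query for one range. `idField` is assumed to be the document ID
// field path (e.g. admin.firestore.FieldPath.documentId() in the Admin SDK).
function queryForRange(collectionRef, idField, range) {
  let query = collectionRef;
  if (range.start !== null) query = query.where(idField, ">=", range.start);
  if (range.end !== null) query = query.where(idField, "<", range.end);
  return query.orderBy(idField);
}
```

Each worker can then page through its own range with a cursor, independently of the others.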
