Support firestore transactions/batches

gresnick commented 2 years ago

https://firebase.google.com/docs/firestore/manage-data/transactions

Without this, cloud functions that share a trigger invariably enter a race condition.

gresnick commented 2 years ago

I would be happy to contribute this with some initial guidance

gmega commented 1 year ago

Indeed. This is a great project as you can get support for application-side schemas in Firestore while getting everything Pydantic has to offer (e.g. unlike other Firebase "ORMs" which have built their own stuff for schema definition), but without support for transactions we simply cannot adopt it.

antont commented 10 months ago

Any new thoughts here? I'm also considering Firedantic for our project, where until now (just a few weeks) we have used a simple home-baked db util that lets us use pydantic with Firebase quite nicely. It lacks a lot, though.

I'm just worried that we might hit a wall somewhere with Firedantic.

I'd guess it's always possible to just use the firebase python sdk client etc. directly, bypassing Firedantic, e.g. for a batch op?

antont commented 10 months ago

> I'd guess it's always possible to just use the firebase python sdk client etc. directly,

Just to answer my own question: yes, it seems trivial to fall back to using the Client directly. I'm using it for more complex queries now, and I guess running batch updates etc. would work too.

lietu commented 10 months ago

It might not be too much work to accept an optional transaction argument in the method parameters, so that you could use @firestore.transactional around firedantic yourself and pass in the transaction? If this seems valuable to you, a PR could be interesting to see.
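A rough sketch of what that could look like, not Firedantic's actual API: `save_subscriber` and `subscribe_atomically` are hypothetical helpers made up for illustration. The only library calls used are `DocumentReference.set`, `Transaction.set`, `Client.transaction` and the `firestore.transactional` decorator from google-cloud-firestore.

```python
from typing import Any, Optional


def save_subscriber(doc_ref: Any, data: dict, transaction: Optional[Any] = None) -> None:
    """Hypothetical helper: write `data` to `doc_ref`.

    If a transaction is passed in, the write is queued on it and only
    applied when the transaction commits; otherwise it is written directly.
    """
    if transaction is not None:
        transaction.set(doc_ref, data)  # Transaction.set(reference, document_data)
    else:
        doc_ref.set(data)  # DocumentReference.set(document_data)


def subscribe_atomically(client: Any, doc_ref: Any, data: dict) -> None:
    """Hypothetical caller: wrap the helper in a Firestore transaction."""
    # Imported lazily so the helper above works without the SDK installed.
    from google.cloud import firestore

    @firestore.transactional
    def _run(transaction):
        save_subscriber(doc_ref, data, transaction=transaction)

    _run(client.transaction())
```

The point of the optional argument is that existing callers keep working unchanged, while anyone who needs atomicity can opt in by passing their own transaction.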

antont commented 10 months ago

> Without this, cloud functions that share a trigger invariably enter a race condition.

What do you actually mean by this, BTW? I guess two functions that get triggered by the same thing, e.g. both listening for documents created in the same collection. I haven't written such functions yet, I just have a single handler per event, but I guess that can happen easily.

lietu commented 10 months ago

Say you have cloud functions handling people submitting a form to add you to a newsletter list.

The cloud function both 1) adds you to a collection of newsletter subscribers and 2) updates statistics on subscribers per region, by extracting the list of subscribers, and counting their totals per region based on e.g. the email address domain, then saving the numbers to a collection containing the statistics

Now if your database can't perform these two actions as one atomic operation, there's a decent chance that some day (how rarely depends heavily on the popularity of your service) two people will add themselves to the newsletter list at very nearly the same time.

Now your two cloud functions will spin up, not knowing about each other, and not synchronizing their work, both will

1) Add the user to the collection of newsletter subscribers
2) Extract the data
3) Calculate updated statistics
4) Store statistics

Now if we name these two users A and B, their requests might be processed in linear infinitely divisible time in this order:

A1
B1
A2
A3
A4
B2
B3
B4

.. so both entries were added to the list first, then they both calculated the statistics and updated the data - no problem.

But if the order instead is:

A1
A2
B1
A3
B2
B3
B4
A4

The end result will be .. wrong. A2 read the list before B1 added user B to it. Request B had the right numbers and saved them in B4, but A4 then overwrote them with stale data. This is a race condition, which happens due to the inherent unpredictability of simultaneous actions, and can be made a bit more interesting by the unpredictability of the speed at which they end up being executed.
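To make the two orderings concrete, here is a small in-memory simulation. It is only a sketch: the `run` function, the example email addresses and the plain list/dict standing in for the Firestore collections are all made up for illustration.

```python
from collections import Counter

EMAILS = {"A": "a@example.com", "B": "b@example.org"}


def run(schedule):
    """Replay steps like "A1" or "B3" in order and return the stored statistics."""
    subscribers = []  # stand-in for the subscribers collection
    stats = {}        # stand-in for the statistics collection
    extracted = {}    # per-request copy of the data read in step 2
    computed = {}     # per-request statistics calculated in step 3

    for step in schedule:
        req, num = step[0], int(step[1])
        if num == 1:    # 1) add the user to the subscribers
            subscribers.append(EMAILS[req])
        elif num == 2:  # 2) extract the data
            extracted[req] = list(subscribers)
        elif num == 3:  # 3) calculate statistics per email domain
            computed[req] = dict(Counter(e.split("@")[1] for e in extracted[req]))
        elif num == 4:  # 4) store statistics, overwriting whatever is there
            stats = computed[req]
    return stats


good = run(["A1", "B1", "A2", "A3", "A4", "B2", "B3", "B4"])
bad = run(["A1", "A2", "B1", "A3", "B2", "B3", "B4", "A4"])
print(good)  # {'example.com': 1, 'example.org': 1}
print(bad)   # {'example.com': 1}  (user B is missing from the statistics)
```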

How you'd work around this is either 1) transactions, or 2) locks

Locks:

Request A comes in, it acquires an exclusive lock to the database
A1
A2
Request B comes in, it asks for the lock, but fails to get it and either errors, or for this example waits for it
A3
A4
Request A completes, and releases the lock
Request B acquires the lock
B1
B2
B3
B4
Request B releases the lock

Final result is predictable and good.
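The lock variant can be sketched with two threads and a `threading.Lock`. This is an assumption-heavy toy: a real setup with separate cloud function instances would need a distributed lock rather than an in-process one, and `handle_signup` and the in-memory "collections" are invented for the example.

```python
import threading
from collections import Counter

subscribers = []            # stand-in for the subscribers collection
stats = {}                  # stand-in for the statistics collection
db_lock = threading.Lock()  # the exclusive lock both requests compete for


def handle_signup(email):
    # Whoever gets the lock first runs all four steps; the other request waits.
    with db_lock:
        subscribers.append(email)                                    # step 1
        extracted = list(subscribers)                                # step 2
        counts = Counter(e.split("@")[1] for e in extracted)         # step 3
        stats.clear()                                                # step 4
        stats.update(counts)


threads = [
    threading.Thread(target=handle_signup, args=(email,))
    for email in ("a@example.com", "b@example.org")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stats == {"example.com": 1, "example.org": 1})  # True, whichever request ran first
```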

Transactions are a bit more like:

Request A comes in, and starts a transaction, and inside the transaction performs these actions
A1
A2
Request B comes in, and starts a transaction ..
B1
A3
B2
A4
B3
B4 - the database errors and says you're trying to update something that has changed state since you started your transaction, the request to make the change will be ignored, your transaction logic restarts.
B1
B2
B3
B4

This might not be exactly faithful to how it works out in practice, but this is roughly what race conditions are in general, and how these 2 different methods of solving the problem of race conditions work.
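The "error and restart" behaviour can be sketched as a version check plus a retry loop. This is a toy model of the general idea, not how Firestore implements transactions; `Store`, `Conflict`, `signup` and the `between_read_and_commit` hook are all hypothetical names used to force the interleaving deterministically.

```python
from collections import Counter


class Conflict(Exception):
    """Raised when the data changed after the transaction read it."""


class Store:
    def __init__(self):
        self.subscribers = []
        self.stats = {}
        self.version = 0  # bumped on every successful commit

    def read(self):
        return self.version, list(self.subscribers)

    def commit(self, read_version, email, new_stats):
        if read_version != self.version:
            raise Conflict  # someone else committed since we read
        self.subscribers.append(email)
        self.stats = new_stats
        self.version += 1


def signup(store, email, between_read_and_commit=None, max_attempts=5):
    """Run the signup logic, restarting it whenever the commit conflicts."""
    for attempt in range(1, max_attempts + 1):
        version, subs = store.read()
        counts = dict(Counter(e.split("@")[1] for e in subs + [email]))
        if attempt == 1 and between_read_and_commit:
            between_read_and_commit()  # simulate another request sneaking in
        try:
            store.commit(version, email, counts)
            return attempt
        except Conflict:
            continue  # restart the transaction logic with fresh data
    raise RuntimeError("too many conflicts")


store = Store()
# Request A commits while request B is between its read and its commit.
attempts_b = signup(
    store, "b@example.org",
    between_read_and_commit=lambda: signup(store, "a@example.com"),
)
print(attempts_b)   # 2
print(store.stats)  # {'example.com': 1, 'example.org': 1}
```

Nothing is written until the commit succeeds, so the failed first attempt of request B leaves no trace and the retry sees user A's data.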

lietu commented 10 months ago

Also to add, locks are generally speaking a simpler thing to implement and comprehend, but come with their own scalability issues, which is partially why transactions are often preferred.

antont commented 10 months ago

> Say you have cloud functions handling people submitting a form to add you to a newsletter list.

Right-o, thanks for the rautalanka (spelling it out step by step). I think we currently avoid this by having such statistics-like things triggered by scheduled cloud functions, so that only one task runs at a time for the whole service. Functions triggered by user activity only touch their own data. I'll check our ops with this in mind anyway, and keep an eye on it for later.

I may also have some time to add support for this, even before we need it, just to be prepared once the need hits. I'm curious whether @gresnick or @gmega have ideas about how it would look, or if you write something I can at least test it etc.

ioxiocom / firedantic

Support firestore transactions/batches #36