googleapis / nodejs-firestore

Node.js client for Google Cloud Firestore: a NoSQL document database built for automatic scaling, high performance, and ease of application development.
https://cloud.google.com/firestore/
Apache License 2.0

Documentation describing transactions as pessimistic is wrong; should be optimistic #785

Closed dgobaud closed 5 years ago

dgobaud commented 5 years ago

On https://googleapis.dev/nodejs/firestore/latest/Transaction.html, in various places such as https://googleapis.dev/nodejs/firestore/latest/Transaction.html#get, the lock is described as pessimistic: "Retrieves multiple documents from Firestore. Holds a pessimistic lock on all returned documents."

This, however, is wrong. The way runTransaction works is that it tries to commit changes at the end of the transaction, and if it is unable to because of conflicts, it retries - that is optimistic. You can have 2 transactions running at the same time operating on the same document - this is not pessimistic, which "means no one else can manipulate that record until the user releases the lock. The downside is that users can be locked out for a very long time, thereby slowing the overall system response and causing frustration." https://en.wikipedia.org/wiki/Lock_(computer_science)

I wish Firestore had pessimistic locks! There are reasons to have pessimistic locks to guarantee only one process is doing something at a time - I've actually had to put some data in PostgreSQL and use its true pessimistic locks because of this https://www.postgresql.org/docs/9.4/explicit-locking.html (e.g. ROW EXCLUSIVE).

It is "Optimistic locking: this allows multiple concurrent users access to the database whilst the system keeps a copy of the initial-read made by each user. When a user wants to update a record, the application determines whether another user has changed the record since it was last read. The application does this by comparing the initial-read held in memory to the database record to verify any changes made to the record. Any discrepancies between the initial-read and the database record violates concurrency rules and hence causes the system to disregard any update request. An error message is generated and the user is asked to start the update process again. It improves database performance by reducing the amount of locking required, thereby reducing the load on the database server. It works efficiently with tables that require limited updates since no users are locked out. However, some updates may fail. The downside is constant update failures due to high volumes of update requests from multiple concurrent users - it can be frustrating for users."

merlinnot commented 5 years ago

Firestore supports both pessimistic and optimistic locks. All client-side SDKs use optimistic locks, all server-side SDKs use pessimistic locks.

dgobaud commented 5 years ago

The above is how the Node server SDK is working, so the docs should say optimistic; as written they are incorrect.

schmidt-sebastian commented 5 years ago

If you are reading a document using the runTransaction() API via a Server SDK, the backend will prevent other clients from modifying the document until the transaction is committed. This lock can be held for up to 270 seconds. We refer to this as "pessimistic locking" in our documentation and all Firestore Server SDKs expose these semantics.

There are some rare cases when the backend aborts currently running transactions (e.g. if there is too much contention), in which case all SDKs will retry, independent of how locks were obtained.

As @merlinnot pointed out, the Mobile SDKs use optimistic locking.

dgobaud commented 5 years ago

So it must mean that if you have 2 runTransaction() calls running at once, the first one gets the lock, and for the second one the function runs, fails, and retries, correct?

This is optimistic locking behavior in the code calling runTransaction().

It is not pessimistic - compare it to how two calls that try to obtain an exclusive PostgreSQL lock work: one freezes and will not go forward until it gets the lock.

schmidt-sebastian commented 5 years ago

If you have two transactions running that both try to read the same document, the first one will succeed and the second one will hang until the first one completes.

dgobaud commented 5 years ago

This is not what seems to happen using "firebase-admin": "^8.4.0":

info: User function triggered, starting execution
info: Original URL: /test
info: Method: GET
info: Path: /test
info: Protocol: http
info: IP: 127.0.0.1
info: hi
info: hi
info: done 1
info: done 2

const timeout = (ms: number) => new Promise((res) => setTimeout(res, ms));

app.use(
  '/test',
  asyncHandler(async (_req, _res) => {
    const firestore = firebaseAdmin.firestore();

    const docRef = firestore.doc('/beta/users');

    // Start two transactions concurrently (intentionally not awaited).
    firestore.runTransaction(async (transaction) => {
      await transaction.get(docRef);

      console.log('hi');

      // tslint:disable-next-line:no-magic-numbers
      await timeout(5000);

      console.log('done 1');
    });

    firestore.runTransaction(async (transaction) => {
      await transaction.get(docRef);

      console.log('hi');

      // tslint:disable-next-line:no-magic-numbers
      await timeout(5000);

      console.log('done 2');
    });

    // tslint:disable-next-line:no-magic-numbers
    await timeout(10000);
  })
);
dgobaud commented 5 years ago

Note: my post above about them continually retrying was actually 2 requests somehow happening at the same time, which confused me - still, as shown below, the code retries on conflict.

Here you can see the second transaction runs twice because of the conflict, but because the first transaction has the shorter delay it is able to eventually get through, and then the second one gets through.

info: User function triggered, starting execution
info: Original URL: /test
info: Method: GET
info: Path: /test
info: Protocol: http
info: Passfolio-App-Version: undefined
info: IP: 127.0.0.1
info: hi
info: hi
info: done 1
info: done 2
info: hi
info: Execution took 2009 ms, user function completed successfully
info: done 2

const timeout = (ms: number) => new Promise((res) => setTimeout(res, ms));

app.use(
  '/test',
  asyncHandler(async (_req, res) => {
    const firestore = firebaseAdmin.firestore();

    const docRef = firestore.doc('/beta/users');

    firestore.runTransaction(async (transaction) => {
      await transaction.get(docRef);

      console.log('hi');

      transaction.update(docRef, { test1: 'hi' });

      // tslint:disable-next-line:no-magic-numbers
      await timeout(500);

      console.log('done 1');
    });

    firestore.runTransaction(async (transaction) => {
      await transaction.get(docRef);

      console.log('hi');

      // tslint:disable-next-line:no-magic-numbers
      await timeout(1000);

      transaction.update(docRef, { test2: 'hi' });

      console.log('done 2');
    });

    // tslint:disable-next-line:no-magic-numbers
    await timeout(2000);

    res.send('');
  })
);
schmidt-sebastian commented 5 years ago

@dgobaud I talked to the backend team more about it: While the second transaction can read the document, its commit is blocked until the first transaction completes. If you need further clarification, I can let the backend team respond directly.

dgobaud commented 5 years ago

@schmidt-sebastian ok got it, so the issue is still that the documentation is wrong - that is optimistic locking. And the commit is not blocked - it seems that it errors, causing the code to run again. I think it has to error, because what has happened is that 2 processes read some data, one of them is allowed to commit, and now the other process has done processing based on invalid data and must start over to ensure correctness. The key thing is the docs are just wrong and should say optimistic.

dgobaud commented 5 years ago

@schmidt-sebastian any update on this? thanks

schmidt-sebastian commented 5 years ago

Unfortunately, it seems like we have a different understanding of the terminology that is being discussed here. We would like to thank you for your input, but we currently do not plan to update our documentation.

haoyuant commented 3 years ago

Hello, I ran into exactly the same issue: I thought that with the admin SDK the second transaction would hang and not run at all until it got the lock, but it turned out to retry on data contention instead, which is very confusing.

Here is the code I used for testing:

import * as admin from 'firebase-admin';

const serviceAccount = require('./xxxxx.json')

admin.initializeApp({
  credential: admin.credential.cert(serviceAccount),
})

async function test() {
  const db = admin.firestore();
  const podRef = db.doc("/pods/pod-test");
  // 1
  db.runTransaction(async t => {
    console.log("1");
    const pod = (await t.get(podRef)).data();
    console.log("1-1");
    await new Promise(resolve =>
      setTimeout(resolve, 5000)
    );
    console.log("1-2");
    await t.update(podRef, { x: pod!.x + 1});
    console.log("1-3");
  });
  await new Promise(resolve =>
    setTimeout(resolve, 1000)
  );
  // 2
  db.runTransaction(async t => {
    console.log("2");
    (await t.get(podRef)).data();
    console.log("2-1");
    await t.update(podRef, { x: 4});
    console.log("2-2");
  });
}

test();

Outputs:

1
1-1
2
2-1
2-2
1-2
1-3
2
2-1
2-2

Apparently, the second transaction didn't wait for the unlock; instead, it retried on data contention. I also agree with @dgobaud, this is more like optimistic concurrency control than the pessimistic one.

dgobaud commented 3 years ago

It is definitely optimistic locking... ¯\_(ツ)_/¯

dgobaud commented 3 years ago

maybe tweeting can help https://twitter.com/davidgobaud/status/1410024766875054086

schmidt-sebastian commented 3 years ago

There are two things going on here:

dgobaud commented 3 years ago

@schmidt-sebastian yes but this misses the point - it is optimistic locking not pessimistic. The documentation is wrong.

thebrianchen commented 3 years ago

@dgobaud @haoyuant To clarify, the underlying locking mechanism for writes in transactions is fundamentally pessimistic. Multiple clients can hold a lock to read a given document. However, to write to a document, the transaction needs to upgrade its read lock to a read-write lock, which can only be done if no other transaction is holding a lock on that document. When multiple clients compete for the lock, the oldest living transaction obtains the lock.

To verify this, you can run the following code snippet that's slightly modified from your example:

async function testTx() {
  const db = firestore;
  const podRef = db.doc("/pods/pod-test");
  await podRef.set({x:100});
  return db.runTransaction(async t => {
    console.log("outer transaction start");
    const pod = (await t.get(podRef)).data();
    console.log("outer tx fetched value:", pod);

    // Since the outer transaction holds a read lock, this transaction will
    // time out after ~20 seconds and retry obtaining the lock, running the entire
    // transaction block again.
    await db.runTransaction(async t => {
      console.log("inner transaction");
      const data = (await t.get(podRef)).data();
      console.log("inner transaction fetched data", data);
      t.update(podRef, { x: 4});
      console.log("inner transaction calling update");
    }).then(() => {
      console.log("inner transaction complete")
    }).catch((e) => {
      console.log("inner transaction failed with error: ", e);
    });

    // Without the above catch() statement, this code block is not reachable due to
    // the pessimistic locking mechanism, which rejects the block. The inner
    // transaction will continue to attempt acquiring the lock by rerunning the
    // entire transaction block.
    t.update(podRef, { x: 10000 + 2});
    console.log("1-3");
  });
}
testTx();

In the above code snippet, the inner transaction will attempt to obtain the write lock, but is unable to due to the pessimistic policy. So while transaction callbacks do employ an optimistic approach in that they rollback and retry rather than wait when a document lock is unavailable, the underlying locking mechanism on a per-document basis is pessimistic.

Unlike PostgreSQL (https://www.postgresql.org/docs/9.1/explicit-locking.html), Firestore does not allow grabbing an actual lock on the data like a normal mutex. We currently don't have plans to introduce such behavior, but if this is important to you, please file an external ticket, which gets aggregated for public voting.

I will update the documentation to reflect this nuance. Note that this is only the state of the SDK as of June 30, 2021, and future versions could alter this implementation as long as the overall semantics and correctness of transactions are preserved.

dgobaud commented 3 years ago

This all makes sense - the whole point is that db.runTransaction is optimistic. Maybe at a lower level it is pessimistic. But db.runTransaction is optimistic.

jakebiesinger commented 2 years ago

@dgobaud after trying for a long time to understand the source of many "too much contention" error messages in Firestore, I think your thinking here might be flawed, at least for the admin SDK.

I may be misreading. @thebrianchen and others, please correct me if I'm wrong...

The clincher is that in the example shown by @thebrianchen, the outer transaction has only done a READ on the doc (not yet a write) and yet the inner transaction is timing out (and then going into the catch block).

This has huge implications for a real system. Any txn READING a given doc will prevent other transactions that write to that doc from proceeding.

When I first started working with firestore, I assumed that all the txns that READ a doc would be forced to restart if a parallel WRITE occurred on documents that were read from... to me that's "optimistic locking". You happily proceed with your business but just before committing, you look back at what happened since your txn started and say "oh darn, someone else wrote to something I read! oh well, better restart!". But it's the opposite in reality-- instead, the semantics are "I can't write to that doc 'cause I see Billy is reading it. I'll just sit here until he's done with it and THEN I'll update it". No one can WRITE to a doc until everyone is done reading it, and THEN there's a write-lock on the doc and no one can read it for a bit.

Consider this situation: many transactions (call them txnA) read a given doc, while another transaction (txnB) needs to write to it.

For very "hot" documents (with lots of reads of style txnA), it's possible that txnB will be delayed substantially / fail entirely.

dgobaud commented 2 years ago

@jakebiesinger it is just different locking methods - yes pessimistic locks block and sometimes that is what you want. The point is the documentation is wrong not that the way locking is implemented is wrong.

jakebiesinger commented 2 years ago

I don't think the docs are wrong, per se, but I agree they're not clear.

Transaction.get(refOrQuery) → {Promise} Retrieve a document or a query result from the database. Holds a pessimistic lock on all returned documents.

The lock will PREVENT ALL WRITES to that doc. So @merlinnot @schmidt-sebastian @thebrianchen, perhaps the documentation could be updated to drop the optimistic/pessimistic verbiage altogether, since it's complicated...

I've been working with firestore extensively for ~18 months now in a complex application and the "how firestore locking works" penny finally dropped for me last week, in spite of reading all the docs available many times and this very bug over and over at different points several months apart.

How about we just say something like:

Transaction.get(refOrQuery) → {Promise} Retrieve a document or a query result from the database. Holds a lock that prevents writes to all returned documents until this transaction resolves or restarts.

dgobaud commented 2 years ago

If something is complicated, you arguably need clearer and more detailed docs - and, of course, correct ones.

doom-goober commented 2 years ago

The Firestore documentation seems to double down on this misunderstanding: https://cloud.google.com/firestore/docs/transaction-data-contention . That documentation explicitly states that client SDKs are optimistic and retry, but server SDKs are pessimistic, take locks, and resolve locks sequentially.

Data contention in the mobile/web SDKs
Optimistic concurrency controls
Based on the assumption that data contention is not likely or that it's not efficient to hold database locks. Optimistic transactions do not use database locks to block other operations from changing data.
Mobile/web SDKs use optimistic concurrency controls, because they can operate in environments with high latency and an unreliable network connection. Locking documents in a high latency environment would cause too many data contention failures.

In the Mobile/Web SDKs, a transaction keeps track of all the documents you read inside the transaction. The transaction completes its write operations only if none of those documents changed during the transaction's execution. If any document did change, the transaction handler retries the transaction. If the transaction can't get a clean result after a few retries, the transaction fails due to data contention.
Data contention in the server client libraries
The server client libraries (C#, Go, Java, Node.js, PHP, Python, Ruby) use pessimistic concurrency controls to resolve data contention.

Pessimistic concurrency controls
Based on the assumption that data contention is likely. Pessimistic transactions use database locks to prevent other operations from modifying data.
Server client libraries use pessimistic concurrency controls, because they assume low latency and a reliable connection to the database.
In the server client libraries, transactions place locks on the documents they read. A transaction's lock on a document blocks other transactions, batched writes, and non-transactional writes from changing that document. A transaction releases its document locks at commit time. It also releases its locks if it times out or fails for any reason.

When a transaction locks a document, other write operations must wait for the transaction to release its lock. Transactions acquire their locks in chronological order.

However, the documentation for the server SDK is... odd. It says it locks on the first transactional read and that lock prevents other transactional writes. If that's the case, then consider this scenario: Process 1: Reads. Process 2: Reads. Process 2: Writes. Process 1: Writes. That means that while Process 2 is attempting to write, it will be blocked from writing because Process 1 has a lock on it. When Process 1 is done writing, Process 2 can write. But that leads to data corruption, as Process 2 overwrites Process 1's work. The workaround is to cancel Process 2's write and retry it from the start.

I think what is happening is that the documentation is leading us to believe that there should be no retries if we're using the server SDK because the documentation doesn't say retry. But that's not what's happening. Server or Client SDK both are using retries. The documentation is simply MISLEADING and is going off into the weeds about how early the retry can be detected without adding much benefit to actual users.

Conclusion: Both the server SDK and the client SDK will re-run the transaction function if contention is detected and fail if there is too much contention. Ignore all the "special" documentation around contention and the server SDK. The server SDK is not special in any meaningful way and for most users the client SDK and server SDK behave the same way.
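Assuming that conclusion holds, the practical takeaway is the same for both SDKs: the function passed to runTransaction() may run more than once, so external side effects should happen only after it commits. A rough sketch (the order document and sendEmail helper are hypothetical, not part of any Firestore API):

import * as admin from 'firebase-admin';

admin.initializeApp(); // assumes default credentials are available
const db = admin.firestore();
const orderRef = db.doc('/orders/order-123'); // hypothetical document

// Hypothetical side effect - stands in for email, billing, a queue publish, etc.
const sendEmail = async (to: string, body: string) => console.log('email to', to, ':', body);

let attempts = 0;

async function confirmOrder() {
  const order = await db.runTransaction(async (t) => {
    attempts++; // on contention this callback re-runs, so attempts can exceed 1
    const snap = await t.get(orderRef);
    const data = snap.data();

    // Don't trigger external side effects here: if the transaction retries,
    // they would run once per attempt.

    t.update(orderRef, { status: 'confirmed' });
    return data; // pass what was read out of the transaction
  });

  // Side effects go after the transaction has committed exactly once.
  await sendEmail(order?.email ?? '', 'Your order is confirmed');
  console.log(`transaction committed after ${attempts} attempt(s)`);
}

confirmOrder();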

schmidt-sebastian commented 2 years ago

The Server SDKs retry transactions if there is too much contention in the overall system, which indicates that the backend cannot successfully place or uphold a pessimistic lock. Server SDK transaction retries should be rare, and they block other clients from making modifications.

Mobile transactions fail whenever another client issues a concurrent write. These concurrent writes are not blocked or queued, and hence failures for mobile transactions can be triggered quite easily.

mlev commented 2 years ago

I agree that the documentation is confusing. As @doom-goober and others outline above, given that only writes are blocked, then for 2 transactions updating the same document retries can't be avoided. This is described nicely in the Optimistic Locking section of the document.

The transaction completes its write operations only if none of those documents 
changed during the transaction's execution. If any document did change, the 
transaction handler retries the transaction.

It seems like the difference between client/server is that a server transaction will wait to obtain a write lock before attempting its writes and then retrying as needed. The client will try immediately.

Vaibhavriju commented 2 months ago

We are going through this issue in production: while the documentation says only Mobile/Web SDKs retry, we are experiencing retries in the server client Node.js libraries as well. @schmidt-sebastian

`Optimistic concurrency controls Based on the assumption that data contention is not likely or that it's not efficient to hold database locks. Optimistic transactions do not use database locks to block other operations from changing data. Mobile/web SDKs use optimistic concurrency controls, because they can operate in environments with high latency and an unreliable network connection. Locking documents in a high latency environment would cause too many data contention failures.

In the Mobile/Web SDKs, a transaction keeps track of all the documents you read inside the transaction. The transaction completes its write operations only if none of those documents changed during the transaction's execution. If any document did change, the transaction handler retries the transaction. If the transaction can't get a clean result after a few retries, the transaction fails due to data contention.`

Pessimistic concurrency controls Based on the assumption that data contention is likely. Pessimistic transactions use database locks to prevent other operations from modifying data. Server client libraries use pessimistic concurrency controls, because they assume low latency and a reliable connection to the database.

Note: To best ensure low latency, use server client libraries from a Google Cloud compute product as close to your Cloud Firestore database as possible. A server client library with a high latency connection can run into issues with locking and data contention. In the server client libraries, transactions place locks on the documents they read. A transaction's lock on a document blocks other transactions, batched writes, and non-transactional writes from changing that document. A transaction releases its document locks at commit time. It also releases its locks if it times out or fails for any reason.

When a transaction locks a document, other write operations must wait for the transaction to release its lock. Transactions acquire their locks in chronological order.