kriszyp / lmdb-js

Simple, efficient, ultra-fast, scalable data store wrapper for LMDB
Other
482 stars 39 forks source link

MDB_PROBLEM: Unexpected problem - txn should abort #153

Open LekoArts opened 2 years ago

LekoArts commented 2 years ago

Hi Kris! Every now and then a user reports this error. Or our CI or our testing itself runs into it:

MDB_PROBLEM: Unexpected problem - txn should abort

User error:

user-error

Our CI run (https://app.circleci.com/pipelines/github/gatsbyjs/gatsby/81406/workflows/46bdcdf9-85eb-444a-aab1-b92845954af3/jobs/964466) failed with it intermittently:

MDB_PROBLEM: Unexpected problem - txn should abort

  1 | query pageHomecircleciprojectstartersdefaultsrcpagesusingTypescriptTsx2907560070 {
  2 |   site {
> 3 |     buildTime(formatString: "YYYY-MM-DD hh:mm a z")
    |     ^
  4 |   }
  5 | }
  6 |

File path: /home/circleci/project/starters/default/src/pages/using-typescript.tsx
Url path: /using-typescript/
Plugin: none

  Error: MDB_PROBLEM: Unexpected problem - txn should abort

  - open.js:156 
    [default]/[lmdb]/open.js:156:21

  - write.js:762 LMDBStore.transactionSync
    [default]/[lmdb]/write.js:762:17

  - open.js:155 new LMDBStore
    [default]/[lmdb]/open.js:155:11

  - open.js:218 LMDBStore.openDB
    [default]/[lmdb]/open.js:218:6

  - cache-lmdb.ts:60 GatsbyCacheLmdb.getDb
    [default]/[gatsby]/src/utils/cache-lmdb.ts:60:44

  - cache-lmdb.ts:69 GatsbyCacheLmdb.get
    [default]/[gatsby]/src/utils/cache-lmdb.ts:69:17

  - date.ts:234 formatDate
    [default]/[gatsby]/src/schema/types/date.ts:234:53

  - date.ts:318 Object.resolve [as originalResolver]
    [default]/[gatsby]/src/schema/types/date.ts:318:20

  - task_queues.js:93 processTicksAndRejections
    internal/process/task_queues.js:93:5

  - task_queues.js:62 runNextTicks
    internal/process/task_queues.js:62:3

  - timers.js:434 processImmediate
    internal/timers.js:434:9

  - async Promise.all

I'm not too well versed with all the moving pieces yet so I'll ask my colleague to maybe shed some more light on this, but wanted to report it nevertheless so that a) I don't forget it and b) we can get behind this :)

Files from the stacktrace:

Thanks for your work!

kriszyp commented 2 years ago

I don't think that I can tell what is occurring from this information alone, but I have added some additional details for the error messages when a MDB_PROBLEM occurs to better determine the cause (it is possible that is related #41, but not sure). I published this additional error information in the v2.3.0-beta3 release if you want to try that out.

LekoArts commented 2 years ago

Ok, thanks! I'll tell users to use this version until it's in a stable version then (when they can more reliably reproduce this).

kriszyp commented 2 years ago

Have you ever seen this error again (with the more detailed error message)? Still would be interested in knowing what caused this.

LekoArts commented 2 years ago

We had to pin lmdb (https://github.com/gatsbyjs/gatsby/pull/35397) so our own tests didn't run this version yet. (cc @pieh)

kriszyp commented 2 years ago

That issue (#156) has been fixed though, so you can use latest now, right?

LekoArts commented 2 years ago

We need to update https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby/src/schema/graphql-engine/lmdb-bundling-patch.ts first for the new binary loading -- but that's on us.

kriszyp commented 2 years ago

Ok, no rush, was just curious.

kriszyp commented 2 years ago

@LekoArts actually looking at the patch module you are using... would you like me to use more statically analyzable code like fs.readFileSync(require.resolve('../dict/dict.txt')) for reading the dictionary? Referencing binaries might be a bit better trickier though. But maybe could do something like listing them out:

try {
require.resolve('lmdb-win32-x64/node.napi.node');
require.resolve('lmdb-win32-x64/node.abi93.node');
require.resolve('lmdb-linux-x64/node.napi.node');
....
}catch(e){}

Or do you think that would just cause more errors/confusion with vercel (the require would be coming from a createRequire too)?

pieh commented 2 years ago

Hey @kriszyp - about the hacky patch we are using, ideally we don't need patch like that, but I did some attempts at playing with lmdb code before in hope I could submit pull request and frankly didn't exactly like what I was doing there and that's when I went with the "hacky patch" route.

dict.txt part problem was probably the easier of the 2 (the problem there IIRC was that lmdb package setup is that entry to package could be in different "directory levels" - depending if it's commonjs or ESM):

https://github.com/DoctorEvidence/lmdb-js/blob/4d6371dc6faaa69f2a1a43d980ace0f328c89106/package.json#L22-L32

The binaries loading part also shared that, but I think there were more problems there which I just don't remember (it was a while ago).

Other than above, with lmdb@2.3, node-gyp-build-optional-packages was introduced (vs node-gyp-build) which I suspect won't work with @vercel/webpack-asset-relocator-loader:

  1. It looks for specific things (like require('node-gyp-build'))
  2. Given https://github.com/kriszyp/node-gyp-build/commit/57be47b88bf4aea751bdbae93f90064eab5d960a they would actually need to run this "modified" load.path function from your forked package and not just try regular node-gyp-build (at least until https://github.com/prebuild/node-gyp-build/pull/45 land) I think.

For lmdb@2.3 the node-gyp-build-optional-packages is what's not working anymore with our patch and this is part that we will need to adjust to be able to bump it - as @LekoArts mentioned - that's on us.

In general - non-web-frontend npm packages have varying levels of "support" for being bundled and it's hard to expect them to work out-of-the-box for that, so I just didn't want to waste your time on our niche use case for lmdb package. It's also a bit fragile in general so we do prefer to pin versions as it is safer for our users (upside is things shouldn't randomly break, downside is we will be slower on updates than using ^ version selector or similar)

kriszyp commented 2 years ago

Ok, thanks for the explanation! And yeah hopefully the node-gyp-build PR merges soon. And again, no rush, was just checking.

wardpeet commented 2 years ago

This is the error message I get with the 2.3.0 beta

Error message ``` There was an error in your GraphQL query: MDB_PROBLEM: Unexpected problem - txn should abort: mdb_page_unspill no parent 1 | query DateQuery { 2 | dateTime: contentfulDate(contentful_id: {eq: "38akBjGb3T1t4AjB87wQjo"}) { 3 | title 4 | date: dateTime > 5 | formatted: dateTime(formatString: "D.M.YYYY - hh:mm") | ^ 6 | } 7 | dateTimeTimezone: contentfulDate(contentful_id: {eq: "6dZ8pK4tFWZDZPHgSC0tNS"}) { 8 | title 9 | date: dateTimeTimezone 10 | formatted: dateTimeTimezone(formatString: "D.M.YYYY - hh:mm (z)") 11 | } 12 | date: contentfulDate(contentful_id: {eq: "5FuULz0jl0rKoKUKp2rshf"}) { 13 | title 14 | date 15 | formatted: date(formatString: "D.M.YYYY") File path: /Users/ward/projects/gatsby/gatsby/e2e-tests/contentf ul/src/pages/date.js Url path: /date/ Plugin: none Error: MDB_PROBLEM: Unexpected problem - txn should abort: mdb _page_unspill no parent - open.js:177 [contentful]/[lmdb]/open.js:177:17 - write.js:771 LMDBStore.transactionSync [contentful]/[lmdb]/write.js:771:17 - open.js:176 new LMDBStore [contentful]/[lmdb]/open.js:176:11 - open.js:240 LMDBStore.openDB [contentful]/[lmdb]/open.js:240:6 - cache-lmdb.ts:60 GatsbyCacheLmdb.getDb [contentful]/[gatsby]/src/utils/cache-lmdb.ts:60:44 - cache-lmdb.ts:69 GatsbyCacheLmdb.get [contentful]/[gatsby]/src/utils/cache-lmdb.ts:69:17 - date.ts:234 formatDate [contentful]/[gatsby]/src/schema/types/date.ts:234:53 - date.ts:318 resolve [contentful]/[gatsby]/src/schema/types/date.ts:318:20 - task_queues:96 processTicksAndRejections node:internal/process/task_queues:96:5 - task_queues:65 runNextTicks node:internal/process/task_queues:65:3 - timers:437 processImmediate node:internal/timers:437:9 - async Promise.all ```

This the source code we're hitting

https://github.com/gatsbyjs/gatsby/blob/9b25267009f318949705e2e1faf7af859b0b668a/packages/gatsby/src/schema/types/date.ts#L224-L259 ``` const formatDate = async ({ date, fromNow, difference, formatString, locale = `en`, }: IFormatDateArgs): Promise => { const normalizedDate = JSON.parse(JSON.stringify(date)) if (formatString) { const cacheKey = `${normalizedDate}-${formatString}-${locale}` const cachedFormat = await getFormatDateCache().get(cacheKey) if (cachedFormat) { return cachedFormat as string } const result = moment .utc(normalizedDate, ISO_8601_FORMAT, true) .locale(locale) .format(formatString) await getFormatDateCache().set(cacheKey, result) return result } else if (fromNow) { return moment .utc(normalizedDate, ISO_8601_FORMAT, true) .locale(locale) .fromNow() } else if (difference) { return moment().diff( moment.utc(normalizedDate, ISO_8601_FORMAT, true).locale(locale), difference ) } return normalizedDate } ```

Let me know if this helps or not at all. We're basically checking if the element is in the cache if not we set it

kriszyp commented 2 years ago

That helps, that definitely is not where I had been expecting an MDB_PROBLEM to occur, this seems to point to the unspilling operation. A little explanation of this: in LMDB, if a transaction gets too large and has more than about 500MB of "dirty" data (that hasn't been committed to disk), it starts "spilling" or writing it to disk. But if that data is also modified again in the same transaction it is "unspilled" back to memory, and this appears to be where the error is occurring. It is not very clear how this code would trigger this, like you said, it looks just a basic get/put. Is this part of a very large number of writes in the same event turn/transaction (writing more than 500MB in the operation)? Are you using child transactions (childTransaction) at all? Of course if there is a way I can run a test locally to reproduce, that would be helpful.

I am investigating if this assertion/error really is necessary and the state is detrimental. It may be possible to simply turn off of this check or make it a warning (which basically allows unspilling even if a page can't be found in the spilled list), but will do some checking of the effects. Of course the potential problem is if this allows for a bad state where invalid data is returned or corruption can take place.

Also FYI, I have not removed these more detailed messages, you can use the latest lmdb version (2.3.5 right now), and the extra error messages should still be there.

wardpeet commented 2 years ago

I'll see if I can get you a repro. We're not using childTransactions and I don't event think we're using any transactions ourselves.

kriszyp commented 2 years ago

Well, everything in LMDB is in a transaction, it is just implicit/automatic if there is no explicit transaction, and usually one transaction per event turn, unless there is a lot of heavy writing where more gets batched into the current transaction.

kriszyp commented 2 years ago

Ok, v2.3.6 should have the assertion switched to just being a warning (with a little more info), so the application can proceed and we can see if it really is a bad state with detrimental outcomes, or can safely be ignored.

LekoArts commented 2 years ago

gatsby@next now uses lmdb 2.3.10 so going forward we should get better error logs and will post them here 👍

lauraturk commented 2 years ago

We're using Gatsby cloud for 5 staging sites and seeing this issue pop up multiple times a day. Per Gatsby support team's suggestion, I updated our gatsby version to 4.13.1 yesterday and 3 of the 5 sites failed with this error on that PR build. Please let me know how we can best support you all in finding a solution!

167911690-5d1468b5-b5da-48e4-888b-e7065d94fe52

https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/f3997d6e-da36-4555-8cba-e424b040cbe3/builds/5a3864d0-73d6-4348-9a1e-0e696ef0ed26/details#rawLogs https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/da690a16-3284-48d1-8e0f-4545d6169068/builds/eb1574d3-8957-41cf-bd6b-c567a349ecc7/details#rawLogs https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/0acef4d7-fb7c-4e1d-bee5-649347611ebb/builds/90c8d4a2-2854-4e3d-8def-0da4f3e7ac57/details#rawLogs

kriszyp commented 2 years ago

@lauraturk FYI, we believe this may be related or the same issue #164, which also has plenty of discussion of our efforts to track this down. Certainly if you are ever able to come up with a reproducible test case that I could run locally, we would be most grateful (I've never been able to reproduce the issue in this ticket, so most of the discussion has been theorizing about ways that it might possibly occur).

lauraturk commented 2 years ago

Hi @kriszyp , unfortunately, I am not able to create a reproducible test case for you to run locally. It's very random across our 5 gatsby sites. Per Gatsby support I upgraded to gatsby@next today, and only one site failed. Hopefully this log is helpful? https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/da690a16-3284-48d1-8e0f-4545d6169068/builds/cb4bfd92-c8a9-4d73-9ff6-b96b3d94ffe1/details#rawLogs

KyleAMathews commented 2 years ago

He can't access your build logs. Could you copy in the relevant lines?

On Fri, May 20, 2022, 3:09 PM Laura Turk @.***> wrote:

Hi @kriszyp https://github.com/kriszyp , unfortunately, I am not able to create a reproducible test case for you to run locally. It's very random across our 5 gatsby sites. Per Gatsby support I upgraded to @.*** today, and only one site failed. Hopefully this log is helpful? https://www.gatsbyjs.com/dashboard/f2f3e6f9-9b33-4fb6-92b2-f1d7711d4e15/sites/da690a16-3284-48d1-8e0f-4545d6169068/builds/cb4bfd92-c8a9-4d73-9ff6-b96b3d94ffe1/details#rawLogs

— Reply to this email directly, view it on GitHub https://github.com/DoctorEvidence/lmdb-js/issues/153#issuecomment-1133424246, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAARLBZIG2JN42MNASQW5B3VLAER5ANCNFSM5SVAVOBQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

kriszyp commented 2 years ago

@lauraturk I would still love to see what logs you got for that error, if you can paste them in?

lauraturk commented 2 years ago

Thanks for your patience, here it is:

16:00:28 PM:
success Building HTML renderer - 28.962s

16:00:28 PM:
success Execute page configs - 0.126s

16:00:32 PM:
success Caching Webpack compilations - 0.001s

16:00:32 PM:
success Validating Rendering Engines - 4.071s

16:00:36 PM:
../dependencies/lmdb/libraries/liblmdb/mdb.c:7602: Assertion 'root > 1' failed in mdb_page_search()

16:00:36 PM:
Go here for troubleshooting tips: https://gatsby.dev/pqr-feedback

16:00:36 PM:
ERROR An error occurred during parallel query running.

16:00:36 PM:
16:00:36 PM:
    [www]/[gatsby-worker]/dist/index.js:117:45

16:00:36 PM:
  - index.js:117 ChildProcess.

16:00:36 PM:

16:00:36 PM:
  Error: Worker exited before finishing task

16:00:36 PM:
  - node:events:390 ChildProcess.emit

16:00:36 PM:
  - child_process:290 Process.ChildProcess._handle.onexit

16:00:36 PM:
    node:events:390:28

16:00:36 PM:

16:00:36 PM:

16:00:36 PM:
    node:internal/child_process:290:12

16:00:36 PM:

16:00:36 PM:
16:00:36 PM:
not finished run queries in workers - 3.943s
kriszyp commented 2 years ago

Thanks for the logs, this is helpful. This is a db corruption error, which is basically impossible to occur in typical usage. There are only few things that I know that can lead to this:

It is clear from these errors that there are multiple child processes (rather than just threads). This is fine to do with LMDB, but is a helpful insight. From this comment https://github.com/DoctorEvidence/lmdb-js/issues/164#issuecomment-1135060811 it sounds like that error might be specific cloud configuration where this occurs (GCP Kubernetes). I wonder if perhaps this configuration uses a remote/network drive/storage that isn't playing nicely memory maps (there are known problems with some remote file systems with LMDB )? I don't know enough about the gatsby infrastructure to know if this is the same platform where the error in this last message occurred?

kriszyp commented 2 years ago

I discovered and fixed an issue with file locking (causing locks to break), that should be fixed v2.4.5, and might possibly address this issue and #164.

LekoArts commented 2 years ago

The CircleCI tests in https://github.com/gatsbyjs/gatsby/pull/35724 seem to consistently fail on MDB_BAD_RSLOT: Invalid reuse of reader locktable slot: The reader lock pid 0, txn -1, doesn't match env pid 523 - at least when I re-trigger runs it fails very often: https://app.circleci.com/pipelines/github/gatsbyjs/gatsby/83182/workflows/29ded013-2e0a-4cb8-9837-b0d87062049c/jobs/990452

I can SSH into that pod and get you any info you need :)

kriszyp commented 2 years ago

Sorry, this is kind of trial and error. I believe this latest error is probably because the fix to better ensure database file identification was incorrectly handling the case of initial creation of the database (was doing the stat check before it was created). Published a fix for this in v2.5.1, if you can try it out when you get a chance.

LekoArts commented 2 years ago

I was OOO for a week, but I'll update the PR to 2.5.2 now and let you know how it goes :)

kriszyp commented 2 years ago

BTW, I would also be interested in adding an integration test of gatsby using lmdb-js. If you have any pointers/suggestion of a good test that would efficiently test gatsby's usage of lmdb, would love to add that.