trufflesuite / ganache-cli-archive

Fast Ethereum RPC client for testing and development. See https://github.com/trufflesuite/ganache for current development.
https://www.trufflesuite.com/ganache
MIT License
3.36k stars 695 forks source link

Bounty: Fix race conditions causing problems identified in #417 #453

Closed benjamincburns closed 5 years ago

benjamincburns commented 6 years ago

As a result of #450, in trufflesuite/ganache-core#41, @0xNPE gave us some much needed tests which reproduce the underlying problems behind #417. This bounty is for a fix for the underlying problem which those tests identify.

Bounty acceptance criteria:

  1. Bounty claimant must submit one or more pull requests to trufflesuite/ganache-core and/or its underlying dependencies (likely ethereumjs/ethereumjs-vm and/or ethereumjs/merkle-patricia-tree), and meet the necessary requirements defined by the projects' contributing guidelines for these PRs to be accepted by the project maintainers.
  2. Submissions must make the tests submitted in trufflesuite/ganache-core#41 pass.
  3. The tests submitted in trufflesuite/ganache-core#41 may be modified if necessary to demonstrate the fix, so long as the modified tests continue to demonstrate the underlying problem with the fixes reverted.
  4. Submitters should take care to avoid causing significant performance degradation to ganache-core or any other impacted projects.
  5. Submissions which merely make the tests pass but which do not fix the underlying cause of #417 will not be accepted for the bounty.

Important note: This bounty will not be paid out until all required PRs are accepted and merged.

Tips:

gitcoinbot commented 6 years ago

This issue now has a funding of 0.5 ETH (374.81 USD) attached to it.

owocki commented 6 years ago

happy new year! im going to spend some cycles this week sourcing someone to take this on..

0xNPE commented 6 years ago

FYI, https://github.com/ethereumjs/merkle-patricia-tree/pull/28 looks like a promising fix for TypeError: Cannot read property 'get' of undefined

I dug a little deeper into TypeError: Cannot read property 'pop' of undefined. It happens during ethereumjs-vm runTx when it flushes the cache to its trie after a couple of checkpoints, commit, and revert.

federicobond commented 6 years ago

Hi, I am the author of https://github.com/ethereumjs/merkle-patricia-tree/pull/28.

I also looked at the pop error and I think it's related to reading a non-existent trie node (maybe due to some confusion about the currently active DB).

IIRC, the problematic call stack goes like this:

// baseTrie.js
put
  findPath
    _walkTrie
      _lookupNode
        _walkTrie.processNode
          findPath.processNode
            _updateNode

Where _updateNode is then called with an undefined value for stack.

If you look at the _walkTrie function, you will find that the child processNode function calls cb without arguments if it does not receive a node (!node).

cb then does onDone.apply(null, returnValues) with an empty returnValues array (because walkController.return was never called and thus returnValues never reassigned in this case).

Thus, stack in _updateNode ends up undefined.

Also interesting to look at are the conditions under which _lookupNode (called from _walkTrie) returns a null value.

0xNPE commented 6 years ago

@federicobond your analysis of the call stack is similar to what I observed.

but I don't think the root cause can be fixed in merkle-patricia-tree.

I think the root cause pertains to ethereumjs-vm working on its cache of statemanager while in checkpoint mode, then when it tries to persist its cache after certain revert actions, there is a reference to a node that never existed in the db, hence your call stack occurs.

federicobond commented 6 years ago

That makes sense. I'll see if I find anything else.

gitcoinbot commented 6 years ago

The funding of 0.5 ETH (427.59 USD) attached has been claimed by @kammerdiener.

@kammerdiener, please leave a comment to let the funder (@owocki) and the other parties involved your implementation plan. If you don't leave a comment, the funder may expire your claim at their discretion.

kammerdiener commented 6 years ago

So a quick update on this. Based on what I have found I think that both issues are currently originating from ethereumjs/merkle-patricia-tree. I have fixed the get being undefined. Currently working on figuring out what is going on with pop coming back undefined.

kammerdiener commented 6 years ago

Believe I have fixed it all within ethereumjs/merkle-patricia-tree. The Pull Request is here: PR #30. I tried to take into account all of the information talked about from above and was sure to check @federicobond PR to help fix the _getDBs portion. All of the tests pass now with no errors.

kammerdiener commented 6 years ago

I went into the node_modules folder and replaced the items and am attaching a screenshot of it passing the race conditions test. Once the PR for Merkle Patricia Trees is merged then. Should I go ahead and submit a PR for this with it including the race conditions test (removing the skip command)?

screen shot 2018-01-06 at 8 22 00 pm
benjamincburns commented 6 years ago

@kammerdiener there are actually a couple of different competing submissions at this point. I'll be taking some time early this week to evaluate them. I realize now that I said that the goal was to make the tests pass, but the real goal is to fix #417. The best thing that you can do at this point is to manually test ganache-cli with your fix in place (you can use npm link to do this) and determine whether you observe either of the crashes identified in #417. This is how I'll be evaluating the proposed fixes as well, so it'll take a bit of time, I'm afraid.

benjamincburns commented 6 years ago

@roderik by any chance would you like to help with evaluating these fixes, as I believe you've been impacted by this bug quite regularly?

benjamincburns commented 6 years ago

@kammerdiener, as I feared, when testing your fix I observe a new issue. I believe this is due to what @0xNPE @federicobond were discussing above. That said, your changes do make ganache much more stable, so until a proper fix is found we will likely include this fix in future releases (provided that it is accepted by the ethereumjs/merkle-patricia-tree maintainers).

The newly observed issue:

Error: Error: LevelUpArrayAdapter named 'blockLogs' index out of range: index 190; length: 187
    at ganache-cli/build/lib.node.js:87740:23
    at ganache-cli/build/lib.node.js:87714:5
    at ganache-cli/build/lib.node.js:88160:12
    at ganache-cli/build/lib.node.js:87925:19
    at ganache-cli/build/lib.node.js:92548:7
    at _combinedTickCallback (internal/process/next_tick.js:67:7)
    at process._tickCallback (internal/process/next_tick.js:98:9)
    at Object.InvalidResponse (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:41484:16)
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:329530:36
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176186:11
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:325200:9
    at XMLHttpRequest.request.onreadystatechange (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:328229:7)
    at XMLHttpRequestEventTarget.dispatchEvent (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176415:18)
    at XMLHttpRequest._setReadyState (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176705:12)
    at XMLHttpRequest._onHttpResponseEnd (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176860:12)
    at IncomingMessage.<anonymous> (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176820:24)
    at IncomingMessage.emit (events.js:164:20)
kammerdiener commented 6 years ago

Alright thanks for the information. I will start working trying to figure this out as well. Do you have any more information on how you caused this error?

benjamincburns commented 6 years ago

@kammerdiener it's fairly easy to do this with trufflesuite/ganache, as it'll crash somewhat readily when running truffle test on nearly any truffle project while the accounts page is visible. You can grab the metacoin truffle box if you don't have a project handy.

I recommend using the develop branch Ganache as well, as that will let you see your logs on crash a bit more easily by hitting the "report issue to github" button (just do me a favor and don't submit the issue that it opens up).

Also I'd expected ganache-cli to have good sourcemap support by now. I'm looking into this and will commit a fix to the relevant places once that's done. Will comment back here as well, hopefully with a mapped stack trace rather than the useless one I just posted.

kammerdiener commented 6 years ago

@benjamincburns I will be watching for it. Thanks for the tip. I will see what I can do to fix it and make it more stable!

benjamincburns commented 6 years ago

Also be advised that the error I reported will be visible as an error response in your tests. Ganache doesn't crash in this case, but it's also useless for future tests (rerunning tests w/o restarting Ganache just spits out the same error)

benjamincburns commented 6 years ago

@kammerdiener I managed to fix sourcemaps. I'll commit that to the develop branch of ganache-cli within the next few minutes.

In the mean time, here's the mapped stacktrace:


    at ganache-cli/build/webpack:/ganache-core/lib/database/leveluparrayadapter.js:51:1
    at ganache-cli/build/webpack:/ganache-core/lib/database/leveluparrayadapter.js:25:1
    at ganache-cli/build/webpack:/ganache-core/node_modules/level-sublevel/shell.js:102:1
    at ganache-cli/build/webpack:/ganache-core/node_modules/level-sublevel/nut.js:122:1
    at ganache-cli/build/webpack:/ganache-core/node_modules/cachedown/index.js:53:1
    at _combinedTickCallback (internal/process/next_tick.js:67:7)
    at process._tickCallback (internal/process/next_tick.js:98:9)
    at Object.InvalidResponse (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:41484:16)
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:329530:36
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176186:11
    at node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:325200:9
    at XMLHttpRequest.request.onreadystatechange (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:328229:7)
    at XMLHttpRequestEventTarget.dispatchEvent (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176415:18)
    at XMLHttpRequest._setReadyState (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176705:12)
    at XMLHttpRequest._onHttpResponseEnd (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176860:12)
    at IncomingMessage.<anonymous> (node/v9.3.0/lib/node_modules/truffle/build/cli.bundled.js:176820:24)
    at IncomingMessage.emit (events.js:164:20)```
benjamincburns commented 6 years ago

Actually now that I read that a bit more carefully, it seems that it's actually broken the truffle side of things. I'll dig in a bit further and see if I can't figure out what's gone amiss.

benjamincburns commented 6 years ago

@kammerdiener I'm afraid I also just observed the original issue with your fix in place:

TypeError: Cannot read property 'pop' of undefined
    at CheckpointTrie.Trie._updateNode (ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:359:1)
    at ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:106:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:460:1
    at processNode (ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:465:1)
    at ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:503:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/baseTrie.js:179:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/util.js:75:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/node_modules/async/lib/async.js:52:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/node_modules/async/lib/async.js:269:1
    at ganache-cli/build/webpack:/merkle-patricia-tree/node_modules/async/lib/async.js:44:1
roderik commented 6 years ago

@benjamincburns gladly!

owocki commented 6 years ago

@kammerdiener did you still want to claim the work here, or should i release it to someone else?

kammerdiener commented 6 years ago

@owocki I'll work on it some more. I think I have an idea for the fix.

kammerdiener commented 6 years ago

@owocki I have exhausted all of my potential fixes. If you would go ahead and release it in the interest of getting this fixed sooner. I will look for other issues to try and tackle.

benjamincburns commented 6 years ago

@owocki are claims mutually exclusive, or can multiple people register a claim to the bounty at once? If only one claim can be registered at a time, I think it's best if we reject @kammerdiener's claim until he's resubmitted. I really appreciate your work, @kammerdiener, but I worry that people may see a claim for the bounty here and be put off from trying to solve the issue, when the truth is that the field is still wide open.

kammerdiener commented 6 years ago

@benjamincburns totally understand and totally agree. I am out of ideas as to what to check (although I will be watching for the solution just because of my own curiosity).

owocki commented 6 years ago

Hi @benjamincburns -- only one person can claim work for an issue at a time. We do this so that the network is not an enabler of spec work.

I'm rejecting @kammerdiener 's claim now.. Thanks for the work and the solid communication.

gitcoinbot commented 6 years ago

This claim for the funding of 0.5 ETH (516.33 USD) attached to this issue has been rejected and can now be claimed by someone else.

gitcoinbot commented 6 years ago

The funding of 0.5 ETH (491.36 USD) attached has been claimed .

If you are the claimee, please leave a comment to let the funder (@owocki) and the other parties involved your implementation plan. If you don't leave a comment, the funder may expire your claim at their discretion.

benjamincburns commented 6 years ago

I think the best (only?) way that this can be solved such that the band-aid fix of processing each request individually can be removed is if instances of merkle-patricia-tree are short-lived and not shared between requests. Rather, each instance should share the same backing db.

I had a local branch open a while ago that did this, and which also created instances of the EVM as needed, but I never managed to get it working with the forking feature.

clehene commented 6 years ago

The tests expose two different issues both in the https://github.com/ethereumjs/merkle-patricia-tree. There may be additional problems between the ganache doubles

I have both working, but need to cleanup code and will put up a patch

owocki commented 6 years ago

hey @clehene are you the claimee above?

clehene commented 6 years ago

@owocki hey, yes

benjamincburns commented 6 years ago

That's exciting news, @clehene! I'm eager to see what you've come up with!

benjamincburns commented 6 years ago

@clehene Just a tip - if you're going to test using Ganache like I described here, you'll want to be sure to disable the request queuing in ganache-core, found in lib/provider.js.

clehene commented 6 years ago

@benjamincburns hey first sorry for the delay. Second, I'm afraid I'll need a bit more info on how to do that test or the one mentioned in #417 - i.e. concrete instructions

I tried a few things, but I'm otherwise doing many things for the first time, so likely missing some context.

benjamincburns commented 6 years ago

I can't tell what would be request queueing (requestfunnel subprovider?)

Request queuing is here

Easiest way to disable it is to pass an options object with asyncRequestProcessing = false. You can hard code this in ganache in src/chain.js to make life a little easier.

if I do a truffle test in metacoin it works by default

Just a heads up that if you don't pass asyncRequestProcessing = false then this will likely be the case, as the queuing mechanism I linked to above sidesteps the issue.

if I do npm start in ganache and then truffle test in metacoin it fails with a complaint its' missing ConvertLib

If I remember correctly, ConvertLib is a contract which comes as part of the metacoin box. I'd recommend reinstalling the box.

benjamincburns commented 6 years ago

Also @clehene if you're thinking it's working well in ganache-cli, I'm happy to help test it in the Ganache UI if you're having difficulties - just need you to point me at the branch(es) which contain(s) your fix(es).

benjamincburns commented 6 years ago

@clehene actually, I can do one better than that. I've just pushed a branch to ganache-core called fix-forking-debug. In this branch I add test/forked_debug.js, which is essentially the same test as test/debug.js except I'm debugging a transaction from a forked chain. The changes I've made in that branch should have been enough to make things work, but unfortunately I run into the missing stack in _updateNode (aka Cannot read property 'pop' of undefined) crash.

I have a feeling that if your changes make this test pass, plus 0xNPE's tests, we'll be in really good shape.

clehene commented 6 years ago

later edit: missed the last few comments after my previous

@benjamincburns I spent some time over the weekend: I had a bit of trouble figuring out how to link modules to run the cli with my version of ganache-core. Guessing the two options you mentioned were _queueRequest vs `_sendAsnyc.

I had my initial work based on just the unit tests provided, so had to re-introduced some scaffolding in the otherwise cleaned up code.

The two unit tests address 2 distinct problems. Both are fixed. In short there's at least an additional one related to async ops under semaphores which will reproduce with truffle test and sendAsync but only under certain conditions, bot not captured by the unit tests.

In other words, I can probably spend some time later this evening to put up a patch to start the review and some validation, before a full assessment of the remaining stuff.

clehene commented 6 years ago

but unfortunately I run into the missing stack in _updateNode (aka Cannot read property 'pop' of undefined) crash.

@benjamincburns the current fix (which will need a lot of scaffolding cleanup so may take a bit of time this evening) fixes the missing stack and also a good degree of the get issues scenarios.

Asone commented 6 years ago

y9w6o_s-200x150

@clehene : I don't know how to say it, but thanks for the coming patch ! I've never been F5'ing an issue so much :p

This bug has been so annoying ! So thanks again a thousand times ! Also Thanks to the ganache-cli team for the tools you provide :3

I'll take a rare pleasure to read the patch code and try to improve my understanding of this awesome tool :D

P.S : however no rush, this was to thanks, not to rush anything, i'll wait the needed time to enjoy a clean fix :)

clehene commented 6 years ago

I fixed that last get undefined issue (and added a few additional locks in places where it was clear it could have caused issues). However this last issue was trickier and is partially a break of encapsulation problem: getRaw should probably be private but is called directly from Account.

To recap, when doing a checkpoint we end up with two dbs and all reads (getRaw->db.get) happen on both asynchronously. When things are committed or reverted the second db goes away and the in-flight get will try the second instance (think db[0] db[1]) that's not there anymore and fail with undefined .

This means that access needs to be synchronized around anything that would race for db instances. Many of the operations were already using semaphores for this however some didn't (e.g. checkpoint, get). The problem with the last race was due to direct access to getRaw from Account.prototype.getCode. Again IMO this is mostly an encapsulation problem as get should probably be used instead (and perhaps that's the right fix, although I'd say it would also require making getRaw "private". The alternative (which is what I'll include in the patch) for a merkle-patricia-tree only fix is to mark the under-lock state with a flag (param), based on which's absence (if flag is off) the semaphore can be acquired to avoid the race.

The pop issue, like I mentioned, was a logical one and happened due to propagation (or lack thereof) not found errors.

While most of this conversation has happened on ganache-cli all the problems are actually in merkle-patricia-tree.

Lastly, the merkle-patricia-tree code is more complex than it perhaps should given the functionality (deep callback stack, mutual recursion, heavy async, etc.). In addition the stuff around it, there may be other issues lying around,

As a caveat - now with the get / pop issues out of the way I can sometimes see https://github.com/trufflesuite/truffle/issues/558 https://github.com/trufflesuite/ganache-cli/issues/433

Initially I was concerned that they were somehow introduced by these fixes and tried to see how but that wasn't the case (not sure how they occur though) however I've confirmed them without the fixes as well.

clehene commented 6 years ago

https://github.com/ethereumjs/merkle-patricia-tree/pull/33

benjamincburns commented 6 years ago

Great work @clehene! I've added a comment to the PR, however bear in mind that I don't maintain that repo, so I can't accept that PR. I'll flag down some of the ethereumjs folks tomorrow and see if we can get someone to review it.

clehene commented 6 years ago

Quick update. summarizing the PR mention here as I closed it for now: As I updated multiple methods to use the semaphore, there were two without callbacks and I missed fully fixing them. As checkpoint takes a semaphore and changed from sync to async, there are implications to call sites and wanted to revalidate after changing the callsites. I also realized the second one was unnecessary after inspecting all possible ways to get to it and realizing it should be locked already. As I'm blowing in yogurt ;) I'm double checking the two assumed unrelated issues to be sure.

I don't think the changes would be to big / hard (I've made them in the libraries involved with truffle) and seem to be ok.

benjamincburns commented 6 years ago

Hi @clehene, how are things going? I can't help but feel like you were sooooo close with the changes you submitted last time. No pressure, but let me know if there's anything I can do to be of assistance!

clehene commented 6 years ago

@benjamincburns hey, I've integrated the changes in VM in order to use the async checkpoint. However I've started seeing the ou of gas errors after (The contract code couldn't be stored, please check your gas amount.). I don't think my fix is causing it necessarily, but it may make it occur more frequently now (https://github.com/trufflesuite/truffle/issues/558). Hence I've instrumented all libs to track things and went down the rathole tryin to fix that too. Found one or two more potential issues... Meanwhile had a hand surgery and a bit slower :). I can repost the patch with the latest changes and take it from there.

benjamincburns commented 6 years ago

Meanwhile had a hand surgery and a bit slower :)

"... with one hand tied behind his back!" :joy:

No pressure - definitely take it easy if you're still in recovery.

I appreciate you trying to sort out trufflesuite/truffle#558, but I think if you're reasonably confident that the change you've made is necessary to fix #417, then you've just exposed some other existing flaw in the process, not created a new one. I'm overjoyed if you want to take that on, too - but don't feel that it's necessary to capture this bounty.

ghost commented 6 years ago

@benjamincburns , @owocki the bounty shows up as "open" with "expired funds". It looks that someone is working on this, pleas clarify the status.