lbryio / lbrycrd

The blockchain that provides the digital content namespace for the LBRY protocol
https://lbry.com
MIT License

Concurrent RPC requests against the ClaimTrie blocked critical section lock #126

Closed tiger5226 closed 4 years ago

tiger5226 commented 6 years ago

Summary

While syncing the ClaimTrie in LbryCRD with Chainquery, a bottleneck was identified: a mutex lock on the critical section blocks concurrent RPC connections from accessing the claimtrie via the same RPC call. The example is the getclaimsforname call. The mutex lock is taken in several places:

Affected Code

getclaimsintrie https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L36

getclaimtrie https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L108

getvalueforname https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L182

getclaimsforname https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L308

getclaimbyid https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L373

gettotalclaimednames https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L414

gettotalclaims https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L434

gettotalvalueofclaims https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L456

getclaimsfortx https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L495

getnameproof https://github.com/lbryio/lbrycrd/blob/master/src/rpc/claimtrie.cpp#L720

Proposal

These calls all lock a critical section, a mechanism used in many areas of the Bitcoin code. However, since we are only dealing with the claimtrie here, we should not need an exclusive critical-section lock. We should implement a shared mutex that allows both exclusive (read/write) and shared (read-only) access.

Processing a block should require exclusive access to the claimtrie, since it can add, delete, or update claimtrie nodes. The RPC calls above only read the trie, so they should take shared access, allowing concurrent reads of the claimtrie.
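
To make the intent concrete, here is a minimal sketch of that split (illustrative only; cs_claimtrie and the function names are assumptions, not the actual lbrycrd code):

#include <mutex>
#include <shared_mutex>

// Hypothetical shared mutex guarding only the claimtrie (not part of lbrycrd).
std::shared_timed_mutex cs_claimtrie;

void ProcessBlockClaims(/* block data */)
{
    // Exclusive access: block processing may add, delete, or update trie nodes.
    std::unique_lock<std::shared_timed_mutex> lock(cs_claimtrie);
    // ... mutate the claimtrie ...
}

void GetClaimsForNameRpc(/* rpc params */)
{
    // Shared access: read-only RPC calls can run concurrently with each other.
    std::shared_lock<std::shared_timed_mutex> lock(cs_claimtrie);
    // ... read the claimtrie and build the RPC response ...
}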

It appears this lock was simply copied from other Bitcoin RPC calls.

Use Case

The use case is processing claim-specific information: you can get the claims in the trie with getclaimsintrie, but subsequent calls to getclaimsforname are required to get certain fields such as effective_amount or valid_at_height.

Reproduction 1 - Chainquery

chainquery.zip (linux)
run: ./chainquery serve
branch: https://github.com/lbryio/chainquery/tree/lbrycrdissue
code: https://github.com/lbryio/chainquery/blob/lbrycrdissue/daemon/jobs/claimtriesync.go#L21

Outputs the time per call while iterating through the claim names and getting the claims for each name from lbrycrd.

Reproduction 2 - Apache Benchmark

kauffj commented 6 years ago

This is a great ticket. Thank you @tiger5226 !

BrannonKing commented 6 years ago

I'm planning to work on this in the upcoming week.

tiger5226 commented 6 years ago

Excellent! Let me know if you need any assistance with the reproduction! Happy to help!

@shyba also had a reproduction that did not make it into the issue; I wanted to get it from him for you, in case you would like to use Apache Benchmark for the reproduction:

[screenshot: Apache Benchmark (ab) command and output, 2018-05-24]
BrannonKing commented 6 years ago

In looking at this, I see three approaches (in order of difficulty):

  1. Modify the cs_main lock to be shareable (reader/writer).
  2. Separate out cs_main into multiple locks, with a shareable one for pclaimTrie.
  3. Change the data structures to be immutable.

I'm planning to go with option 1. This will add overhead to each of those lock calls; it assumes that the majority of our calls are reader operations and that each lock protects hundreds to thousands of CPU instructions. Option 2 would require some restructuring, and I'm not sure it would perform any better than option 1. Option 3 would require more effort than I can commit to at this time.

As sub-options of option 1:

  1. Use the boost shareable/upgradable mutex.
  2. Use the C++14 shareable mutex.
  3. Pull the fast shareable mutex code from Bloomberg BSL.

Again, I'm planning sub-option 1, given the code precedent and my time constraints. Let me know if you have thoughts to the contrary or other ideas. Thanks.
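
For reference, a rough sketch of sub-option 1 using boost's shareable/upgradable mutex (purely illustrative, with an assumed standalone mutex rather than the real cs_main; the upgrade path is what distinguishes boost's offering):

#include <boost/thread/locks.hpp>
#include <boost/thread/shared_mutex.hpp>

boost::shared_mutex cs_shared; // hypothetical shareable mutex, not the real cs_main

void ReadOnlyRpc()
{
    boost::shared_lock<boost::shared_mutex> lock(cs_shared); // many concurrent readers
    // ... read-only work ...
}

void MaybeWrite()
{
    boost::upgrade_lock<boost::shared_mutex> read(cs_shared); // shared, but upgrade-capable
    bool needsUpdate = true; // placeholder for a real check
    if (needsUpdate) {
        boost::upgrade_to_unique_lock<boost::shared_mutex> write(read); // become the sole writer
        // ... mutate state ...
    }
}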

tiger5226 commented 6 years ago

We probably want @lyoshenka and @kaykurokawa to chime in here, but I would think we want a separate lock for the claimtrie (option 2). I am not sure of the implications of option 1, as that lock is used for the blockchain (blocks, transactions, etc.) RPC calls.

BrannonKing commented 6 years ago

Other reasons I was leaning towards modifying cs_main: there are many places in main.cpp that could benefit from the reader/writer approach; it appears that most operations there are merely reading the present state. Very few places actually modify state. Hence, we would benefit at that level in addition to the RPC level. When the miner is running it does request the exclusive lock much more frequently, but that situation would be no worse than it is now. Second, many of the RPC methods use both pcoinsTip and pclaimTrie while main.cpp uses a combo of pcoinsTip and chainActive. I'm obviously new to this codebase; I don't have a firm grip on those relationships yet. However, I was hesitant to add in an additional lock to main.cpp. Nested locks always increase the deadlock risk.

tiger5226 commented 6 years ago

Excellent points. It sounds more risky to add a lock than to extend the current lock's features. I am not familiar enough with main.cpp to make a call. @lyoshenka, @kaykurokawa will respond soon. I'll assign it to them so it shows up in their queues.

lyoshenka commented 6 years ago

@kaykurokawa @lbrynaut @shyba can you guys please take a look at this and give Brannon some guidance? thanks :-)

kaykurokawa commented 6 years ago

I'm going to rule out 1) Modify the cs_main lock to be shareable (reader/writer)

We don't want to make big changes to any components that we inherited from Bitcoin Core unless it's absolutely necessary. This is because a) Bitcoin Core has far more eyes on it than we have, and b) staying close to upstream will make it easier for us to pull in changes from Bitcoin Core.

I think 2) is a much better option. I'll have to think a bit about how best to approach 2), but I just wanted to chime in with this thought for now.

BrannonKing commented 6 years ago

I discovered that approach 1) is more difficult than I first thought: the recursive lock mechanism doesn't work out-of-the-box for shared locks. I've come to despise recursive locks (even though I've used them for years in C#). I've seen some recursive shared locks on the web; I think I'd have to massage one of those before I trusted it. The other angle would be to map out all the LOCK(cs_main) calls and eliminate the recursive usage; this is fairly invasive and should have been done upstream years ago.
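
To illustrate the pitfall (an assumption about the failure mode, not code from either branch): with a recursive CCriticalSection the nested acquisition below is fine, but a plain shared mutex makes it undefined behaviour and a practical deadlock once a writer queues in between.

#include <shared_mutex>

std::shared_mutex cs; // standard shared mutexes are not recursive

void Inner()
{
    std::shared_lock<std::shared_mutex> lk(cs); // second shared lock on the same thread:
    // undefined behaviour, and a deadlock in practice if a writer is waiting
}

void Outer()
{
    std::shared_lock<std::shared_mutex> lk(cs); // first shared lock
    Inner(); // fine with a recursive CCriticalSection, broken with a plain shared mutex
}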

As for approach 2), I started on that here: https://github.com/BrannonKing/lbrycrd/tree/shared_locks_claimtrie . It still has several deadlocks in it. It struggles with the lack of a recursive shared lock as well; I think I can map out the call tree for that, though, as there are far fewer calls. I've also been pondering using a different type of critical-section protection for that.

lbrynaut commented 6 years ago

@BrannonKing I didn't review in detail, but this approach looks like a dead-end. Boost upgrade/shared locks are the way to go, but I strongly agree with @kaykurokawa that there are many subtleties to playing with the cs_main lock, and it's not a quick fix to replace/modify the usage of it. Adding more locks in the meantime is not beneficial.

tiger5226 commented 6 years ago

@kaykurokawa ruled out 1 (making cs_main an upgrade/shared lock), and it appears you are suggesting that approach while agreeing it is a big lift. So is the suggestion that this is not beneficial to solve? Just want to make sure @BrannonKing has the guidance he needs.

lbrynaut commented 6 years ago

@tiger5226 Compared to the implementation attempt of 2), I was commenting that upgrade locks are preferable to additional locks/critical sections.

I'm new to this specific issue, so I'll throw out some thoughts that may have been discussed before. I believe the existing subtleties of the locking are so nuanced, there's a very good reason they haven't been changed upstream -- it's not laziness. Personally, I don't have my head wrapped around this anymore, though I tried in the past and realized it was really nasty (admittedly years ago so it may have improved ... but probably not). When one has a full understanding of the implications of the locking, it's near trivial to decide on what solution makes the most sense. Perhaps it's even what's there. But I would estimate that's a difficult understanding to achieve in the first place (not impossible, just a trade-off of resources).

Since most claimtrie accesses are made under existing locks and avoid races, it seems to me that adding additional locks won't serve much purpose. If the goal instead is to remove/replace existing locks, see the above thoughts. So maybe I just don't understand what the issue is, or what is being attempted?

But if I do, my vote is to wait until upstream resolves this a bit more and instead explore alternative approaches in the meantime (like load-balancing across multiple daemons, caching, or whatever makes sense for what's really needed).

tiger5226 commented 6 years ago

All clear! Thanks a ton for the added thoughts, very helpful. I was able to work around this bottleneck rather easily by making a fraction of the calls. I am not sure anyone else is suffering from this issue, hence I don't believe there is any urgency.

@lyoshenka This makes sense and we probably don't want to take on the risk. Thoughts?

BrannonKing commented 6 years ago

I like the suggestion for load balancing daemons. I think a Docker Swarm or Kubernetes approach would work very well. Also, for the situation where you need to make multiple RPC calls for a single piece of data -- it makes a lot of sense to make a new RPC call that returns all the data that you need in a single pull. I think those are both great workarounds.

I did want to understand how recursive the cs_main mutex really was. I modified it to not be recursive and made the recursion checks at the individual locations. You can see my work here: https://github.com/BrannonKing/lbrycrd/tree/testNonRecursiveCsMain . I didn't finish the exercise; the daemon does not run, but it's close. Most of the tests run. Conclusively, almost all lock calls are recursive. As an aside, there is a unit test failure that seems legitimate; I left a comment in checkqueue.h about that. Also, see my fix in claimtrie_tests.cpp.

With that exercise I discovered that most of the recursive lock calls come from the unit test scenarios more than the actual code. At the same time, this library has no definitive public API. If any method is an entry point, who is responsible for the necessary preconditioning of each method?
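
For context, the non-recursive convention looks roughly like this (an assumed pattern with hypothetical helper names, not the branch's exact code): the public entry point takes the lock once, and internal helpers only assert the precondition instead of re-locking.

// Assumes the usual lbrycrd/bitcoin headers (sync.h for LOCK/AssertLockHeld, claimtrie.h, etc.).
static bool GetClaimsForNameInternal(const std::string& name, std::vector<CClaimValue>& out)
{
    AssertLockHeld(cs_main); // precondition check instead of a nested LOCK
    // ... read from pclaimTrie into out ...
    return true;
}

bool GetClaimsForName(const std::string& name, std::vector<CClaimValue>& out)
{
    LOCK(cs_main); // the only place the lock is acquired on this path
    return GetClaimsForNameInternal(name, out);
}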

I still intend to run an exercise using a shared recursive lock.

BrannonKing commented 6 years ago

I did make a branch with a recursive shared mutex for cs_main: https://github.com/BrannonKing/lbrycrd/tree/use_recursive_shared_lock. It compiles and runs successfully using some code I acquired and extended from CodeReview. I've been trying to benchmark it using the apache benchmark (ab) call shown above. I'm not sure how much data I would have to put in to make the trie lookup take any significant time. Can I just run the CLI generate command for that? I've added 10k, but the lookup still takes hardly any time. Or does somebody with a lot of data want to run my branch?
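
For anyone curious what a recursive shared mutex involves, here is a much-simplified sketch (this is not the code in that branch; only the exclusive side is re-entrant, writers can be starved by readers, and a writer taking a shared lock on its own thread is not handled):

#include <condition_variable>
#include <mutex>
#include <thread>

class RecursiveSharedMutex
{
    std::mutex m;
    std::condition_variable cv;
    std::thread::id writer;   // thread holding exclusive access, if any
    unsigned writeDepth = 0;  // recursion depth of the exclusive owner
    unsigned readers = 0;     // threads holding shared access

public:
    void lock() // exclusive, recursive for the owning thread
    {
        std::unique_lock<std::mutex> g(m);
        if (writeDepth > 0 && writer == std::this_thread::get_id()) { ++writeDepth; return; }
        cv.wait(g, [this] { return writeDepth == 0 && readers == 0; });
        writer = std::this_thread::get_id();
        writeDepth = 1;
    }
    void unlock()
    {
        std::lock_guard<std::mutex> g(m);
        if (--writeDepth == 0) { writer = std::thread::id(); cv.notify_all(); }
    }
    void lock_shared() // shared, not recursive
    {
        std::unique_lock<std::mutex> g(m);
        cv.wait(g, [this] { return writeDepth == 0; });
        ++readers;
    }
    void unlock_shared()
    {
        std::lock_guard<std::mutex> g(m);
        if (--readers == 0) cv.notify_all();
    }
};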

kaykurokawa commented 6 years ago

I'm not sure we should merge this, but I believe @tiger5226 should at least try this branch ( https://github.com/BrannonKing/lbrycrd/tree/use_recursive_shared_lock ) and let us know if he sees any improvements and whether it works properly for his chainquery application.

tiger5226 commented 6 years ago

@BrannonKing I would not mind testing it out with the chainquery branch. However, there is another issue blocking me from building lbrycrd on my mac pro. If you can get me the binary I can test it out.

kauffj commented 6 years ago

@kaykurokawa maybe it's time to sort out that mac build issue?

bvbfan commented 6 years ago

The code written by @BrannonKing looks good to me; however, a recursive shared lock is a mess as a design. I would protect only the shared resources, e.g. like this:

template <typename T, typename M = CCriticalSection>
class CCProtector {
    T *data;
    M *mutex;
    // RAII proxy: locks on construction, unlocks on destruction, so the mutex
    // is held for exactly the duration of one member call on the protected object.
    struct CCProxy {
        T *data;
        M *mutex;
        CCProxy(T *d, M *m) : data(d), mutex(m)
        {
            mutex->lock();
        }
        ~CCProxy()
        {
            mutex->unlock();
        }
        T* operator->()
        {
            return data;
        }
    };
public:
    CCProtector(T *d, M *m) : data(d), mutex(m) {}
    CCProxy operator->()
    {
        return CCProxy(data, mutex);
    }
};
// usage
CCProtector<CClaimTrie> claim_guard(pclaimTrie, &cs_main);
claim_guard->flattenTrie(); // equivalent to: mutex->lock(); flattenTrie(); mutex->unlock();
BrannonKing commented 6 years ago

@bvbfan, if you look through the code, particularly in the main header, you'll see that many of the mutexes protect multiple data structures. That's quite unfortunate -- it makes it impossible to do as you suggest and make the protection explicit in code. I would prefer to take things a step further: use concurrent or immutable data structures. In other words, fully encapsulate the mutexes into the structures that rely on them. However, it seems that the current goal is to keep the codebase close to the original bitcoin fork so that newer bitcoin codebase fixes can be merged in. (Hence, we're free to use better structures in the LBRY portion of the code, and should work to keep that as untangled from the bitcoin codebase as possible.)

bvbfan commented 6 years ago

The problem with readers/writers is that you can't be sure whether a given thread is a reader or a writer. For example, CCoinsViewCache::AccessCoins, through CCoinsViewCache::FetchCoins, can potentially write to cacheCoins; if that happens while another thread is reading it, that's a data race, right? The right approach should be: access the structure through a mutex, and access elements by copying.
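
A minimal sketch of that "access through the mutex, copy the element out" idea (illustrative only; getInfoForName is assumed to be the relevant accessor, and this is not a proposed patch):

// Assumes the usual lbrycrd headers (sync.h, claimtrie.h) and the pclaimTrie global.
CClaimValue GetClaimSnapshot(const std::string& name)
{
    LOCK(cs_main); // or a shared lock, once one exists
    CClaimValue claim;
    pclaimTrie->getInfoForName(name, claim); // copy the element out while the lock is held
    return claim; // the caller works on its own copy, so no reference escapes the lock
}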

BrannonKing commented 6 years ago

Waiting on #174

BrannonKing commented 4 years ago

While it's true that using a custom read/write recursive mutex was workable, it was a fairly large enhancement on top of existing code, and it wasn't clear that it was a maintainable path going forward. With much of the data now easily accessible via SQLite, this seems less valuable. It could be revisited if and when upstream no longer uses a recursive mutex for cs_main.