When syncing neutrino on a new testnet node with pprof enabled, I noticed the following memory blowup when downloading block headers in the addHeaders function:
ROUTINE ======================== github.com/lightninglabs/neutrino/headerfs.(*headerIndex).addHeaders.func1 in /Users/nsa/go/pkg/mod/github.com/lightninglabs/neutrino@v0.11.0/headerfs/index.go
5.50MB 11.42GB (flat, cum) 68.15% of Total
. . 145: // In order to ensure optimal write performance, we'll ensure that the
. . 146: // items are sorted by their hash before insertion into the database.
. . 147: sort.Sort(batch)
. . 148:
. . 149: return walletdb.Update(h.db, func(tx walletdb.ReadWriteTx) error {
. 512.02kB 150: rootBucket := tx.ReadWriteBucket(indexBucket)
. . 151:
. . 152: var tipKey []byte
. . 153:
. . 154: // Based on the specified index type of this instance of the
. . 155: // index, we'll grab the key that tracks the tip of the chain
. . 156: // so we can update the index once all the header entries have
. . 157: // been updated.
. . 158: // TODO(roasbeef): only need block tip?
. . 159: switch h.indexType {
. . 160: case Block:
. . 161: tipKey = bitcoinTip
. . 162: case RegularFilter:
. . 163: tipKey = regFilterTip
. . 164: default:
. . 165: return fmt.Errorf("unknown index type: %v", h.indexType)
. . 166: }
. . 167:
. . 168: var (
. . 169: chainTipHash chainhash.Hash
. . 170: chainTipHeight uint32
. . 171: )
. . 172:
. . 173: for _, header := range batch {
5.50MB 5.50MB 174: var heightBytes [4]byte
. . 175: binary.BigEndian.PutUint32(heightBytes[:], header.height)
. 11.41GB 176: err := rootBucket.Put(header.hash[:], heightBytes[:])
. . 177: if err != nil {
. . 178: return err
. . 179: }
. . 180:
. . 181: // TODO(roasbeef): need to remedy if side-chain
. . 182: // tracking added
. . 183: if header.height >= chainTipHeight {
. . 184: chainTipHash = header.hash
. . 185: chainTipHeight = header.height
. . 186: }
. . 187: }
. . 188:
. 4.05MB 189: return rootBucket.Put(tipKey, chainTipHash[:])
. . 190: })
. . 191:}
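For context, heap profiles like the ones in this issue can be pulled from Go's net/http/pprof endpoints. A minimal standalone sketch (not how lnd itself wires up its profiling option, and the port is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Once this is running, a heap profile can be grabbed with e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}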
Headers are 80 bytes each, so 80 bytes * 1.6M testnet blocks is roughly 128MB, which is far less than the 11.41GB allocated in bbolt's Put function. Digging deeper, I found that the root of the problem is that out-of-order writes allocate more memory than in-order writes (the keys here are block hashes, so the writes are effectively out of order), since out-of-order writes keep more of the B+ tree cached in memory. With a batch size of 2k, this extra memory becomes enormous.
I have a test case that demonstrates out-of-order writes and the effect of bigger batch sizes in my local lnd fork (https://github.com/Crypt-iQ/lnd/tree/channeldb_bench_mem_neutrino_blowup_0106). The test case uses a total of 1.5M sample headers. With a batch size of 2k, the test takes ~80 seconds and uses ~16.1GB:
(pprof) top
Showing nodes accounting for 15913.55MB, 98.33% of 16184.08MB total
Dropped 41 nodes (cum <= 80.92MB)
Showing top 10 nodes out of 18
flat flat% sum% cum cum%
9666.05MB 59.73% 59.73% 9666.05MB 59.73% github.com/coreos/bbolt.(*node).put
5431.23MB 33.56% 93.28% 5431.23MB 33.56% github.com/coreos/bbolt.(*node).read
293.47MB 1.81% 95.10% 5724.70MB 35.37% github.com/coreos/bbolt.(*Bucket).node
248.52MB 1.54% 96.63% 248.52MB 1.54% github.com/coreos/bbolt.(*Cursor).search
108.29MB 0.67% 97.30% 157.66MB 0.97% github.com/coreos/bbolt.(*Tx).allocate
85.54MB 0.53% 97.83% 85.54MB 0.53% github.com/coreos/bbolt.(*freelist).free
75.95MB 0.47% 98.30% 16181.02MB 100% github.com/lightningnetwork/lnd/channeldb.TestBucketMem
4.50MB 0.028% 98.33% 464.34MB 2.87% github.com/coreos/bbolt.(*node).spill
0 0% 98.33% 15485.13MB 95.68% github.com/coreos/bbolt.(*Bucket).Put
0 0% 98.33% 465.34MB 2.88% github.com/coreos/bbolt.(*Bucket).spill
A batch size of 100k takes ~10 seconds and uses ~2.7GB of memory:
(pprof) top
Showing nodes accounting for 2.63GB, 98.66% of 2.66GB total
Dropped 27 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 31
flat flat% sum% cum cum%
1.38GB 52.03% 52.03% 1.38GB 52.03% github.com/coreos/bbolt.(*node).put
0.66GB 24.70% 76.73% 0.66GB 24.70% github.com/coreos/bbolt.(*node).read
0.21GB 7.87% 84.61% 0.21GB 7.87% github.com/coreos/bbolt.(*Cursor).search
0.17GB 6.28% 90.89% 0.17GB 6.28% github.com/coreos/bbolt.Open.func1
0.07GB 2.68% 93.56% 2.66GB 99.87% github.com/lightningnetwork/lnd/channeldb.TestBucketMem
0.04GB 1.50% 95.07% 0.04GB 1.50% github.com/coreos/bbolt.cloneBytes
0.04GB 1.46% 96.53% 0.70GB 26.16% github.com/coreos/bbolt.(*Bucket).node
0.03GB 1.01% 97.54% 0.03GB 1.01% github.com/coreos/bbolt.(*freelist).addSpan
0.02GB 0.62% 98.17% 0.02GB 0.62% github.com/coreos/bbolt.(*freelist).free
0.01GB 0.5% 98.66% 0.02GB 0.63% github.com/coreos/bbolt.(*Tx).write
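To make the comparison concrete, here's roughly the shape of what the test exercises. This is a sketch, not the actual TestBucketMem code from the fork above; the db path, constants, and bucket name are made up for illustration, and the import path just mirrors the one in the profiles (newer code would use go.etcd.io/bbolt):

package main

import (
	"crypto/rand"
	"encoding/binary"
	"log"

	bolt "github.com/coreos/bbolt"
)

const (
	numHeaders = 1_500_000 // total sample headers, as in the test case
	batchSize  = 2_000     // try 100_000 to see the memory difference
)

func main() {
	db, err := bolt.Open("bench.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	bucketName := []byte("index")
	for start := uint32(0); start < numHeaders; start += batchSize {
		// One read-write transaction per batch, mirroring the
		// walletdb.Update call in addHeaders.
		err := db.Update(func(tx *bolt.Tx) error {
			b, err := tx.CreateBucketIfNotExists(bucketName)
			if err != nil {
				return err
			}
			for i := start; i < start+batchSize && i < numHeaders; i++ {
				// Random 32-byte keys stand in for block hashes,
				// so the inserts land all over the B+ tree.
				var key [32]byte
				if _, err := rand.Read(key[:]); err != nil {
					return err
				}
				var height [4]byte
				binary.BigEndian.PutUint32(height[:], i)
				if err := b.Put(key[:], height[:]); err != nil {
					return err
				}
			}
			return nil
		})
		if err != nil {
			log.Fatal(err)
		}
	}
}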
So it's clear that a bigger batch size will help restrain memory during the initial block header download (and should be faster!). There are some considerations here, like how many block headers can reasonably fit in memory before flushing them (100k may be too much, at ~8MB of raw header data). Another change I made to limit memory was setting the InitialMmapSize parameter to 1GB so that the mmap isn't continually resized as the db grows; resizing the mmap copies the entire B+ tree that's cached. There are some other optimizations, but they're probably out of scope here.
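A minimal sketch of the InitialMmapSize change, assuming direct access to the bbolt options at open time (in the actual code the database is opened through walletdb, so the option would need to be plumbed through there; the path and value are illustrative):

package main

import (
	"log"

	bolt "github.com/coreos/bbolt"
)

func main() {
	// Pre-size the mmap to 1GB so bbolt doesn't keep remapping it as
	// the file grows during header sync.
	opts := &bolt.Options{InitialMmapSize: 1 << 30}

	db, err := bolt.Open("neutrino.db", 0600, opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}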