When syncing neutrino on a new testnet node with pprof enabled, I noticed the following memory blowup when downloading block headers in the addHeaders function:
ROUTINE ======================== github.com/lightninglabs/neutrino/headerfs.(*headerIndex).addHeaders.func1 in /Users/nsa/go/pkg/mod/github.com/lightninglabs/neutrino@v0.11.0/headerfs/index.go
5.50MB 11.42GB (flat, cum) 68.15% of Total
. . 145: // In order to ensure optimal write performance, we'll ensure that the
. . 146: // items are sorted by their hash before insertion into the database.
. . 147: sort.Sort(batch)
. . 148:
. . 149: return walletdb.Update(h.db, func(tx walletdb.ReadWriteTx) error {
. 512.02kB 150: rootBucket := tx.ReadWriteBucket(indexBucket)
. . 151:
. . 152: var tipKey []byte
. . 153:
. . 154: // Based on the specified index type of this instance of the
. . 155: // index, we'll grab the key that tracks the tip of the chain
. . 156: // so we can update the index once all the header entries have
. . 157: // been updated.
. . 158: // TODO(roasbeef): only need block tip?
. . 159: switch h.indexType {
. . 160: case Block:
. . 161: tipKey = bitcoinTip
. . 162: case RegularFilter:
. . 163: tipKey = regFilterTip
. . 164: default:
. . 165: return fmt.Errorf("unknown index type: %v", h.indexType)
. . 166: }
. . 167:
. . 168: var (
. . 169: chainTipHash chainhash.Hash
. . 170: chainTipHeight uint32
. . 171: )
. . 172:
. . 173: for _, header := range batch {
5.50MB 5.50MB 174: var heightBytes [4]byte
. . 175: binary.BigEndian.PutUint32(heightBytes[:], header.height)
. 11.41GB 176: err := rootBucket.Put(header.hash[:], heightBytes[:])
. . 177: if err != nil {
. . 178: return err
. . 179: }
. . 180:
. . 181: // TODO(roasbeef): need to remedy if side-chain
. . 182: // tracking added
. . 183: if header.height >= chainTipHeight {
. . 184: chainTipHash = header.hash
. . 185: chainTipHeight = header.height
. . 186: }
. . 187: }
. . 188:
. 4.05MB 189: return rootBucket.Put(tipKey, chainTipHash[:])
. . 190: })
. . 191:}
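For context, heap profiles like the ones in this issue can be pulled from Go's net/http/pprof endpoints. A minimal standalone sketch (not how lnd itself wires up its profiling option, and the port is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Once this is running, a heap profile can be grabbed with e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}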
Headers are 80 bytes each, so 80 bytes * 1.6M testnet blocks is roughly 128MB, which is far less than the 11.41GB allocated in bbolt's Put function. Digging deeper, I found that the root of the problem is that out-of-order writes allocate more memory than in-order writes (the keys here are block hashes, so the writes are effectively out of order), since out-of-order writes keep more of the B+ tree cached in memory. With a batch size of 2k, this extra memory becomes enormous.
I have a test case that demonstrates out-of-order writes and the effect of bigger batch sizes in my local lnd fork (https://github.com/Crypt-iQ/lnd/tree/channeldb_bench_mem_neutrino_blowup_0106). The test case uses a total of 1.5M sample headers. With a batch size of 2k, the test takes ~80 seconds and uses ~16.1GB:
(pprof) top
Showing nodes accounting for 15913.55MB, 98.33% of 16184.08MB total
Dropped 41 nodes (cum <= 80.92MB)
Showing top 10 nodes out of 18
flat flat% sum% cum cum%
9666.05MB 59.73% 59.73% 9666.05MB 59.73% github.com/coreos/bbolt.(*node).put
5431.23MB 33.56% 93.28% 5431.23MB 33.56% github.com/coreos/bbolt.(*node).read
293.47MB 1.81% 95.10% 5724.70MB 35.37% github.com/coreos/bbolt.(*Bucket).node
248.52MB 1.54% 96.63% 248.52MB 1.54% github.com/coreos/bbolt.(*Cursor).search
108.29MB 0.67% 97.30% 157.66MB 0.97% github.com/coreos/bbolt.(*Tx).allocate
85.54MB 0.53% 97.83% 85.54MB 0.53% github.com/coreos/bbolt.(*freelist).free
75.95MB 0.47% 98.30% 16181.02MB 100% github.com/lightningnetwork/lnd/channeldb.TestBucketMem
4.50MB 0.028% 98.33% 464.34MB 2.87% github.com/coreos/bbolt.(*node).spill
0 0% 98.33% 15485.13MB 95.68% github.com/coreos/bbolt.(*Bucket).Put
0 0% 98.33% 465.34MB 2.88% github.com/coreos/bbolt.(*Bucket).spill
A batch size of 100k takes ~10 seconds and uses ~2.7GB of memory:
(pprof) top
Showing nodes accounting for 2.63GB, 98.66% of 2.66GB total
Dropped 27 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 31
flat flat% sum% cum cum%
1.38GB 52.03% 52.03% 1.38GB 52.03% github.com/coreos/bbolt.(*node).put
0.66GB 24.70% 76.73% 0.66GB 24.70% github.com/coreos/bbolt.(*node).read
0.21GB 7.87% 84.61% 0.21GB 7.87% github.com/coreos/bbolt.(*Cursor).search
0.17GB 6.28% 90.89% 0.17GB 6.28% github.com/coreos/bbolt.Open.func1
0.07GB 2.68% 93.56% 2.66GB 99.87% github.com/lightningnetwork/lnd/channeldb.TestBucketMem
0.04GB 1.50% 95.07% 0.04GB 1.50% github.com/coreos/bbolt.cloneBytes
0.04GB 1.46% 96.53% 0.70GB 26.16% github.com/coreos/bbolt.(*Bucket).node
0.03GB 1.01% 97.54% 0.03GB 1.01% github.com/coreos/bbolt.(*freelist).addSpan
0.02GB 0.62% 98.17% 0.02GB 0.62% github.com/coreos/bbolt.(*freelist).free
0.01GB 0.5% 98.66% 0.02GB 0.63% github.com/coreos/bbolt.(*Tx).write
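To make the comparison concrete, here's roughly the shape of what the test exercises. This is a sketch, not the actual TestBucketMem code from the fork above; the db path, constants, and bucket name are made up for illustration, and the import path just mirrors the one in the profiles (newer code would use go.etcd.io/bbolt):

package main

import (
	"crypto/rand"
	"encoding/binary"
	"log"

	bolt "github.com/coreos/bbolt"
)

const (
	numHeaders = 1_500_000 // total sample headers, as in the test case
	batchSize  = 2_000     // try 100_000 to see the memory difference
)

func main() {
	db, err := bolt.Open("bench.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	bucketName := []byte("index")
	for start := uint32(0); start < numHeaders; start += batchSize {
		// One read-write transaction per batch, mirroring the
		// walletdb.Update call in addHeaders.
		err := db.Update(func(tx *bolt.Tx) error {
			b, err := tx.CreateBucketIfNotExists(bucketName)
			if err != nil {
				return err
			}
			for i := start; i < start+batchSize && i < numHeaders; i++ {
				// Random 32-byte keys stand in for block hashes,
				// so the inserts land all over the B+ tree.
				var key [32]byte
				if _, err := rand.Read(key[:]); err != nil {
					return err
				}
				var height [4]byte
				binary.BigEndian.PutUint32(height[:], i)
				if err := b.Put(key[:], height[:]); err != nil {
					return err
				}
			}
			return nil
		})
		if err != nil {
			log.Fatal(err)
		}
	}
}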
So it's clear that a bigger batch size will help restrain memory during the initial block header download (and should be faster!). There are some considerations here, like how many block headers can reasonably fit in memory before flushing them (100k may be too much, at ~8MB of raw header data). Another change I made to limit memory was setting the InitialMmapSize parameter to 1GB so that the mmap isn't continually resized as the db grows; resizing the mmap copies the entire B+ tree that's cached. There are some other optimizations, but they're probably out of scope here.
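A minimal sketch of the InitialMmapSize change, assuming direct access to the bbolt options at open time (in the actual code the database is opened through walletdb, so the option would need to be plumbed through there; the path and value are illustrative):

package main

import (
	"log"

	bolt "github.com/coreos/bbolt"
)

func main() {
	// Pre-size the mmap to 1GB so bbolt doesn't keep remapping it as
	// the file grows during header sync.
	opts := &bolt.Options{InitialMmapSize: 1 << 30}

	db, err := bolt.Open("neutrino.db", 0600, opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}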