near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0
2.33k stars 628 forks source link

Investigate the performance of the history replay tool #5707

Open bowenwang1996 opened 2 years ago

bowenwang1996 commented 2 years ago

According to https://github.com/near/nearcore/pull/5697#issue-1073190158, we can only play back 10 blocks per second even with multiple threads. This does not align with our expectations and it blocks the contract runtime team from validating the wasmer2 and wasmer0 give the exact same results on the entire mainnet history, which would allow us to completely remove wasmer0 support.

@olonho did some initial investigation and found that we spend a suspicious amount of time in kernel (see the profile below) ![image](https://user-images.githubusercontent.com/13601726/145128257-4b3de305-1c1f-4f02-84f0-cd857eff4020.png)

bowenwang1996 commented 2 years ago

@olonho do you want to investigate this yourself?

olonho commented 2 years ago

I'm looking on that, but second pair of eyes wouldn't hurt.

olonho commented 2 years ago

Curious stats from about a day of playback:

Processed 10000 blocks, 9 blocks per second, 4357195 secs remaining
Processed 20000 blocks, 10 blocks per second, 3920475 secs remaining
Processed 30000 blocks, 12 blocks per second, 3266229 secs remaining
Processed 40000 blocks, 12 blocks per second, 3265396 secs remaining
Processed 50000 blocks, 13 blocks per second, 3013442 secs remaining
Processed 60000 blocks, 9 blocks per second, 4351639 secs remaining
Processed 70000 blocks, 9 blocks per second, 4350528 secs remaining
Processed 80000 blocks, 10 blocks per second, 3914475 secs remaining
Processed 90000 blocks, 9 blocks per second, 4348306 secs remaining
Processed 100000 blocks, 9 blocks per second, 4347195 secs remaining
Processed 110000 blocks, 9 blocks per second, 4346084 secs remaining
Processed 120000 blocks, 9 blocks per second, 4344973 secs remaining
Processed 130000 blocks, 9 blocks per second, 4343861 secs remaining
Processed 140000 blocks, 11 blocks per second, 3553159 secs remaining
Processed 150000 blocks, 9 blocks per second, 4341639 secs remaining
Processed 160000 blocks, 11 blocks per second, 3551341 secs remaining
Processed 170000 blocks, 10 blocks per second, 3905475 secs remaining
Processed 180000 blocks, 10 blocks per second, 3904475 secs remaining
Processed 190000 blocks, 10 blocks per second, 3903475 secs remaining
Processed 200000 blocks, 10 blocks per second, 3902475 secs remaining
Processed 210000 blocks, 11 blocks per second, 3546796 secs remaining
Processed 220000 blocks, 10 blocks per second, 3900475 secs remaining
Processed 230000 blocks, 10 blocks per second, 3899475 secs remaining
Processed 240000 blocks, 8 blocks per second, 4873094 secs remaining
Processed 250000 blocks, 9 blocks per second, 4330528 secs remaining
Processed 260000 blocks, 13 blocks per second, 2997289 secs remaining
Processed 270000 blocks, 23 blocks per second, 1693685 secs remaining
Processed 280000 blocks, 34 blocks per second, 1145434 secs remaining
Processed 290000 blocks, 32 blocks per second, 1216711 secs remaining
Processed 300000 blocks, 15 blocks per second, 2594983 secs remaining
Processed 310000 blocks, 10 blocks per second, 3891475 secs remaining
Processed 320000 blocks, 9 blocks per second, 4322750 secs remaining
Processed 330000 blocks, 8 blocks per second, 4861844 secs remaining
Processed 340000 blocks, 7 blocks per second, 5554965 secs remaining
Processed 350000 blocks, 8 blocks per second, 4859344 secs remaining
Processed 360000 blocks, 8 blocks per second, 4858094 secs remaining
Processed 370000 blocks, 7 blocks per second, 5550679 secs remaining
Processed 380000 blocks, 11 blocks per second, 3531341 secs remaining
Processed 390000 blocks, 29 blocks per second, 1339129 secs remaining
Processed 400000 blocks, 30 blocks per second, 1294158 secs remaining
Processed 410000 blocks, 30 blocks per second, 1293825 secs remaining
Processed 420000 blocks, 28 blocks per second, 1385884 secs remaining
Processed 430000 blocks, 20 blocks per second, 1939737 secs remaining
Processed 440000 blocks, 9 blocks per second, 4309417 secs remaining
Processed 450000 blocks, 10 blocks per second, 3877475 secs remaining
Processed 460000 blocks, 9 blocks per second, 4307195 secs remaining
Processed 470000 blocks, 8 blocks per second, 4844344 secs remaining
Processed 480000 blocks, 9 blocks per second, 4304973 secs remaining
Processed 490000 blocks, 9 blocks per second, 4303861 secs remaining
Processed 500000 blocks, 10 blocks per second, 3872475 secs remaining
Processed 510000 blocks, 9 blocks per second, 4301639 secs remaining
Processed 520000 blocks, 10 blocks per second, 3870475 secs remaining
Processed 530000 blocks, 10 blocks per second, 3869475 secs remaining
Processed 540000 blocks, 11 blocks per second, 3516796 secs remaining
Processed 550000 blocks, 10 blocks per second, 3867475 secs remaining
Processed 560000 blocks, 10 blocks per second, 3866475 secs remaining
Processed 570000 blocks, 10 blocks per second, 3865475 secs remaining
Processed 580000 blocks, 10 blocks per second, 3864475 secs remaining
Processed 590000 blocks, 10 blocks per second, 3863475 secs remaining
Processed 600000 blocks, 10 blocks per second, 3862475 secs remaining
Processed 610000 blocks, 10 blocks per second, 3861475 secs remaining
Processed 620000 blocks, 10 blocks per second, 3860475 secs remaining
Processed 630000 blocks, 9 blocks per second, 4288306 secs remaining
Processed 640000 blocks, 12 blocks per second, 3215396 secs remaining
Processed 650000 blocks, 12 blocks per second, 3214563 secs remaining
Processed 660000 blocks, 10 blocks per second, 3856475 secs remaining
Processed 670000 blocks, 9 blocks per second, 4283861 secs remaining
Processed 680000 blocks, 13 blocks per second, 2964981 secs remaining
Processed 690000 blocks, 10 blocks per second, 3853475 secs remaining
Processed 700000 blocks, 10 blocks per second, 3852475 secs remaining
Processed 710000 blocks, 10 blocks per second, 3851475 secs remaining
Processed 720000 blocks, 10 blocks per second, 3850475 secs remaining
Processed 730000 blocks, 11 blocks per second, 3499523 secs remaining
Processed 740000 blocks, 11 blocks per second, 3498614 secs remaining
Processed 750000 blocks, 11 blocks per second, 3497705 secs remaining
Processed 760000 blocks, 12 blocks per second, 3205396 secs remaining
Processed 770000 blocks, 10 blocks per second, 3845475 secs remaining
Processed 780000 blocks, 10 blocks per second, 3844475 secs remaining
Processed 790000 blocks, 10 blocks per second, 3843475 secs remaining
Processed 800000 blocks, 12 blocks per second, 3202063 secs remaining
Processed 810000 blocks, 11 blocks per second, 3492250 secs remaining
Processed 820000 blocks, 10 blocks per second, 3840475 secs remaining
Processed 830000 blocks, 11 blocks per second, 3490432 secs remaining
Processed 840000 blocks, 11 blocks per second, 3489523 secs remaining
Processed 850000 blocks, 11 blocks per second, 3488614 secs remaining
Processed 860000 blocks, 10 blocks per second, 3836475 secs remaining
Processed 870000 blocks, 10 blocks per second, 3835475 secs remaining
Processed 880000 blocks, 11 blocks per second, 3485887 secs remaining
Processed 890000 blocks, 11 blocks per second, 3484977 secs remaining
Processed 900000 blocks, 11 blocks per second, 3484068 secs remaining
Processed 910000 blocks, 10 blocks per second, 3831475 secs remaining
Processed 920000 blocks, 10 blocks per second, 3830475 secs remaining

So we see that playback speed is not uniform and sometimes reaches 305 blocks per second.

olonho commented 2 years ago

Pure neard top looks like this:

[.] rocksdb::crc32c::crc32c_3way                                                                                ◆
[.] snappy::RawUncompress                                                                                       ▒
[.] std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std▒
[.] sha2::sha256::compress256                                                                                   ▒
[.] rocksdb::IndexBlockIter::SeekImpl                                                                           ▒
[.] rocksdb::ReadProperties                                                                                     ▒
[.] rocksdb::LRUHandleTable::FindPointer                                                                        ▒
[.] rocksdb::LRUCacheShard::LRU_Remove                                                                          ▒
[.] std::_Sp_counted_ptr<rocksdb::TableProperties*, (__gnu_cxx::_Lock_policy)2>::_M_dispose                     ▒
[.] rocksdb::BlockBasedTable::~BlockBasedTable                                                                  ▒
[.] rocksdb::BlockFetcher::ReadBlockContents                                                                    ▒
[.] _rjem_sdallocx                                                                                              ▒
[.] <FN as wasmer_runtime_core_near::typed_func::HostFunction<wasmer_runtime_core_near::typed_func::ExplicitVmCt▒
[.] rocksdb::LRUCacheShard::LRU_Insert                                                                          ▒
[.] _rjem_malloc                                                                                                ▒
[.] std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std▒
[.] rocksdb::Version::Get                                                                                       ▒
[.] rocksdb::ShardedCache::Lookup                                                                               ▒
[.] rocksdb::TableCache::Get                                                                                    ▒
[.] std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<s▒
[.] rocksdb::BlockBasedTable::Get                                                                               ▒
[.] bs58::encode::encode_into                                                                                   ▒
[.] std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std▒
[.] rocksdb::BlockBasedFilterBlockReader::ParseFieldsFromBlock                                                  ▒
[.] rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>                                      ▒
[.] rocksdb::DataBlockIter::SeekImpl                                                                            ▒
[.] std::__lower_bound<rocksdb::FdWithKeyRange*, rocksdb::Slice, __gnu_cxx::__ops::_Iter_comp_val<rocksdb::(anon▒
[.] <alloc::vec::Vec<T,A> as core::clone::Clone>::clone                                                         ▒
[.] rocksdb::BlockBasedFilterBlockReader::MayMatch                                     
olonho commented 2 years ago

Kernel top looks like this

32.39%  [k] copy_user_enhanced_fast_string                                                                    ◆
   3.08%  [k] find_get_pages_contig                                                                             ▒
   2.66%  [k] __lock_text_start                                                                                 ▒
   2.51%  [k] generic_file_buffered_read                                                                        ▒
   1.01%  [k] finish_task_switch                                                                                ▒
   0.92%  [k] __softirqentry_text_start                                                                         ▒
   0.61%  [k] iowrite16                                                                                         ▒
   0.47%  [k] copy_page_to_iter                                                                                 ▒
   0.34%  [k] xas_load                                                                                          ▒
   0.34%  [k] syscall_enter_from_user_mode                                                                      ▒
   0.26%  [k] __add_to_page_cache_locked                                                                        ▒
   0.25%  [k] copyout                                                                                           ▒
   0.23%  [k] mark_page_accessed                                                                                ▒
   0.21%  [k] kmem_cache_alloc                                                                                  ▒
   0.21%  [k] kmem_cache_free                                    

copy_user_enhanced_fast_string looks fishy, as it's just this guy https://elixir.bootlin.com/linux/v4.2/source/arch/x86/lib/copy_user_64.S#L216 and not so many system calls ever pass arguments as zero-terminated strings at all.

olonho commented 2 years ago

strace -fp <neard-pid> shows

[pid 371451] readahead(292, 67685523, 965 <unfinished ...>
[pid 371453] fstat(387,  <unfinished ...>
[pid 371451] <... readahead resumed>)   = 0
[pid 371453] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=48871308, ...}) = 0
[pid 371451] pread64(292,  <unfinished ...>
[pid 371453] ioctl(387, FS_IOC_GETVERSION <unfinished ...>
[pid 371451] <... pread64 resumed>"\1\314\237\243 R\303\270\237 \315\340\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67686435) = 53
[pid 371453] <... ioctl resumed>, 0x7f38203f7cb8) = 0
[pid 371451] fstat(292,  <unfinished ...>
[pid 371453] pread64(387,  <unfinished ...>
[pid 371451] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=67686488, ...}) = 0
[pid 371453] <... pread64 resumed>"\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 48871168) = 87
[pid 371447] <... pread64 resumed>"\273~\230\0(\332\1ZST\343\211'{\10\235B\231\177\177\234A\224\22:\203r\374UJ\v\320"..., 13028, 50678818) = 13028
[pid 371453] pread64(387,  <unfinished ...>
[pid 371451] ioctl(292, FS_IOC_GETVERSION <unfinished ...>
[pid 371453] <... pread64 resumed>"\0$\4rocksdb.block.based.table.ind"..., 814, 48870354) = 814
[pid 371451] <... ioctl resumed>, 0x7f3821870ac8) = 0
[pid 371447] futex(0x7f3824c4119c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371452] <... futex resumed>)       = 0
[pid 371451] <... pread64 resumed>"\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67686348) = 87
[pid 371447] <... futex resumed>)       = 1
[pid 371453] close(307 <unfinished ...>
[pid 371452] pread64(145,  <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371447] futex(0x7f3824c4119c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 371453] <... close resumed>)       = 0
[pid 371451] <... pread64 resumed>"\0$\4rocksdb.block.based.table.ind"..., 823, 67685525) = 823
[pid 371453] openat(AT_FDCWD, "/home/ubuntu/.near/data/3124467.sst", O_RDONLY|O_CLOEXEC) = 307
[pid 371453] fcntl(307, F_GETFD <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371453] <... fcntl resumed>)       = 0x1 (flags FD_CLOEXEC)
[pid 371453] fcntl(307, F_SETFD, FD_CLOEXEC) = 0
[pid 371452] <... pread64 resumed>"\320\335\265\3532\242z\310\340\241\322\v\211\7\26h\276\33-H-\4j\2321!\200\264Y\206\202\315"..., 624243, 67114489) = 624243
[pid 371451] <... pread64 resumed>"\301\215\6\10\0\r\0\r\1$\204\233\256G\250\0\276\177\0\f\r\21\1\22\f\336\303\177\2512\22\0"..., 61522, 67624003) = 61522
[pid 371453] fadvise64(307, 0, 0, POSIX_FADV_RANDOM) = 0
[pid 371453] readahead(307, 67326171, 955) = 0
[pid 371452] pread64(303,  <unfinished ...>
[pid 371453] pread64(307,  <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371452] <... pread64 resumed>"\0\3o\0342\0\206_\0\3o\35F\213_\320e\0\4o\0364m\340\304\1\304c\0\3o\37"..., 80904, 67663977) = 80904
[pid 371453] <... pread64 resumed>"\1\212\250\215 R\235\312\213 \271\327\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67327073) = 53
[pid 371453] fstat(307,  <unfinished ...>
[pid 371452] pread64(303,  <unfinished ...>
[pid 371453] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=67327126, ...}) = 0
[pid 371453] ioctl(307, FS_IOC_GETVERSION <unfinished ...>
[pid 371451] <... pread64 resumed>"\206Cg\347\232V\317!\nuQ\241\2032\30,E6\366\350\320\10xb#\32\213\215\226\244v4"..., 499904, 67124099) = 499904
[pid 371453] <... ioctl resumed>, 0x7f38203f7cb8) = 0
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] <... pread64 resumed>"\25\207t\251\366\264m\276\240\215!\225\345\207rnV\"\350\202\241\364r\270@\372\206A\30,@\177"..., 553720, 67110257) = 553720
[pid 371453] <... pread64 resumed>"\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67326986) = 87
[pid 371451] close(296)                 = 0
[pid 371452] openat(AT_FDCWD, "/home/ubuntu/.near/data/3168026.sst", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] <... openat resumed>)      = 296
[pid 371453] <... pread64 resumed>"\0$\4rocksdb.block.based.table.ind"..., 815, 67326171) = 815
[pid 371452] fcntl(296, F_GETFD)        = 0x1 (flags FD_CLOEXEC)
[pid 371452] fcntl(296, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] <... fcntl resumed>)       = 0
[pid 371453] <... pread64 resumed>"\0\4\202\230+<\0\315F\0\4\202\230>Y\322F\325E\0\4\202\230F\v\254\214\1\267\363\4\0"..., 27582, 67298589) = 27582
[pid 371452] fadvise64(296, 0, 0, POSIX_FADV_RANDOM <unfinished ...>
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] <... fadvise64 resumed>)   = 0
[pid 371452] readahead(296, 67208543, 524288 <unfinished ...>
[pid 371453] <... pread64 resumed>"A`SU\37\2523\177=.\345.\264\363\223\31\336\5\"s\323\202\210\375\267j#\320#F'\331"..., 182477, 67116112) = 182477
[pid 371452] <... readahead resumed>)   = 0
[pid 371453] close(367 <unfinished ...>
[pid 371452] pread64(296,  <unfinished ...>
[pid 371453] <... close resumed>)       = 0
[pid 371452] <... pread64 resumed>"\1\323\211\246 R\352\226\241 \263\354\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67732778) = 53
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] fstat(296, {st_mode=S_IFREG|0644, st_size=67732831, ...}) = 0
[pid 371452] ioctl(296, FS_IOC_GETVERSION, 0x7f38207f83c8) = 0
[pid 371452] pread64(296, "\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67732691) = 87
[pid 371452] pread64(296, "\0$\4rocksdb.block.based.table.ind"..., 817, 67731874) = 817
[pid 371452] pread64(296, "\0\3p\300\315\0\307e\0\4p\300\347#\314e\366d\0\4p\301\6z\307\312\1\275g\0\3p"..., 79416, 67652458) = 79416
[pid 371452] pread64(296,  <unfinished ...>
[pid 371451] <... pread64 resumed>"\0000\363\3\0\0\0\0\0\0\0\0\204\237\276b\371\315\274\306\233\3354e\373\354:\357\314\2333\242"..., 16033, 31419416) = 16033
[pid 371451] futex(0x7f3824c09540, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 371450] <... futex resumed>)       = 0
[pid 371452] <... pread64 resumed>"\2126\261\223\7\321\305\266*\6l8\367I%\21\201\315(\16\220\340\222\241A\305 \225:b\4\246"..., 537481, 67114977) = 537481
[pid 371450] futex(0x7f3824c09540, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 371451] openat(AT_FDCWD, "/home/ubuntu/.near/data/3222880.sst", O_RDONLY|O_CLOEXEC) = 367
[pid 371451] fcntl(367, F_GETFD <unfinished ...>
[pid 371452] close(170 <unfinished ...>
[pid 371451] <... fcntl resumed>)       = 0x1 (flags FD_CLOEXEC)
[pid 371452] <... close resumed>)       = 0
[pid 371451] fcntl(367, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 371452] pread64(296,  <unfinished ...>
[pid 371451] <... fcntl resumed>)       = 0
[pid 371451] fadvise64(367, 0, 0, POSIX_FADV_RANDOM) = 0
[pid 371451] readahead(367, 67665430, 965 <unfinished ...>
[pid 371452] <... pread64 resumed>"\2126\261\223\7\321\305\266*\6l8\367I%\21\201\315(\16\220\340\222\241A\305 \225:b\4\246"..., 537481, 67114977) = 537481
[pid 371451] <... readahead resumed>)   = 0
[pid 371453] <... pread64 resumed>"\347\336\2\230\0000\371\20\202\234\351\325\271\353\30\311\240\2705\231\305\275WZ\272\365\266\312@H\371u"..., 36391, 4628635) = 36391
[pid 371451] pread64(367,  <unfinished ...>
[pid 371452] pread64(296,  <unfinished ...>
[pid 371451] <... pread64 resumed>"\1\317\202\242 R\317\233\236 \302\340\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67666342) = 53
[pid 371451] fstat(367, {st_mode=S_IFREG|0644, st_size=67666395, ...}) = 0
[pid 371453] futex(0x7f3824c4119c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 371451] ioctl(367, FS_IOC_GETVERSION, 0x7f3821870fe8) = 0
[pid 371451] pread64(367, "\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67666255) = 87
[pid 371451] pread64(367, "\0$\4rocksdb.block.based.table.ind"..., 825, 67665430) = 825
[pid 371451] close(328)                 = 0
[pid 371451] openat(AT_FDCWD, "/home/ubuntu/.near/data/3217771.sst", O_RDONLY|O_CLOEXEC) = 170
[pid 371451] fcntl(170, F_GETFD)        = 0x1 (flags FD_CLOEXEC)
[pid 371451] fcntl(170, F_SETFD, FD_CLOEXEC) = 0
[pid 371451] fadvise64(170, 0, 0, POSIX_FADV_RANDOM) = 0
[pid 371451] readahead(170, 67673586, 965) = 0
[pid 371451] pread64(170, "\1\253\302\242 R\233\327\236 \322\344\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67674498) = 53
[pid 371452] <... pread64 resumed>"\363\177\230\0(\332\1rW<\27-\217\262\277\257\313X\365\31\374\307\4*\206\4\301\375\376\240\243\4"..., 13312, 50099432) = 13312
[pid 371451] fstat(170, {st_mode=S_IFREG|0644, st_size=67674551, ...}) = 0
[pid 371451] ioctl(170, FS_IOC_GETVERSION <unfinished ...>
[pid 371452] futex(0x7f3824c4119c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 371451] <... ioctl resumed>, 0x7f3821870fe8) = 0
[pid 371451] pread64(170,  <unfinished ...>
[pid 371452] <... futex resumed>)       = 1
[pid 371447] <... futex resumed>)       = 0
[pid 371451] <... pread64 resumed>"\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67674411) = 87
[pid 371451] pread64(170,  <unfinished ...>
[pid 371447] futex(0x7f3824c4119c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 371451] <... pread64 resumed>"\0$\4rocksdb.block.based.table.ind"..., 825, 67673586) = 825
[pid 371452] pread64(428,  <unfinished ...>
[pid 371451] close(295 <unfinished ...>
[pid 371447] <... futex resumed>)       = 1
[pid 371453] <... futex resumed>)       = 0
[pid 371451] <... close resumed>)       = 0
[pid 371447] futex(0x7f3824c4119c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 371451] openat(AT_FDCWD, "/home/ubuntu/.near/data/3139501.sst", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 371453] pread64(145,  <unfinished ...>

so seems openat is an expensive guy.

olonho commented 2 years ago

Interesting, when using Dwarf based stack unwinder profile looks way saner:

+   98.46%     0.00%  neard    neard                [.] rayon_core::join::join_context::{{closure}}             ◆
+   98.45%     0.09%  neard    neard                [.] state_viewer::apply_chain_range::apply_chain_range::{{cl▒
+   94.20%     0.08%  neard    neard                [.] <near_store::db::RocksDB as near_store::db::Database>::g▒
+   93.80%     0.08%  neard    neard                [.] rocksdb_get_pinned_cf                                   ▒
+   93.73%     0.01%  neard    neard                [.] rocksdb::DBImpl::Get                                    ▒
+   93.72%     0.11%  neard    neard                [.] rocksdb::DBImpl::GetImpl                                ▒
+   93.32%     0.19%  neard    neard                [.] rocksdb::Version::Get                                   ▒
+   92.72%     0.18%  neard    neard                [.] rocksdb::TableCache::Get                                ▒
+   84.44%     0.00%  neard    neard                [.] <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Jo▒
+   68.47%     0.31%  neard    neard                [.] rocksdb::BlockFetcher::ReadBlockContents                ▒
+   56.83%     0.03%  neard    neard                [.] rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>▒
+   56.79%     0.02%  neard    neard                [.] rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::BlockCo▒
-   56.77%     0.10%  neard    neard                [.] rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<r▒
   - 56.67% rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::BlockContents>                      ▒
      + 54.86% rocksdb::BlockFetcher::ReadBlockContents                                                         ▒
      + 1.44% rocksdb::BlockBasedTable::PutDataBlockToCache<rocksdb::BlockContents>                             ▒
+   54.78%     0.18%  neard    neard                [.] rocksdb::BlockBasedTable::Get                           ▒
+   50.36%     0.00%  neard    [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe                          ▒
+   50.24%     0.02%  neard    [kernel.kallsyms]    [k] do_syscall_64                                           ▒
+   48.27%     0.14%  neard    neard                [.] rocksdb::RandomAccessFileReader::Read                   ▒
+   48.10%     0.10%  neard    neard                [.] rocksdb::PosixRandomAccessFile::Read                    ▒
+   47.99%     0.00%  neard    libpthread-2.31.so   [.] __libc_pread (inlined)                                  ▒
+   47.17%     0.03%  neard    [kernel.kallsyms]    [k] __x64_sys_pread64                                       ▒
+   47.11%     0.01%  neard    [kernel.kallsyms]    [k] ksys_pread64                                            ▒
+   46.96%     0.03%  neard    [kernel.kallsyms]    [k] vfs_read                                                ▒
+   46.71%     0.08%  neard    [kernel.kallsyms]    [k] new_sync_read                                           ▒
+   46.64%     0.06%  neard    [kernel.kallsyms]    [k] ext4_file_read_iter                                     ▒
+   46.58%     0.02%  neard    [kernel.kallsyms]    [k] generic_file_read_iter                                  ▒
+   46.49%     2.50%  neard    [kernel.kallsyms]    [k] generic_file_buffered_read                              ▒
+   42.35%     0.14%  neard    neard                [.] rocksdb::BlockBasedFilterBlockReader::MayMatch          ▒
+   41.58%     0.05%  neard    neard                [.] rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>▒
+   37.43%     0.03%  neard    neard                [.] rocksdb::TableCache::FindTable                          ▒
+   35.51%     0.43%  neard    [kernel.kallsyms]    [k] copy_page_to_iter                                       

With dominators like

- 93.60% rocksdb::DBImpl::GetImpl                                                                                                  ▒
      - 93.32% rocksdb::Version::Get                                                                                                  ▒
         - 92.72% rocksdb::TableCache::Get                                                                                            ▒
            - 54.78% rocksdb::BlockBasedTable::Get                                                                                    ▒
               - 42.35% rocksdb::BlockBasedFilterBlockReader::MayMatch                                                                ▒
                  - 41.58% rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>::GetOrReadFilterBlock                             ▒
                     - 41.52% rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>::ReadFilterBlock                               ▒
                        - rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::BlockContents>                                             ▒
                           - rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::BlockContents>                           ▒
                              - 40.81% rocksdb::BlockFetcher::ReadBlockContents                                                       ▒
                                 - 28.43% rocksdb::RandomAccessFileReader::Read                                                       ▒
                                    - 28.41% rocksdb::PosixRandomAccessFile::Read                                                     ▒
                                       - __libc_pread (inlined)                                                                       ▒
                                          - 28.28% entry_SYSCALL_64_after_hwframe                                                     ▒
                                             - do_syscall_64                                                                          ▒
                                                - 28.25% __x64_sys_pread64                                                            ▒
                                                   - 28.24% ksys_pread64                                                              ▒
                                                      - 28.17% vfs_read                                                               ▒
                                                         - 28.12% new_sync_read                                                       ▒
                                                            - ext4_file_read_iter                                                     ▒
                                                               - generic_file_read_iter                                               ▒
                                                                  - 28.05% generic_file_buffered_read                                 ▒
                                                                     - 23.28% copy_page_to_iter                                       ▒
                                                                          22.81% copy_user_enhanced_fast_string                       ▒
                                                                     - 2.46% generic_file_buffered_read_get_pages                     ▒
                                                                          2.32% find_get_pages_contig                                 ▒
                                 - 11.23% rocksdb::VerifyBlockChecksum                                                                ▒
                                      11.21% rocksdb::crc32c::crc32c_3way                                                             ▒
                                 - 1.06% operator new                                                                                 ▒
                                      __libc_malloc (inlined)                                                                         ▒
               - 9.03% rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>                                         ▒
                  - 8.97% rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block>                                                     ▒
                     - rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>                                         ▒
                        - 8.40% rocksdb::BlockFetcher::ReadBlockContents                                                              ▒
                           - 6.04% rocksdb::RandomAccessFileReader::Read                                                              ▒
                              - 5.97% rocksdb::PosixRandomAccessFile::Read                                                            ▒
                                 - 5.90% __libc_pread (inlined)                                                                       ▒
                                    - 5.70% entry_SYSCALL_64_after_hwframe                                                            ▒
                                       - 5.69% do_syscall_64                                                                          ▒
                                          - 5.66% __x64_sys_pread64                                                                   ▒
                                             - 5.64% ksys_pread64                                                                     ▒
                                                - 5.60% vfs_read                                                                      ▒
                                                   - 5.54% new_sync_read                                                              ▒
                                                      - ext4_file_read_iter                                                           ▒
                                                         - generic_file_read_iter                                                     ▒
                                                            - 5.49% generic_file_buffered_read                                        ▒
                                                               - 3.75% generic_file_buffered_read_get_pages                           ▒
                                                                  - 3.11% page_cache_sync_ra                                          ▒
                                                                     - force_page_cache_ra                                            ▒
                                                                        - do_page_cache_ra                                            ▒
                                                                           - 3.04% page_cache_ra_unbounded                            ▒
                                                                              - 2.06% read_pages                                      ▒
                                                                                 - 1.24% blk_finish_plug                              ▒
                                                                                    - 1.23% blk_flush_plug_list                       ▒
                                                                                       - blk_mq_flush_plug_list                       ▒
                                                                                          - blk_mq_sched_insert_requests               
olonho commented 2 years ago

copy_user_enhanced_fast_string is reached via

            - state_viewer::apply_chain_range::apply_chain_range::{{closure}}                                                         ▒
               - 12.70% <near_chain::store::ChainStore as near_chain::store::ChainStoreAccess>::get_block_hash_by_height              ▒
                    near_store::Store::get_ser                                                                                        ▒
                    <near_store::db::RocksDB as near_store::db::Database>::get                                                        ▒
                    rocksdb_get_pinned_cf                                                                                             ▒
                    rocksdb::DBImpl::Get                                                                                              ▒
                    rocksdb::DBImpl::GetImpl                                                                                          ▒
                    rocksdb::Version::Get                                                                                             ▒
                  - rocksdb::TableCache::Get                                                                                          ▒
                     - 12.24% rocksdb::BlockBasedTable::Get                                                                           ▒
                        - 12.05% rocksdb::BlockBasedFilterBlockReader::MayMatch                                                       ▒
                             rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>::GetOrReadFilterBlock                           ▒
                             rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>::ReadFilterBlock                                ▒
                             rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::BlockContents>                                          ▒
                             rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::BlockContents>                           ▒
                             rocksdb::BlockFetcher::ReadBlockContents                                                                 ▒
                             rocksdb::RandomAccessFileReader::Read                                                                    ▒
                             rocksdb::PosixRandomAccessFile::Read                                                                     ▒
                             __libc_pread (inlined)                                                                                   ▒
                             entry_SYSCALL_64_after_hwframe                                                                           ▒
                             do_syscall_64                                                                                            ▒
                             __x64_sys_pread64                                                                                        ▒
                             ksys_pread64                                                                                             ▒
                             vfs_read                                                                                                 ▒
                             new_sync_read                                                                                            ▒
                             ext4_file_read_iter                                                                                      ▒
                             generic_file_read_iter                                                                                   ▒
                             generic_file_buffered_read                                                                               ▒
                             copy_page_to_iter                                                                                        ▒
                             copy_user_enhanced_fast_string                   
bowenwang1996 commented 2 years ago

@olonho so basically it is within rocksdb?

olonho commented 2 years ago

Yes, seems so

olonho commented 2 years ago

Further investigations show pretty mediocre IO performance on the system. Namely iotop tool shows the following:

Total DISK READ:        11.47 M/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:      11.47 M/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                               
 378500 be/4 ubuntu      2.31 M/s    0.00 B/s  0.00 % 10.75 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378501 be/4 ubuntu      2.02 M/s    0.00 B/s  0.00 %  8.78 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378503 be/4 ubuntu   1897.52 K/s    0.00 B/s  0.00 %  8.73 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378505 be/4 ubuntu   1670.11 K/s    0.00 B/s  0.00 %  7.43 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378504 be/4 ubuntu   1267.50 K/s    0.00 B/s  0.00 %  5.61 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378502 be/4 ubuntu   1356.97 K/s    0.00 B/s  0.00 %  5.25 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378499 be/4 ubuntu    935.71 K/s    0.00 B/s  0.00 %  3.47 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378506 be/4 ubuntu    182.67 K/s    0.00 B/s  0.00 %  0.58 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
      1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % systemd --system --deserialize 27
      2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]

so we generally get 12M/sec disk drive performance. Compared to https://www.anandtech.com/show/7173/samsung-ssd-840-evo-review-120gb-250gb-500gb-750gb-1tb-models-tested/8 it hints that we get performance 3x worse than slowest of SSD drives in this list. Just for reference compared that to what Samsung T5 external SSD in my box can give and get results around 150M/sec (queue depth 64, using AmorphousDiskMark), or 15-20M/sec with queue depth 1. Queue depth of 8 shown 118M/sec. Internal SSD shows beefy 331M/sec.

olonho commented 2 years ago

https://cloud.google.com/compute/docs/disks/performance suggests that for GCP instance expected read throughput per instance is between 200MByte/sec and 9GByte/sec.

olonho commented 2 years ago

And some nodes (archive backup) shows actual peak read performance around 500Mb/sec per gcloud console.

bowenwang1996 commented 2 years ago

Interesting. @olonho do you think the issue is specific to the instance you use to run the history playback tool? Is there some disk IO benchmark we can run to see whether the instance itself is problematic?

janewang commented 2 years ago

Moving to @marcelo-gonzalez for the OKR ownership transfer

bowenwang1996 commented 2 years ago

Moving to @marcelo-gonzalez for the OKR ownership transfer

I don't think this is in the OKR?