Closed hmusum closed 1 month ago
Note: The improvement needed has been known, but impact was not expected to be this large. Note that nodes eventually converge, but it takes longer than necessary.
I've printed out the debug log for filedistribution package on 3 config servers, and 1 container server, and sorted them in chronological order. I also added some comments about my understanding of what happened. Hope this analysis can help pinpoint the cause of this issue.
# I have 3 config servers node0 (172.19.0.7), node1 (172.19.0.8), node2 (172.19.0.2) and 1 container node4 (172.19.0.3) in this setup
# Session 49 was created on node0, after the session was active, node4 need to download two json files `50b8474089c54d6/marqo_index_settings.json` and `569b5a7c041bcb4a/marqo_index_settings_history.json` to reload the components
# In this scenario, node4 managed to download the first file immediately from node1 relayed to node0, but waited 20s to download the second file.
# In the worst scenario, it waited 20s for each document, taking 40s in total to download both of them
# session 49 being prepared on node0
1723647539.683213 node0.vespanet 16/113 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug Adding file /opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/49 (created 2024-08-14T14:58:59.078960009Z, modified 2024-08-14T14:58:59.310960009Z, size 4096)
1723647539.683317 node0.vespanet 16/113 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug Copying /opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/49 to /opt/vespa/var/db/vespa/filedistribution/writing5797371890501607843/49
1723647539.689365 node0.vespanet 16/113 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug Moving /opt/vespa/var/db/vespa/filedistribution/writing5797371890501607843 to /opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c
# node0 notify node1 and node2 to download new file references
1723647539.689777 node0.vespanet 16/113 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDistributionImpl debug Executing filedistribution.setFileReferencesToDownload against Connection { no socket, spec tcp/node1.vespanet:19070 }
1723647539.690007 node0.vespanet 16/113 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDistributionImpl debug Executing filedistribution.setFileReferencesToDownload against Connection { no socket, spec tcp/node2.vespanet:19070 }
# node1 tried to download the application package `bed71968b74fd6c/49` from node2 but failed since it does not exist on onde2
1723647539.698372 node1.vespanet 16/88 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Will download file 'bed71968b74fd6c' with timeout PT10S
1723647539.701792 node2.vespanet 16/94 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file 'bed71968b74fd6c'' from Connection { Socket[addr=/172.19.0.8,port=37508,localport=19070] }
1723647539.708345 node1.vespanet 16/241 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Request callback, OK. Req: request filedistribution.serveFile(bed71968b74fd6c,1,[zstd, gzip, lz4]): 1,File reference not found\nSpec: tcp/node2.vespanet:19070
1723647539.709023 node1.vespanet 16/241 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug file 'bed71968b74fd6c' not found or timed out (error code 1) at tcp/node2.vespanet:19070
# meanwhile, node2 downloaded the application package `bed71968b74fd6c/49` from node0
1723647539.703192 node2.vespanet 16/255 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug File reference 'bed71968b74fd6c' ('/opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c') does not exist.
1723647539.703245 node2.vespanet 16/255 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug File not found, will not download from another source
1723647539.701879 node2.vespanet 16/95 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Will download file 'bed71968b74fd6c' with timeout PT10S
1723647539.703941 node0.vespanet 16/82 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file 'bed71968b74fd6c'' from Connection { Socket[addr=/172.19.0.2,port=55434,localport=19070] }
1723647539.704806 node0.vespanet 16/223 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug accepted compression types=[gzip, lz4, zstd], compression type to use=zstd
1723647539.706402 node0.vespanet 16/223 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceCompressor spam Compressing with compression type zstd
1723647539.710774 node0.vespanet 16/223 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Start serving bed71968b74fd6c with file '/opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c/49'
1723647539.715731 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileMeta' with parameters : bed71968b74fd6c,49,compressed,56278,zstd
1723647539.723542 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFilePart' with parameters : bed71968b74fd6c,59,0,[B@73c6eb1a
1723647539.735291 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam 100.0 percent of 'bed71968b74fd6c' downloaded
1723647539.736114 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileEof' with parameters : bed71968b74fd6c,59,6645634188661618573,0,OK
1723647539.736950 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam compression type to use=zstd
1723647539.737450 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceCompressor spam Decompressing '/opt/vespa/var/db/vespa/filedistribution/4911376584720152540704.inprogress' into '/opt/vespa/var/db/vespa/filedistribution/archive5191366685375228574'
1723647539.737576 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceCompressor spam Decompressing with compression type zstd
1723647539.749826 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam File moved from /opt/vespa/var/db/vespa/filedistribution/archive5191366685375228574 to /opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c
1723647539.750554 node0.vespanet 16/223 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Done serving bed71968b74fd6c with file '/opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c/49'
1723647539.755955 node2.vespanet 16/256 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Request callback, OK. Req: request filedistribution.serveFile(bed71968b74fd6c,1,[gzip, zstd, lz4]): 0,OK\nSpec: tcp/node0.vespanet:19070
1723647539.756640 node2.vespanet 16/256 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Found file 'bed71968b74fd6c' available at tcp/node0.vespanet:19070
1723647539.796392 node1.vespanet 16/349 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug File reference 'bed71968b74fd6c' ('/opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c') does not exist.
1723647539.892435 node0.vespanet 16/113 configserver Container.com.yahoo.vespa.config.server.ApplicationRepository info Session 49 prepared successfully.
1723647539.913341 node1.vespanet 16/349 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug File reference 'bed71968b74fd6c' ('/opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c') does not exist.
1723647540.095355 node1.vespanet 16/349 configserver Container.com.yahoo.vespa.config.server.session.SessionRepository info Session activated: 49
1723647540.143312 node2.vespanet 16/358 configserver Container.com.yahoo.vespa.config.server.session.SessionRepository info Session activated: 49
# node4 managed to download `50b8474089c54d6/marqo_index_settings.json` from node1 relayed to node0, this is done right after the session is active. no delay
1723647540.152869 node4.vespanet 133/38 container Container.com.yahoo.filedistribution.fileacquirer.FileAcquirerImpl debug InvokeSync waitFor file '50b8474089c54d6' with 60.0 seconds timeout
1723647540.157769 node1.vespanet 16/83 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file '50b8474089c54d6'' from Connection { Socket[addr=/172.19.0.3,port=45510,localport=19070] }
1723647540.158814 node1.vespanet 16/220 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug File reference '50b8474089c54d6' ('/opt/vespa/var/db/vespa/filedistribution/50b8474089c54d6') does not exist.
1723647540.158924 node1.vespanet 16/220 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug File not found, downloading from another source
1723647540.159077 node1.vespanet 16/220 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Will download file '50b8474089c54d6' with timeout PT10S
1723647540.176019 node0.vespanet 16/83 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file '50b8474089c54d6'' from Connection { Socket[addr=/172.19.0.8,port=46240,localport=19070] }
1723647540.177374 node0.vespanet 16/2290 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Start serving 50b8474089c54d6 with file '/opt/vespa/var/db/vespa/filedistribution/50b8474089c54d6/marqo_index_settings.json'
1723647540.178615 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileMeta' with parameters : 50b8474089c54d6,marqo_index_settings.json,file,1378
1723647540.182949 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFilePart' with parameters : 50b8474089c54d6,60,0,[B@5c128707
1723647540.184885 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam 100.0 percent of '50b8474089c54d6' downloaded
1723647540.185348 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileEof' with parameters : 50b8474089c54d6,60,-1390720545112173406,0,OK
1723647540.185576 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver debug Uncompressed file, moving to /opt/vespa/var/db/vespa/filedistribution/50b8474089c54d6/marqo_index_settings.json
1723647540.185731 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam File moved from /opt/vespa/var/db/vespa/filedistribution/marqo_index_settings.json4170593771066296984.inprogress to /opt/vespa/var/db/vespa/filedistribution/50b8474089c54d6/marqo_index_settings.json
1723647540.185851 node0.vespanet 16/2290 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Done serving 50b8474089c54d6 with file '/opt/vespa/var/db/vespa/filedistribution/50b8474089c54d6/marqo_index_settings.json'
1723647540.185932 node1.vespanet 16/220 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Start serving 50b8474089c54d6 with file '/opt/vespa/var/db/vespa/filedistribution/50b8474089c54d6/marqo_index_settings.json'
1723647540.186289 node1.vespanet 16/253 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Request callback, OK. Req: request filedistribution.serveFile(50b8474089c54d6,1,[zstd, gzip, lz4]): 0,OK\nSpec: tcp/node0.vespanet:19070
1723647540.186666 node1.vespanet 16/253 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Found file '50b8474089c54d6' available at tcp/node0.vespanet:19070
1723647540.194317 node1.vespanet 16/220 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Done serving 50b8474089c54d6 with file '/opt/vespa/var/db/vespa/filedistribution/50b8474089c54d6/marqo_index_settings.json'
# node4 failed to download `569b5a7c041bcb4a/marqo_index_settings_history.json` from node2 relayed to node1, since node1 does not have it, the connection hung
1723647540.197927 node4.vespanet 133/38 container Container.com.yahoo.filedistribution.fileacquirer.FileAcquirerImpl debug InvokeSync waitFor file '569b5a7c041bcb4a' with 60.0 seconds timeout
1723647540.202578 node2.vespanet 16/88 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file '569b5a7c041bcb4a'' from Connection { Socket[addr=/172.19.0.3,port=55216,localport=19070] }
1723647540.204446 node2.vespanet 16/271 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug File reference '569b5a7c041bcb4a' ('/opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a') does not exist.
1723647540.204761 node2.vespanet 16/271 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug File not found, downloading from another source
1723647540.204951 node2.vespanet 16/271 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Will download file '569b5a7c041bcb4a' with timeout PT10S
1723647540.206795 node1.vespanet 16/84 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file '569b5a7c041bcb4a'' from Connection { Socket[addr=/172.19.0.2,port=52412,localport=19070] }
1723647540.207608 node1.vespanet 16/236 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug File reference '569b5a7c041bcb4a' ('/opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a') does not exist.
1723647540.207679 node1.vespanet 16/236 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug File not found, will not download from another source
1723647540.208335 node2.vespanet 16/252 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Request callback, OK. Req: request filedistribution.serveFile(569b5a7c041bcb4a,1,[gzip, zstd, lz4]): 1,File reference not found\nSpec: tcp/node1.vespanet:19070
1723647540.208555 node2.vespanet 16/252 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug file '569b5a7c041bcb4a' not found or timed out (error code 1) at tcp/node1.vespanet:19070
# node0 logs out of order somehow
1723647540.228413 node0.vespanet 16/43789 configserver Container.com.yahoo.vespa.config.server.session.SessionRepository info Session activated: 49
1723647540.228779 node0.vespanet 16/113 configserver Container.com.yahoo.vespa.config.server.deploy.Deployment info Session 49 activated successfully using no host provisioner. Config generation 49. Based on session 48. File references: [file '569b5a7c041bcb4a', file 'bed71968b74fd6c', file '4bd190d76a4bfef5', file '50b8474089c54d6']
# node0 failed to setFileReferencesToDownload due to timeout. This is fine since it's just a hint
1723647540.910380 node0.vespanet 16/89 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDistributionImpl debug filedistribution.setFileReferencesToDownload failed for Connection { Socket[addr=node1.vespanet/172.19.0.8,port=19070,localport=33946] }: 103 (Request timed out after 1.0 seconds.)
1723647540.910598 node0.vespanet 16/89 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDistributionImpl debug filedistribution.setFileReferencesToDownload failed for Connection { Socket[addr=node2.vespanet/172.19.0.2,port=19070,localport=34126] }: 103 (Request timed out after 1.0 seconds.)
# node1 downloads the application package `bed71968b74fd6c/49` from node2 (this happened at +2s, which could be a scheduled run)
1723647542.378126 node1.vespanet 16/143 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Will download file 'bed71968b74fd6c' with timeout PT1M
1723647542.380332 node2.vespanet 16/91 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file 'bed71968b74fd6c'' from Connection { Socket[addr=/172.19.0.8,port=41236,localport=19070] }
1723647542.383030 node2.vespanet 16/274 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug accepted compression types=[zstd, gzip, lz4], compression type to use=zstd
1723647542.384302 node2.vespanet 16/274 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceCompressor spam Compressing with compression type zstd
1723647542.396565 node2.vespanet 16/274 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Start serving bed71968b74fd6c with file '/opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c/49'
1723647542.396979 node1.vespanet 16/145 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileMeta' with parameters : bed71968b74fd6c,49,compressed,56278,zstd
1723647542.400326 node1.vespanet 16/145 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFilePart' with parameters : bed71968b74fd6c,11,0,[B@134337ca
1723647542.400686 node1.vespanet 16/145 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam 100.0 percent of 'bed71968b74fd6c' downloaded
1723647542.400880 node1.vespanet 16/145 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileEof' with parameters : bed71968b74fd6c,11,6645634188661618573,0,OK
1723647542.401184 node1.vespanet 16/145 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam compression type to use=zstd
1723647542.401509 node1.vespanet 16/145 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceCompressor spam Decompressing '/opt/vespa/var/db/vespa/filedistribution/4912788838635265955185.inprogress' into '/opt/vespa/var/db/vespa/filedistribution/archive15723724945497459670'
1723647542.401577 node1.vespanet 16/145 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceCompressor spam Decompressing with compression type zstd
1723647542.405977 node1.vespanet 16/145 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam File moved from /opt/vespa/var/db/vespa/filedistribution/archive15723724945497459670 to /opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c
1723647542.406319 node2.vespanet 16/274 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Done serving bed71968b74fd6c with file '/opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c/49'
1723647542.406636 node1.vespanet 16/288 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Request callback, OK. Req: request filedistribution.serveFile(bed71968b74fd6c,1,[zstd, gzip, lz4]): 0,OK\nSpec: tcp/node2.vespanet:19070
1723647542.406757 node1.vespanet 16/288 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Found file 'bed71968b74fd6c' available at tcp/node2.vespanet:19070
# not sure why there's a log here from node1, the timing is about 10s after requesting this from node2 in the first place
1723647549.702861 node1.vespanet 16/241 configserver Container.com.yahoo.vespa.filedistribution.FileDownloader debug File reference 'bed71968b74fd6c' found: /opt/vespa/var/db/vespa/filedistribution/bed71968b74fd6c/49
# After 10 seconds, node2 returned the result to node4 indicating the file `569b5a7c041bcb4a/marqo_index_settings_history.json` cannot be downloaded
1723647550.206354 node2.vespanet 16/271 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer info Failed downloading 'file '569b5a7c041bcb4a', client: Connection { Socket[addr=/172.19.0.3,port=55216,localport=19070] }'
# node2 then sent request to node0 to download the second file `569b5a7c041bcb4a/marqo_index_settings_history.json`
1723647550.207469 node0.vespanet 16/82 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file '569b5a7c041bcb4a'' from Connection { Socket[addr=/172.19.0.2,port=55434,localport=19070] }
1723647550.208281 node0.vespanet 16/238 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Start serving 569b5a7c041bcb4a with file '/opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json'
1723647550.208542 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileMeta' with parameters : 569b5a7c041bcb4a,marqo_index_settings_history.json,file,3436
1723647550.209273 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFilePart' with parameters : 569b5a7c041bcb4a,60,0,[B@769f6330
1723647550.209588 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam 100.0 percent of '569b5a7c041bcb4a' downloaded
1723647550.209740 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileEof' with parameters : 569b5a7c041bcb4a,60,7130568599363028354,0,OK
1723647550.210209 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver debug Uncompressed file, moving to /opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json
1723647550.210316 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam File moved from /opt/vespa/var/db/vespa/filedistribution/marqo_index_settings_history.json3811455371025104079.inprogress to /opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json
1723647550.212044 node2.vespanet 16/86 configserver Container.com.yahoo.vespa.filedistribution.Downloads debug Received 'file '569b5a7c041bcb4a'', which was not requested. Can be ignored if happening during upgrades/restarts
1723647550.212197 node0.vespanet 16/238 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Done serving 569b5a7c041bcb4a with file '/opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json'
1723647550.212429 node2.vespanet 16/252 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Request callback, OK. Req: request filedistribution.serveFile(569b5a7c041bcb4a,1,[gzip, zstd, lz4]): 0,OK\nSpec: tcp/node0.vespanet:19070
1723647550.212488 node2.vespanet 16/252 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Found file '569b5a7c041bcb4a' available at tcp/node0.vespanet:19070
# After another 10s, node4 sent request to node1 to download the second file `569b5a7c041bcb4a/marqo_index_settings_history.json`, it got relayed to node0 and succeeded finally
1723647560.211108 node1.vespanet 16/83 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file '569b5a7c041bcb4a'' from Connection { Socket[addr=/172.19.0.3,port=45510,localport=19070] }
1723647560.211951 node1.vespanet 16/237 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileDirectory debug File reference '569b5a7c041bcb4a' ('/opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a') does not exist.
1723647560.212113 node1.vespanet 16/237 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug File not found, downloading from another source
1723647560.212320 node1.vespanet 16/237 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Will download file '569b5a7c041bcb4a' with timeout PT10S
1723647560.215783 node0.vespanet 16/83 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Received request for file reference 'file '569b5a7c041bcb4a'' from Connection { Socket[addr=/172.19.0.8,port=46240,localport=19070] }
1723647560.216695 node0.vespanet 16/2291 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Start serving 569b5a7c041bcb4a with file '/opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json'
1723647560.217191 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileMeta' with parameters : 569b5a7c041bcb4a,marqo_index_settings_history.json,file,3436
1723647560.220182 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFilePart' with parameters : 569b5a7c041bcb4a,61,0,[B@6e431602
1723647560.220868 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam 100.0 percent of '569b5a7c041bcb4a' downloaded
1723647560.221278 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam Received method call 'filedistribution.receiveFileEof' with parameters : 569b5a7c041bcb4a,61,7130568599363028354,0,OK
1723647560.221570 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver debug Uncompressed file, moving to /opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json
1723647560.221853 node1.vespanet 16/79 configserver Container.com.yahoo.vespa.filedistribution.FileReceiver spam File moved from /opt/vespa/var/db/vespa/filedistribution/marqo_index_settings_history.json4975732854662624961.inprogress to /opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json
1723647560.222194 node0.vespanet 16/2291 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Done serving 569b5a7c041bcb4a with file '/opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json'
1723647560.222213 node1.vespanet 16/237 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Start serving 569b5a7c041bcb4a with file '/opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json'
1723647560.222540 node1.vespanet 16/242 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Request callback, OK. Req: request filedistribution.serveFile(569b5a7c041bcb4a,1,[zstd, gzip, lz4]): 0,OK\nSpec: tcp/node0.vespanet:19070
1723647560.222710 node1.vespanet 16/242 configserver Container.com.yahoo.vespa.filedistribution.FileReferenceDownloader debug Found file '569b5a7c041bcb4a' available at tcp/node0.vespanet:19070
1723647560.229310 node1.vespanet 16/237 configserver Container.com.yahoo.vespa.config.server.filedistribution.FileServer debug Done serving 569b5a7c041bcb4a with file '/opt/vespa/var/db/vespa/filedistribution/569b5a7c041bcb4a/marqo_index_settings_history.json'
# node4 updates the SlobrokList, not sure if this is related
1723647560.309675 node4.vespanet 133/38 container Container.com.yahoo.jrt.slobrok.api.SlobrokList debug checkUpdate() updated tmp list=[tcp/node2.vespanet:19099, tcp/node0.vespanet:19099, tcp/node1.vespanet:19099] from shared list=[tcp/node2.vespanet:19099, tcp/node0.vespanet:19099, tcp/node1.vespanet:19099]
1723647560.309806 node4.vespanet 133/38 container Container.com.yahoo.jrt.slobrok.api.SlobrokList debug nextSlobrokSpec() returns: tcp/node2.vespanet:19099
1723647560.410254 node4.vespanet 133/1962 container Container.com.yahoo.jrt.slobrok.api.SlobrokList debug checkUpdate() updated tmp list=[tcp/node2.vespanet:19099, tcp/node0.vespanet:19099, tcp/node1.vespanet:19099] from shared list=[tcp/node2.vespanet:19099, tcp/node0.vespanet:19099, tcp/node1.vespanet:19099]
1723647560.410332 node4.vespanet 133/1962 container Container.com.yahoo.jrt.slobrok.api.SlobrokList debug nextSlobrokSpec() returns: tcp/node2.vespanet:19099
There are a few weird behaviour in this scenario:
I think I found the cause of this issue! it's due to the combination of timeouts and retry backoff. The default (and maybe hardcoded) timeout for downloading a file/directory is 10s, and sleepBetweenRetries is 5s. We calculate the backoff using this logic. Which means the first retry will wait 10s, or slightly less than 10s, since Duration.between(Instant.now(), end)
is less than 10s.
This matches the weird behaviour I mentioned above. node2 waited 10s to retry a request on node0 after failing on node1. node4 waited 10s to retry a request on node1 after failing on node2.
To fix this, we'll need to allow configuring sleepBetweenRetries
to a smaller value. And also use Math.pow(2, retryCount - 1)) * sleepBetweenRetries.toMillis() since we want the first retry to sleep
sleepBetweenRetriesinstead of
2 * sleepBetweenRetries`. In my case, I'll need to set this to be only 100ms.
Another quick fix would be to allow immediate retry after the first failure, on this line, instead of checking if (retry > 0)
, we could check if (retry > 1)
, only backoff in the second retry. This will solve the wait issue with 3 config servers, since if the first request misses, the second request to another server will always hit the target. This will be suffice to achieve sub-second convergence.
Would there be any unwanted side effects if we make this change? Keen to know your thoughts on this.
Hello @hmusum . Thanks for addressing this issue so promptly. I've tested version 8.395.28
with the config VESPA_FILE_DOWNLOAD_BACKOFF_INITIAL_TIME_MS=100
. and verified this long wait issue has been fixed. Please feel free to close this ticket.
@papa99do Thanks, closing
Reported on Slack: