skyplane-project / skyplane

🔥 Blazing fast bulk data transfers between any cloud 🔥
https://skyplane.org
Apache License 2.0

[bug] Transfer crash with bug #802

Closed killerdbob closed 1 year ago

killerdbob commented 1 year ago

Describe the bug: when copying, the gateway does not work correctly.

Transfer client log:

2023-04-19T16:09:46.687851954+08:00 08:09:46 [INFO]  Chunk 5d5a03f00f754760992622d942958e4e state transition 
2023-04-19T16:09:46.687860787+08:00 ChunkState.upload_queued -> ChunkState.upload_in_progress
2023-04-19T16:09:46.688100474+08:00 08:09:46 [INFO]  Chunk b98fa26edb754a5188a1a89aa98254a2 state transition 
2023-04-19T16:09:46.688112623+08:00 ChunkState.upload_queued -> ChunkState.upload_in_progress
2023-04-19T16:09:46.690637358+08:00 08:09:46 [INFO]   started new server connection to 124.71.208.7:55645
2023-04-19T16:09:46.691361526+08:00 08:09:46 [DEBUG] :['a97f70a5c4764f038ff643b706dcca68'] created new socket
2023-04-19T16:09:46.692079799+08:00 08:09:46 [INFO]  Chunk 376502b09e7d4b9290c63da348945b9b state transition 
2023-04-19T16:09:46.692087505+08:00 ChunkState.download_in_progress -> ChunkState.downloaded
2023-04-19T16:09:46.692465768+08:00 08:09:46 [INFO]  Queuing download for chunk 751a6198baa4459ea009c62f3b3d8982 
2023-04-19T16:09:46.692473557+08:00 (state=ChunkState.registered
2023-04-19T16:09:46.692876745+08:00 08:09:46 [ERROR]  Exception: chunk 5d5a03f00f754760992622d942958e4e has size 
2023-04-19T16:09:46.692884815+08:00 5747628 but should be 10530318
2023-04-19T16:09:46.692887081+08:00 08:09:46 [INFO]  Chunk a97f70a5c4764f038ff643b706dcca68 state transition 
2023-04-19T16:09:46.692889136+08:00 ChunkState.upload_queued -> ChunkState.upload_in_progress
2023-04-19T16:09:46.693366642+08:00 08:09:46 [ERROR]  Exception: chunk b98fa26edb754a5188a1a89aa98254a2 has size 
2023-04-19T16:09:46.693382406+08:00 5579731 but should be 10487789
2023-04-19T16:09:46.693601119+08:00 08:09:46 [INFO]  Chunk 751a6198baa4459ea009c62f3b3d8982 state transition 
2023-04-19T16:09:46.693608017+08:00 ChunkState.registered -> ChunkState.download_queued
2023-04-19T16:09:46.694987253+08:00 08:09:46 [INFO]  Chunk 751a6198baa4459ea009c62f3b3d8982 state transition 
2023-04-19T16:09:46.695000838+08:00 ChunkState.download_queued -> ChunkState.download_in_progress
2023-04-19T16:09:46.695356622+08:00 08:09:46 [INFO]   exiting, closing sockets


killerdbob commented 1 year ago

2023-04-19T16:09:46.692876745+08:00 08:09:46 [ERROR] Exception: chunk 5d5a03f00f754760992622d942958e4e has size 2023-04-19T16:09:46.692884815+08:00 5747628 but should be 10530318

killerdbob commented 1 year ago

Another error also appears: Exception: [Errno 32] Broken pipe.

2023-04-19T16:09:46.689287614+08:00 08:09:46 [INFO]   all chunks reached state 'downloaded'
2023-04-19T16:09:46.689291872+08:00 08:09:46 [DEBUG] :2193dfd3f0054ddeaefa663134466bd1 sent chunk header
2023-04-19T16:09:46.689617183+08:00 08:09:46 [INFO]   waiting for chunks to reach state 'downloaded'
2023-04-19T16:09:46.689785203+08:00 08:09:46 [INFO]   exiting, closing servers
2023-04-19T16:09:46.690131174+08:00 08:09:46 [INFO]   all chunks reached state 'downloaded'
2023-04-19T16:09:46.690583441+08:00 08:09:46 [INFO]   exiting, closing servers
2023-04-19T16:09:46.693980180+08:00 08:09:46 [DEBUG] :9e8ea35b00d74357aa72ce6d3d11771d sending chunk header
2023-04-19T16:09:46.694755476+08:00 08:09:46 [DEBUG] :9e8ea35b00d74357aa72ce6d3d11771d sent chunk header
2023-04-19T16:09:46.699450873+08:00 08:09:46 [INFO]   started new server connection to 116.204.80.107:60259
2023-04-19T16:09:46.700067323+08:00 08:09:46 [DEBUG] :['1ca294320ada46578796b20c060279ed'] created new socket
2023-04-19T16:09:46.701324445+08:00 08:09:46 [ERROR]  Exception: [Errno 32] Broken pipe
2023-04-19T16:09:46.702534046+08:00 08:09:46 [INFO]  Chunk 1ca294320ada46578796b20c060279ed state transition 
2023-04-19T16:09:46.702543872+08:00 ChunkState.upload_queued -> ChunkState.upload_in_progress
2023-04-19T16:09:46.706626482+08:00 08:09:46 [INFO]   started new server connection to 116.204.80.107:40485
2023-04-19T16:09:46.707418398+08:00 08:09:46 [ERROR]  Exception: [Errno 32] Broken pipe
2023-04-19T16:09:46.707426469+08:00 08:09:46 [DEBUG] :['39e1a36a7121459ba9197280203c991f'] created new socket
2023-04-19T16:09:46.708700987+08:00 08:09:46 [INFO]   exiting, closing sockets
2023-04-19T16:09:46.709356945+08:00 08:09:46 [INFO]   waiting for chunks to reach state 'downloaded'
2023-04-19T16:09:46.709395814+08:00 08:09:46 [INFO]  Chunk 39e1a36a7121459ba9197280203c991f state transition 
2023-04-19T16:09:46.709400734+08:00 ChunkState.upload_queued -> ChunkState.upload_in_progress
2023-04-19T16:09:46.709838500+08:00 08:09:46 [INFO]   all chunks reached state 'downloaded'
2023-04-19T16:09:46.710322146+08:00 08:09:46 [INFO]   exiting, closing servers
2023-04-19T16:09:46.713415912+08:00 08:09:46 [ERROR]  Exception: [Errno 32] Broken pipe
2023-04-19T16:09:46.714350614+08:00 08:09:46 [INFO]   exiting, closing sockets
2023-04-19T16:09:46.715225215+08:00 08:09:46 [INFO]   waiting for chunks to reach state 'downloaded'
2023-04-19T16:09:46.715253205+08:00 08:09:46 [INFO]   closed destination socket 116.204.113.253:42085
2023-04-19T16:09:46.715578675+08:00 08:09:46 [ERROR]  Exception: [Errno 32] Broken pipe
2023-04-19T16:09:46.715806061+08:00 08:09:46 [INFO]   all chunks reached state 'downloaded'
2023-04-19T16:09:46.715923171+08:00 08:09:46 [DEBUG] :335168d4a537478399ee83baa30cbd63 sending chunk header
2023-04-19T16:09:46.716279309+08:00 08:09:46 [INFO]   exiting, closing servers
2023-04-19T16:09:46.716628232+08:00 08:09:46 [DEBUG] :335168d4a537478399ee83baa30cbd63 sent chunk header
2023-04-19T16:09:46.718608049+08:00 08:09:46 [INFO]   exiting, closing sockets
2023-04-19T16:09:46.719211520+08:00 08:09:46 [INFO]   started new server connection to 116.204.80.107:34069
2023-04-19T16:09:46.719488853+08:00 08:09:46 [INFO]   waiting for chunks to reach state 'downloaded'
2023-04-19T16:09:46.719816859+08:00 08:09:46 [DEBUG] :['f6614cdd9dd34971872ba11671058abe'] created new socket
2023-04-19T16:09:46.720228880+08:00 08:09:46 [INFO]   all chunks reached state 'downloaded'
2023-04-19T16:09:46.720652008+08:00 08:09:46 [INFO]   exiting, closing sockets

sarahwooders commented 1 year ago

@killerdbob was this issue resolved? I believe it's a concurrency error that should be fixed by #803.

killerdbob commented 1 year ago

The problem still exists; I am trying to fix it.

killerdbob commented 1 year ago

Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-south-1:cbc30f49b7904be6b479410b6b725808.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-south-1:cbc30f49b7904be6b479410b6b725808.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:65736523c8b148c89a8c8e0a0385d228.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:65736523c8b148c89a8c8e0a0385d228.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:75837233eb084fd3ab9fcc502a3395b6.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:75837233eb084fd3ab9fcc502a3395b6.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:6d7824e1a8304e64bd4bae36348ad1c8.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:e096aad5ed724c648c8d3774dade75d1.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:6d7824e1a8304e64bd4bae36348ad1c8.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:e096aad5ed724c648c8d3774dade75d1.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-south-1:0c45f0193ee04b5c8fa4d1351c8334d6.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-south-1:9b61b00cf19545769585eabfac2a5729.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-south-1:9b61b00cf19545769585eabfac2a5729.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:d9e6d104f6a349a7a7b3d5c5cb708cdd.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-east-3:d9e6d104f6a349a7a7b3d5c5cb708cdd.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_ali:cn-heyuan:f980561b5214491e870f2fce8db8b054.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-south-1:0c45f0193ee04b5c8fa4d1351c8334d6.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_ali:cn-heyuan:f980561b5214491e870f2fce8db8b054.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-south-1:58df0fcadf2e4b8c803cea280a1ea3f7.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-north-9:fec18820e34b49888cc498d3756d558c.stdout
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-north-9:fec18820e34b49888cc498d3756d558c.stderr
Downloading log: /tmp/skyplane/transfer_logs/20230423_210900/gateway_hw:cn-south-1:58df0fcadf2e4b8c803cea280a1ea3f7.stderr
Exception in thread Thread-149:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/PCL-skyplane/skyplane/api/tracker.py", line 174, in run
    raise err
  File "/home/PCL-skyplane/skyplane/api/tracker.py", line 163, in run
    self.monitor_transfer()
  File "/home/PCL-skyplane/skyplane/utils/imports.py", line 33, in wrapped
    return fn(*modules_imported, *args, **kwargs)
  File "/home/PCL-skyplane/skyplane/api/tracker.py", line 228, in monitor_transfer
    raise exceptions.SkyplaneGatewayException("Transfer failed with errors", errors)
skyplane.exceptions.SkyplaneGatewayException: Transfer failed with errors
⠙ Transfer progress ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/7.8 GiB ? -:--:--

killerdbob commented 1 year ago

I used the commit from #803, but the chunk size is still not correct.

root@b6d8eb95476b:/tmp/skyplane/transfer_logs/20230423_210900# grep -Rn "chunk 98e19c15e55c4f3fa6fcdbd265875de5 has size" -A 10 -B 10
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2349-2023-04-23T21:19:24.696762200+08:00 60.204.158.176: 1.16s
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2350-2023-04-23T21:19:24.697069197+08:00 13:19:24 [DEBUG] :['d05ef2e9bd274aac812ab70d968bdda5'] registered chunks
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2351-2023-04-23T21:19:24.697390582+08:00 13:19:24 [DEBUG] :['a1963771c4814b919397313179915c25'] creating new socket
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2352-2023-04-23T21:19:24.697884700+08:00 13:19:24 [DEBUG] pre-register chunks ['d05ef2e9bd274aac812ab70d968bdda5'] to 
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2353-2023-04-23T21:19:24.697894987+08:00 60.204.158.176: 1.16s
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2354-2023-04-23T21:19:24.698526200+08:00 13:19:24 [DEBUG] :['d05ef2e9bd274aac812ab70d968bdda5'] creating new socket
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2355-2023-04-23T21:19:24.764114909+08:00 13:19:24 [INFO]   started new server connection to 60.204.150.46:32857
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2356-2023-04-23T21:19:24.764696475+08:00 13:19:24 [DEBUG] :['98e19c15e55c4f3fa6fcdbd265875de5'] created new socket
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2357-2023-04-23T21:19:24.766155791+08:00 13:19:24 [INFO]  Chunk 98e19c15e55c4f3fa6fcdbd265875de5 state transition 
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2358-2023-04-23T21:19:24.766163625+08:00 ChunkState.upload_queued -> ChunkState.upload_in_progress
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr:2359:2023-04-23T21:19:24.771282545+08:00 13:19:24 [ERROR]  Exception: chunk 98e19c15e55c4f3fa6fcdbd265875de5 has size 
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2360-2023-04-23T21:19:24.771291423+08:00 5679336 but should be 10660216
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2361-2023-04-23T21:19:24.773584162+08:00 13:19:24 [INFO]   exiting, closing sockets
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2362-2023-04-23T21:19:24.774190150+08:00 13:19:24 [INFO]   exiting
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2363-2023-04-23T21:19:24.774200837+08:00 13:19:24 [INFO]   exiting
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2364-2023-04-23T21:19:24.774203447+08:00 13:19:24 [INFO]   waiting for chunks to reach state 'downloaded'
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2365-2023-04-23T21:19:24.774206160+08:00 13:19:24 [INFO]   exiting, closing sockets
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2366-2023-04-23T21:19:24.774220385+08:00 13:19:24 [INFO]   exiting
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2367-2023-04-23T21:19:24.774236778+08:00 13:19:24 [INFO]   exiting
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2368-2023-04-23T21:19:24.774239603+08:00 13:19:24 [INFO]   exiting, closing sockets
gateway_hw:cn-north-9:9dad60e50dd849e7a6fd445f83227196.stderr-2369-2023-04-23T21:19:24.774242038+08:00 13:19:24 [INFO]   exiting, closing sockets

sarahwooders commented 1 year ago

Are you re-building the gateways with the branch in #803? The instructions for building Skyplane from source are here. Otherwise, I think you may be running the original buggy gateway code. Let me know if not, and I can take another look.

Also, if you let me know your file size and number of files I can try to reproduce the error.

killerdbob commented 1 year ago

I am using the latest gateway. I set up my own Docker registry and pushed the Docker image to it.

Each file is 10 MB, and the total is 8 GB.

killerdbob commented 1 year ago

I added some logs below.

image image

killerdbob commented 1 year ago

Maybe the problem is that the connection failed, which would result in the wrong data length. If so, where should the reconnect happen?


2023-04-24 17:01:17,690| ERROR   | Could not establish connection from local ('127.0.0.1', 34833) to remote ('127.0.0.1', 8081) side of the tunnel: open new channel
ssh error: ChannelException(2, 'Connect failed')
2023-04-24 17:01:17,711| ERROR   | Secsh channel 1 open FAILED: Connection refused: Connect failed
2023-04-24 17:01:17,713| ERROR   | Could not establish connection from local ('127.0.0.1', 34833) to remote ('127.0.0.1', 8081) side of the tunnel: open new channel
ssh error: ChannelException(2, 'Connect failed')
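
One way a receiver can end up holding fewer bytes than the header announced is a connection that closes early. A minimal sketch of detecting that case, not Skyplane's receiver code: `recv_exact` and `ShortReadError` are hypothetical names, and a local `socket.socketpair()` stands in for the gateway-to-gateway connection.

```python
import socket


class ShortReadError(Exception):
    """Raised when the peer closes before the expected byte count arrives."""


def recv_exact(sock: socket.socket, expected: int, bufsize: int = 65536) -> bytes:
    """Read exactly `expected` bytes from `sock` or raise ShortReadError."""
    buf = bytearray()
    while len(buf) < expected:
        part = sock.recv(min(bufsize, expected - len(buf)))
        if not part:  # recv() returns b"" once the peer has closed the connection
            raise ShortReadError(f"got {len(buf)} bytes but expected {expected}")
        buf.extend(part)
    return bytes(buf)


def demo_short_read() -> str:
    """The sender is supposed to deliver 10 bytes but closes after 4."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(b"abcd")
        sender.close()
        try:
            recv_exact(receiver, 10)
        except ShortReadError as err:
            return str(err)
        return "no error"
    finally:
        receiver.close()
```

This only shows detection. Where to reconnect and resend is a separate design decision for the sending side.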
killerdbob commented 1 year ago

Where could "Broken pipe" happen?

I think the receiver may be too slow to keep up with the amount of data being sent.

If so, f.write(to_write) in the receiver would be what slows the process down.

gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6089-2023-04-24T18:57:18.338654481+08:00 10:57:18 [INFO]   started new server connection to 116.204.116.175:58671
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6090-2023-04-24T18:57:18.339297409+08:00 10:57:18 [DEBUG] :['2c48628f94d54b92957d5e113a3faa13'] created new socket
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6091-2023-04-24T18:57:18.339771471+08:00 10:57:18 [INFO]  Chunk 779f9e17902646e1b6ac4c8138daf9db state transition 
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6092-2023-04-24T18:57:18.339778589+08:00 ChunkState.upload_in_progress -> ChunkState.upload_complete
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6093-2023-04-24T18:57:18.340179790+08:00 10:57:18 [DEBUG] :1db4bdeeceaa4827b93c2dead4c824b3 sending chunk header
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6094-2023-04-24T18:57:18.340240019+08:00 10:57:18 [INFO]   started new server connection to 116.204.116.175:34965
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6095-2023-04-24T18:57:18.340836602+08:00 10:57:18 [DEBUG] :['38123e65c1cd4c59869329c9b8e18c26'] created new socket
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6096-2023-04-24T18:57:18.340845221+08:00 10:57:18 [DEBUG] :1db4bdeeceaa4827b93c2dead4c824b3 sent chunk header
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6097-2023-04-24T18:57:18.341248423+08:00 10:57:18 [DEBUG] : 320fea0b30cf4699b2f2c58b25e6fd86 has size 10528282 and should
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6098-2023-04-24T18:57:18.341254428+08:00 be 10528282
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr:6099:2023-04-24T18:57:18.341335012+08:00 10:57:18 [ERROR]  Exception: [Errno 32] Broken pipe
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6100-2023-04-24T18:57:18.342239960+08:00 10:57:18 [INFO]  Chunk 2c48628f94d54b92957d5e113a3faa13 state transition 
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6101-2023-04-24T18:57:18.342251942+08:00 ChunkState.upload_queued -> ChunkState.upload_in_progress
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6102-2023-04-24T18:57:18.343247364+08:00 10:57:18 [INFO]   Exiting all workers except for API
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6103-2023-04-24T18:57:18.343293445+08:00 10:57:18 [INFO]   started new server connection to 116.204.116.175:35135
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6104-2023-04-24T18:57:18.343413701+08:00 10:57:18 [INFO]  Chunk 38123e65c1cd4c59869329c9b8e18c26 state transition 
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6105-2023-04-24T18:57:18.343417120+08:00 ChunkState.upload_queued -> ChunkState.upload_in_progress
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6106-2023-04-24T18:57:18.343418929+08:00 10:57:18 [DEBUG] :1db4bdeeceaa4827b93c2dead4c824b3 sent at 43582.90Mbps
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6107-2023-04-24T18:57:18.343952668+08:00 10:57:18 [DEBUG] :['afead984769a400ebd3d64ca2cdd8baf'] created new socket
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6108-2023-04-24T18:57:18.344661954+08:00 10:57:18 [INFO]  Chunk 1db4bdeeceaa4827b93c2dead4c824b3 state transition 
gateway_hw:cn-south-1:ab6c55def7394db0a05379f747b6cec2.stderr-6109-2023-04-24T18:57:18.344668572+08:00 ChunkState.upload_in_progress -> ChunkState.upload_complete
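
On where "Broken pipe" comes from: `[Errno 32]` (EPIPE) is raised on the writing side when it sends on a connection whose other end has already been closed. A slow receiver alone would normally make a blocking sender wait rather than fail. A receiver that exits after an error, as in the "exiting, closing sockets" lines, is one plausible trigger, though the logs here do not prove it. A small illustration on Linux with a local socket pair, unrelated to Skyplane's own code (`send_after_peer_close` is a hypothetical name):

```python
import errno
import socket


def send_after_peer_close() -> int:
    """Return the errno seen when writing to a socket whose peer has closed."""
    writer, reader = socket.socketpair()
    reader.close()  # the receiving side goes away first
    try:
        writer.sendall(b"x" * 1024)
    except OSError as err:
        return err.errno
    finally:
        writer.close()
    return 0
```

Python ignores SIGPIPE by default, which is why the failure surfaces as an exception instead of killing the process.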
sarahwooders commented 1 year ago

I added some logs below.

image image

I believe this check will fail because, after decompression or decryption, chunk_header.data_len won't match the actual data length.

sarahwooders commented 1 year ago

Does the broken pipe issue occur consistently? I will try to reproduce a similar transfer.

killerdbob commented 1 year ago

I added some logs below. image image

I believe this check will fail because, after decompression or decryption, chunk_header.data_len won't match the actual data length.

You are right; after decompression the size is larger.
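
The mismatch discussed here can be shown in a few lines. This is a sketch under assumptions, not Skyplane's wire protocol: `zlib` stands in for whatever codec the gateways use, and `Header`, `raw_len`, `pack` and `unpack` are hypothetical names; only `data_len` comes from the thread. The bytes on the wire are compared with `data_len`, and the original size is checked only after decompression.

```python
import zlib
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Header:
    data_len: int  # bytes actually sent on the wire (compressed)
    raw_len: int   # bytes of the original chunk (hypothetical field)


def pack(chunk: bytes) -> Tuple[Header, bytes]:
    """Compress a chunk and record both lengths in the header."""
    wire = zlib.compress(chunk)
    return Header(data_len=len(wire), raw_len=len(chunk)), wire


def unpack(header: Header, wire: bytes, decompress: bool) -> bytes:
    """Validate lengths; a relay passes decompress=False and forwards bytes as-is."""
    if len(wire) != header.data_len:
        raise ValueError(f"chunk has size {len(wire)} but should be {header.data_len}")
    if not decompress:
        return wire
    chunk = zlib.decompress(wire)
    if len(chunk) != header.raw_len:
        raise ValueError(f"chunk has size {len(chunk)} but should be {header.raw_len}")
    return chunk
```

Comparing the decompressed length with `data_len` would fail for any compressible chunk, which matches the observation that the size is larger after decompression.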

killerdbob commented 1 year ago

These are all the logs. Paths:
cn-south-1 -> cn-east-3
cn-south-1 -> cn-heyuan -> cn-east-3

gateway_hw-cn-south-1-3644bac7242348dda9392de647328748.txt
gateway_ali-cn-heyuan-19611fc1234d450f959c6b4b1390ef6e.txt
gateway_hw-cn-east-3-80c040cf2d6f4341a2f8564f8b776883.txt
gateway_hw-cn-east-3-521a4853d8f243d4853b75dbe87dc037.txt
gateway_hw-cn-east-3-1734a570105740ddb1b46f230541bc17.txt
gateway_hw-cn-east-3-e129f548142042c5a30751f32f81672f.txt
gateway_hw-cn-north-9-00c882d3f148429290a1ba7fd6fe109c.txt
gateway_hw-cn-north-9-587b4a6fad194f4c9d984c3a241cdce1.txt
gateway_hw-cn-south-1-2c3e177adc8243e8ab18a9830f380d5d.txt
gateway_hw-cn-south-1-52d39e2c7ad44671adcebdb1034ce783.txt
gateway_hw-cn-south-1-3197a37a19244181b1d290a7070b25b5.txt

2023-04-25T18:42:39.112241398+08:00 [2023-04-25 10:42:39,037] ERROR in app: Exception on /api/v1/servers/60285 [DELETE]
2023-04-25T18:42:39.112245257+08:00 Traceback (most recent call last):
2023-04-25T18:42:39.112247455+08:00   File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 2528, in wsgi_app
2023-04-25T18:42:39.112249796+08:00     response = self.full_dispatch_request()
2023-04-25T18:42:39.112251872+08:00                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-04-25T18:42:39.112262934+08:00   File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 1825, in full_dispatch_request
2023-04-25T18:42:39.112265402+08:00     rv = self.handle_user_exception(e)
2023-04-25T18:42:39.112267400+08:00          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-04-25T18:42:39.112269399+08:00   File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 1823, in full_dispatch_request
2023-04-25T18:42:39.112271462+08:00     rv = self.dispatch_request()
2023-04-25T18:42:39.112273986+08:00          ^^^^^^^^^^^^^^^^^^^^^^^
2023-04-25T18:42:39.112276092+08:00   File "/usr/local/lib/python3.11/site-packages/flask/app.py", line 1799, in dispatch_request
2023-04-25T18:42:39.112278169+08:00     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
2023-04-25T18:42:39.112280406+08:00            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-04-25T18:42:39.112282420+08:00   File "/pkg/skyplane/gateway/gateway_daemon_api.py", line 120, in remove_server
2023-04-25T18:42:39.112284602+08:00     self.gateway_receiver.stop_server(port)
2023-04-25T18:42:39.112286623+08:00   File "/pkg/skyplane/gateway/gateway_receiver.py", line 121, in stop_server
2023-04-25T18:42:39.112288719+08:00     os.kill(matched_process.pid, signal.SIGINT)
2023-04-25T18:42:39.112290718+08:00 ProcessLookupError: [Errno 3] No such process
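
The traceback above ends in `ProcessLookupError`, which `os.kill` raises when the target PID no longer exists, for example because the server process had already exited. One way to make a stop routine tolerate that, as a sketch only: `stop_process` and `demo` are hypothetical helpers, not the body of `stop_server` in Skyplane.

```python
import os
import signal
import subprocess
import sys


def stop_process(pid: int) -> bool:
    """Send SIGINT to `pid`; return False if the process is already gone."""
    try:
        os.kill(pid, signal.SIGINT)
    except ProcessLookupError:
        return False
    return True


def demo() -> bool:
    """Start a short-lived child, wait for it to exit, then try to stop it."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # the child has exited and been reaped, so its PID is gone
    return stop_process(child.pid)
```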
killerdbob commented 1 year ago

Do we need the assertion here? On forwarding nodes the data is still compressed, so the condition may not hold.

image

If I comment out the code above, some files end up with a wrong size.

image

I located the problem: the file size is incorrect, which triggers the "file size not correct" assertion.

sarahwooders commented 1 year ago

@killerdbob sorry, but I'm not able to reproduce your issue in my own transfers on the branch, with or without compression. Would you be open to setting up a call? You can email me at wooders@berkeley.edu or join our Slack. It'd be great to figure out what's going on!

killerdbob commented 1 year ago

I have already fixed the bug, and I fixed some other major bugs as well.