lukechampine / user

A CLI renter for Sia
MIT License
12 stars 2 forks source link

Panic error when trying to download #8

Open grigzy28 opened 5 years ago

grigzy28 commented 5 years ago

After having ~15 host contracts expire, trying to download from the rest (out of 50) it gets this panic error(plus it was about 5 minutes before the error showed up after issuing the download command.)

root@sia-test:~# user download DOGP.zip.50hosts.usa
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x5f7906]

goroutine 132 [running]:
lukechampine.com/us/renter.(*ShardDownloader).CopySection(0x0, 0x92af80, 0xc003e65b90, 0x0, 0x400000, 0xc0001b09c0, 0xc003df8e10)
    /root/go/src/lukechampine.com/us/renter/download.go:116 +0x26
lukechampine.com/us/renter/renterutil.(*PseudoFS).fileReadAt.func1(0xc003f2e000, 0xc0000c4000, 0x32, 0x32, 0x0, 0x400000, 0xc003f2c000, 0x32, 0x32, 0xc000179c20)
    /root/go/src/lukechampine.com/us/renter/renterutil/fileops.go:430 +0x11c
created by lukechampine.com/us/renter/renterutil.(*PseudoFS).fileReadAt
    /root/go/src/lukechampine.com/us/renter/renterutil/fileops.go:427 +0x499
root@sia-test:~# 
lukechampine commented 5 years ago

Just pushed https://github.com/lukechampine/us/commit/e35e7d615d1b2b5d9ee5f96fc9f49c5b89368803 which should fix this.

grigzy28 commented 5 years ago

This is what I got when trying to download the file after the update.

root@sia-test:~# user
user v0.5.0
Commit:     394a6a7
Release:    standard
Go version: go1.12.7 linux/amd64
Build Date: Mon Jul 29 08:54:26 CDT 2019
root@sia-test:~# user download DOGP.zip.50hosts.usa

Download failed: too many hosts did not supply their shard:
65b443b1: write tcp 192.168.1.4:48950->99.24.137.30:9982: write: broken pipe
e90e13f6: write tcp 192.168.1.4:59142->93.185.103.69:9982: write: broken pipe
54aecc1a: write tcp 192.168.1.4:36982->91.210.26.173:9982: write: broken pipe
31b46fe6: write tcp 192.168.1.4:34016->86.89.6.218:9982: write: broken pipe
068b92bd: write tcp 192.168.1.4:59222->94.181.94.119:9982: write: broken pipe
a5781046: write tcp 192.168.1.4:38698->188.63.201.170:9982: write: broken pipe
232ac2c4: write tcp 192.168.1.4:46648->136.52.73.104:9982: write: broken pipe
f1120bd9: write tcp 192.168.1.4:32844->85.253.59.74:9982: write: broken pipe
a049f625: no record of that host
c4392107: write tcp 192.168.1.4:32962->37.110.43.227:9982: write: broken pipe
19e391a6: no record of that host
ff01fbb0: dial tcp 89.175.213.66:9982: connect: connection timed out
2964c021: write tcp 192.168.1.4:54868->69.116.14.91:9982: write: broken pipe
60c1bb4b: write tcp 192.168.1.4:52692->60.42.165.199:9982: write: broken pipe
16fee7f2: write tcp 192.168.1.4:36378->95.23.183.8:9982: write: broken pipe
2a427bd2: write tcp 192.168.1.4:37694->213.183.59.173:9982: write: broken pipe
90d8e058: write tcp 192.168.1.4:49078->217.76.27.226:9982: write: broken pipe
4aa9bba2: write tcp 192.168.1.4:45686->188.243.4.139:9982: write: broken pipe
938c7b7a: dial tcp 47.145.61.39:9982: connect: connection timed out
80123282: write tcp 192.168.1.4:44356->59.167.191.60:9982: write: broken pipe
5463e8bf: write tcp 192.168.1.4:47946->126.150.92.41:9982: write: broken pipe
1b0feb79: write tcp 192.168.1.4:34366->190.19.225.96:9982: write: broken pipe
d17ca562: dial tcp 93.66.72.225:9982: connect: connection timed out
86b2eb34: write tcp 192.168.1.4:36862->87.79.193.8:40335: write: broken pipe
f2c044e5: unexpected EOF
b7b3f35f: unexpected EOF
cbf46f11: unexpected EOF
b2eea54e: unexpected EOF
5bdf37aa: unexpected EOF
e96501f8: unexpected EOF
ea3106bf: unexpected EOF
d9cd1249: write tcp 192.168.1.4:55748->108.45.150.184:9982: write: broken pipe
32b0a73f: write tcp 192.168.1.4:59602->63.155.20.95:9982: write: broken pipe
4074d176: unexpected EOF
2161a3b2: unexpected EOF
3edb0748: unexpected EOF
f037506e: no record of that host
57dbc08d: dial tcp 80.56.242.67:9982: connect: connection timed out
5ea754df: no record of that host
0d61e499: write tcp 192.168.1.4:41612->103.1.185.150:9982: write: broken pipe
9eef4a32: no record of that host
ce644456: write tcp 192.168.1.4:59142->195.158.109.58:9982: write: broken pipe
4b29dfcd: write tcp 192.168.1.4:47064->86.244.150.77:9982: write: broken pipe
76f9101f: unexpected EOF
0a04704f: unexpected EOF
fc608eeb: unexpected EOF
3e9aadf1: write tcp 192.168.1.4:50486->79.114.51.149:9982: write: broken pipe
92ab5556: write tcp 192.168.1.4:43544->101.100.158.73:9982: write: broken pipe
13180d9b: unexpected EOF
root@sia-test:~#

Could all the broken pipes be caused by the connection timed out error, similar to the issue with the upload? There are enough active hosts(assuming the broken pipes are actually good) to be able to put together this test file. It was uploaded with the -m 10 parameter. Can't explain the "unexpected EOF" in what that means, Possibly active host, no contract anymore or would that be the no record result?

lukechampine commented 5 years ago

"no record" means that the host is listed in the metafile, but wasn't found in the set of contracts being used. "unexpected EOF" means we tried to read something, but the connection was closed. "broken pipe" means that we tried to write something, but the connection was closed.

The main reason it's hard to debug this is that the errors don't have enough context. We know that the connection failed at some point, but we don't know what we were trying to do when it failed. I just pushed https://github.com/lukechampine/us/commit/3608c5e1c76a2c601671d8e06546e00d279b3a65 which wraps those errors with additional context. If you can reproduce the errors with that commit, it should be much easier to figure out what's going wrong.

grigzy28 commented 5 years ago

All I get now is this...

root@sia-test:~# user download DOGP.zip.50hosts.usa

Download failed: insufficient hosts to recover file data (needed 10, got 0)

It doesn't even attempt to get any hosts. Immediate failure.

lukechampine commented 5 years ago

It's possible that it did try to connect, and failed, but didn't print the failure message(s). https://github.com/lukechampine/us/commit/5dfe5a459d6831d2eff706ca7dfca462bf13f2d6 should fix that.

It's also possible that you no longer have any contract overlap with the file's hosts. Did you disable them? Did they expire? You can compare the output of user contracts list and user info DOGP.zip.50hosts.usa to see the overlap.

grigzy28 commented 5 years ago

Okay, downloaded and recompiled, here is the results...

root@sia-test:~# user download DOGP.zip.test50.usa

Download failed: insufficient hosts to recover file data (needed 10, got 0):
4aa9bba2: could not resolve host key: [request failed; Get http://localhost:9980/hostdb/hosts/ed25519:4aa9bba23268efc40e364c04b8fa04be15ce0510738bdf1ef2ed4e88036f21a8: dial tcp [::1]:9980: connect: connection refused]
57dbc08d: could not resolve host key: [request failed; Get http://localhost:9980/hostdb/hosts/ed25519:57dbc08d7ac48f583f5fdcf03e49879d33dfab66eeb3792b5b56119bea960b32: dial tcp [::1]:9980: connect: connection refused]
... and so on

This is why it was an immediate failure, it is trying to connect to my localhost computer instead of the remote sia host computers. There was another error that listed no record of that host, probably a disabled contract but there was only one of that error. Thanks

lukechampine commented 5 years ago

Interesting. I wonder if perhaps we should request the current block height or something before proceeding with the download, just to check if the server is reachable. Would improve UX a little, and the overhead would be negligible.

grigzy28 commented 5 years ago

I am sorry for this, but I had rebooted that virtual machine after some updates and had forgotten to start siad. Which is why it failed to connect to the localhost:9980.

however, it did start working, here were the results, it did take somewhere between 5-10 minutes to get it. I wasn't at the console after about five minutes had already passed.


root@sia-test:~# user download DOGP.zip.test50.usa

Download failed: too many hosts did not supply their shard:
57dbc08d: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:32876->80.56.242.67:9982: write: connection timed out
f2c044e5: NewSession: NewUnlockedSession: dial tcp 72.210.62.141:9982: connect: connection timed out
4aa9bba2: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:33160->188.243.4.139:9982: write: broken pipe
2964c021: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:35254->69.116.14.91:9982: write: broken pipe
c4392107: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:42156->37.110.43.227:9982: write: broken pipe
1b0feb79: NewSession: NewUnlockedSession: dial tcp 190.19.225.96:9982: connect: no route to host
2161a3b2: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:36798->78.34.199.27:40315: write: broken pipe
13180d9b: Read: ReadResponse: unexpected EOF
232ac2c4: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:53078->136.52.73.104:9982: write: broken pipe
a5781046: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:41904->188.63.201.170:9982: write: broken pipe
54aecc1a: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:42694->91.210.26.173:9982: write: broken pipe
0a04704f: NewSession: NewUnlockedSession: dial tcp 93.175.199.149:9982: connect: connection refused
b7b3f35f: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:58426->86.153.94.63:9982: write: broken pipe
60c1bb4b: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:41474->60.42.165.199:9982: write: broken pipe
4074d176: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:60608->88.208.237.16:9982: write: broken pipe
ff01fbb0: NewSession: NewUnlockedSession: dial tcp 89.175.213.66:9982: connect: connection timed out
0d61e499: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:58840->103.1.185.150:9982: write: broken pipe
86b2eb34: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:44452->78.34.199.27:40335: write: broken pipe
9eef4a32: no record of that host
80123282: NewSession: NewUnlockedSession: dial tcp 59.167.191.60:9982: connect: connection timed out
068b92bd: NewSession: NewUnlockedSession: dial tcp 94.181.94.119:9982: connect: connection timed out
5bdf37aa: Read: ReadResponse: unexpected EOF
ac25d9f9: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:47678->178.201.150.19:9982: write: broken pipe
90d8e058: Read: ReadResponse: unexpected EOF
b2eea54e: Read: ReadResponse: unexpected EOF
d9cd1249: Read: ReadResponse: unexpected EOF
e90e13f6: Read: couldn't write RPC ID: WriteRequest: write tcp 192.168.1.4:36318->93.185.103.69:9982: write: broken pipe
16fee7f2: Read: ReadResponse: unexpected EOF
2a427bd2: Read: ReadResponse: unexpected EOF
4b29dfcd: NewSession: NewUnlockedSession: dial tcp 86.244.150.77:9982: connect: connection timed out
d17ca562: could not resolve host key: host announcement not found
ea3106bf: Read: ReadResponse: unexpected EOF
fc608eeb: Read: ReadResponse: unexpected EOF
76f9101f: Read: ReadResponse: unexpected EOF
65b443b1: Read: ReadResponse: communication error: renter is requesting revision after the revision deadline
cbf46f11: NewSession: NewUnlockedSession: NewRenterSession: read tcp 192.168.1.4:34614->82.161.41.26:9982: i/o timeout
3edb0748: Read: ReadResponse: communication error: renter is requesting revision after the revision deadline
32b0a73f: Read: ReadResponse: communication error: renter is requesting revision after the revision deadline
5463e8bf: Read: ReadResponse: communication error: renter is requesting revision after the revision deadline
3e9aadf1: Read: ReadResponse: communication error: renter is requesting revision after the revision deadline
92ab5556: Read: ReadResponse: communication error: renter is requesting revision after the revision deadline```

I am going to disable some of the contracts and see what it does (the timeout one mainly since it seems to cause the other errors like in uploading.)
lukechampine commented 5 years ago

Ok. Well, we're making progress at least.. There are two concerning classes of errors here:

What's strange is that, in both cases, the host should be sending us an error indicating why it's closing the connection. But we're either not receiving that error or not processing it correctly. My guess is that the host is calling SetDeadline on the connection, which means it will automatically close after the deadline. So even if the host wanted to tell us about the timeout, it wouldn't be able to, because the connection would be closed.

Right now the host is definitely using a too-short deadline. We should increase that in v1.4.2. But in the meantime, we need some other way of ensuring that we write our request before the deadline is reached.

grigzy28 commented 5 years ago

I disabled all the connection timed out contracts. Then did attempt the download, I got this

root@sia-test:~# user download DOGP.zip.test50.usa
DOGP.zip.test50                                                                                                                                                         33%   125.83 MB   455.4 KB/s    
root@sia-test:~#

It proceeded to download but didn't complete, and didn't give any errors as to why it stopped.

grigzy28 commented 5 years ago

The file did download correctly, eventually, stopping at 66%(again without any feedback) and then finished with 100% on the third attempt. The file did fail verification of zip testing, but that is my fault. I had been testing so many variables that I copied a file that was partially downloaded from a previous test.

This was also after disabling all contracts that had the connection timed out error and only those errors, nothing else was disabled in the list of errors above. Just like the upload error, when disabling all the connection timed out errors it would proceed with the upload.

lukechampine commented 5 years ago

hmm. If DOGP.zip.test50 was already present on-disk, then it's possible that it stopped at 33% because the file was already 67% downloaded. But that behavior would be confusing; it should start at 67% and continue to 100%.

Can you test this behavior again? You can simply Ctrl-C in the middle of a download, then run the download command again. Then observe what percentage it starts and ends at.

grigzy28 commented 5 years ago

The file wasn't present on disk, I started the download with no file present on the disk, it stopped at 33% and exited on it's own without any errors or other indications as to why it quit downloading. Restarted with the file already present at 33%, it started from the 33% and went to 67% which is where it stopped and exited again without any indication as to why. Then the third time it finished at 100% after starting from the 67%. So it performed normal as far as resuming and displaying where it started resuming from, the odd behavior is why it just stopped and exited without any error or reason being listed but was able to resume without any problem.

MeijeSibbel commented 4 years ago

https://github.com/storewise/s3-gateway/issues/68