skoobe / riofs

Userspace S3 filesystem
GNU General Public License v3.0

Latest commits affect stability #149

Status: Closed (crodwell closed 6 years ago)

crodwell commented 6 years ago

Something in one of the last two commits has affected the stability of RioFS:
https://github.com/skoobe/riofs/commit/26b76a6fbba6e2ad10cde74dec5ca0560a7ee85f
https://github.com/skoobe/riofs/commit/74eba4876040dbe8ce3b07c9bbbc3dbb822e3780

I've been using a February build (and earlier ones) for S3-backed SFTP storage for years with no issues. Under high write load on my Debian jessie servers, the current July build crashes every few days and cannot be recovered without a umount. It logs the following errors:

ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 20 of 20
ERROR! [con: 0x252e420] Reached the maximum number of retries !
ERROR! [ino: 6927, con: 0x252e420] Failed to send buffer to server !
ERROR! [ino: 6927] Write call with offset 62941341 is not allowed !
ERROR! [ino: 6927] Write call with offset 62973970 is not allowed !
ERROR! [ino: 6927] Write call with offset 63006599 is not allowed !
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 1 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 2 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 3 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 4 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 5 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 6 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 7 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 8 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 9 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 10 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 11 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 12 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 13 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 14 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 15 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 16 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 17 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 18 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 19 of 20
ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 20 of 20
ERROR! [con: 0x252dea0] Reached the maximum number of retries !
ERROR! [ino: 6927, con: 0x252dea0] Failed to send Multipart data to the server !
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 1 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 2 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 3 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 4 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 5 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 6 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 7 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 8 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 9 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 10 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 11 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 12 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 13 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 14 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 15 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 16 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 17 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 18 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 19 of 20
ERROR! [con: 0x252e420] Request failed !
ERROR! [con: 0x252e420] Server returned HTTP error ! Retry ID: 20 of 20
ERROR! [con: 0x252e420] Reached the maximum number of retries !
ERROR! [ino: 6942, con: 0x252e420] Failed to send buffer to server !
ERROR! Mountpoint /d1/s3 does not exist! Please check directory permissions!
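
Most of the log above is the same message repeated with a different retry counter, so it may be easier to read once condensed. The following is a minimal sketch of one way to do that, assuming only the message shapes visible in the log above; `summarize` is a hypothetical helper written for this thread and is not part of RioFS.

```python
import re
from collections import Counter

# Hypothetical helper, not part of RioFS: condense a RioFS error log.
CON_RE = re.compile(r"con: (0x[0-9a-f]+)")
RETRY_RE = re.compile(r"\s*Retry ID: \d+ of \d+")


def summarize(log_text):
    """Count error messages per connection, ignoring the retry counter."""
    counts = Counter()
    # Split on the "ERROR!" prefix rather than on newlines, so this works
    # whether or not the log has kept its original line breaks.
    for entry in re.split(r"ERROR!\s*", log_text):
        entry = " ".join(entry.split())  # normalise whitespace
        if not entry:
            continue
        con = CON_RE.search(entry)
        # Drop the leading "[...]" tag and the retry counter to get the bare message.
        message = re.sub(r"^\[[^\]]*\]\s*", "", entry)
        message = RETRY_RE.sub("", message).strip()
        counts[(con.group(1) if con else "-", message)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 1 of 20 "
        "ERROR! [con: 0x252dea0] Server returned HTTP error: 400 (Bad Request)! Retry ID: 2 of 20 "
        "ERROR! [con: 0x252dea0] Reached the maximum number of retries !"
    )
    for (con, message), n in summarize(sample).most_common():
        print(f"{n:4d}  {con}  {message}")
```

Entries without a connection tag (such as the write-offset and mountpoint errors) are grouped under "-". Messages that embed a changing value, like the write offset, are counted separately.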

wizzard commented 6 years ago

Hello, could you please tell me which region you are using?

crodwell commented 6 years ago

Hi, it's us-east-1. I should point out that I deployed it using Ansible on new EC2 instances (m4.large) rather than updating what was working (t2.medium), so perhaps the build is not to blame.

wizzard commented 6 years ago

us-east-1 uses the V2 protocol, so the new S3 protocol is not the cause; this is a different issue. Unfortunately, with this limited log I can't tell you exactly what is causing your problem. Could you please check the logs and see whether you get an HTTP error right when you start RioFS (before performing any file operations)?

crodwell commented 6 years ago

No, nothing is output on start. Nothing ever goes to STDERR, and the errors above come from STDOUT only when it's failing. Is there a way to get something more verbose? I run it from supervisord and am only reading what's in /var/log/supervisor.

crodwell commented 6 years ago

I mount 5 S3 buckets with riofs on each of my 4 SFTP servers; only the S3 bucket receiving high writes is failing intermittently. I ran atop on one of these servers and can see that riofs consumes up to 100% of memory before the failure occurs; riofs then exits with code 9. supervisord is unable to recover because the mount point is in use. I've now restarted riofs with -v turned on, with log rotation, to see if I can get more logging. The volume of writes I get means the riofs logs grow at a rate of almost 10GB per hour.
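
The memory growth described above could also be recorded over time with far less output than verbose logging. This is a rough sketch, assuming a Linux host where /proc/<pid>/status exposes a VmRSS line; the function names and the 10-second interval are illustrative choices, not anything provided by RioFS or supervisord.

```python
import re
import sys
import time

# Matches the resident set size line in /proc/<pid>/status, e.g. "VmRSS:   123456 kB".
VMRSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB", re.MULTILINE)


def parse_vmrss_kb(status_text):
    """Return the resident set size in kB from /proc/<pid>/status text, or None."""
    match = VMRSS_RE.search(status_text)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vmrss_kb(fh.read())


def watch(pid, interval_s=10.0):
    """Print a timestamped RSS sample every interval_s seconds until the process exits."""
    while True:
        try:
            rss = read_rss_kb(pid)
        except (FileNotFoundError, ProcessLookupError):
            print(f"{time.strftime('%F %T')} pid {pid} is gone")
            return
        print(f"{time.strftime('%F %T')} rss_kb={rss}", flush=True)
        time.sleep(interval_s)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        watch(int(sys.argv[1]))
    else:
        print("usage: rss_watch.py <pid-of-riofs>")
```

Run against the riofs process ID, this produces one short line per sample, which should make it easier to see whether memory climbs steadily or jumps just before the exit.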

crodwell commented 6 years ago

Interestingly, I reduced the number of SFTP servers writing to the same S3 bucket from 4 to 3, as I had before, and the RioFS mounts are stable again.