plexguide / PlexGuide.com

PlexGuide is a comprehensive Docker container management solution designed for Ubuntu & Debian OS. While it specializes in deploying media server applications like Plex, the platform incorporates Traefik as a reverse proxy and offers CloudFlare tunnel integration for access to your containerized services.
https://plexguide.com

gdrive service not unmounting cleanly when killed #575

Closed mrmachine closed 5 years ago

mrmachine commented 6 years ago

Describe the bug

Not sure why, but after some seemingly random amount of time my gdrive service will be killed, and it does not start up again.

root@vultr:~# ls -la /mnt/gdrive 
ls: cannot access '/mnt/gdrive': Transport endpoint is not connected
root@vultr:~# systemctl status gdrive
● gdrive.service - GDrive Daemon
   Loaded: loaded (/etc/systemd/system/gdrive.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Mon 2018-09-24 19:13:13 UTC; 4h 31min ago
  Process: 1852 ExecStart=/usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000 --size-only --dir-cache-time=2m --vfs-read-chunk-size=64M --vfs-cache
 Main PID: 1852 (code=killed, signal=KILL)

Sep 24 13:05:24 vultr.guest systemd[1]: Started GDrive Daemon.
Sep 24 19:13:13 vultr.guest systemd[1]: gdrive.service: Main process exited, code=killed, status=9/KILL
Sep 24 19:13:13 vultr.guest systemd[1]: gdrive.service: Failed with result 'signal'.

I am unable to manually restart it:

root@vultr:~# systemctl start gdrive
root@vultr:~# systemctl status gdrive
● gdrive.service - GDrive Daemon
   Loaded: loaded (/etc/systemd/system/gdrive.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2018-09-24 23:54:58 UTC; 6s ago
  Process: 31859 ExecStart=/usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000 --size-only --dir-cache-time=2m --vfs-read-chunk-size=64M --vfs-cach
 Main PID: 31859 (code=exited, status=1/FAILURE)

Sep 24 23:54:58 vultr.guest systemd[1]: Started GDrive Daemon.
Sep 24 23:54:58 vultr.guest rclone[31859]: mount helper error: fusermount: failed to access mountpoint /mnt/gdrive: Transport endpoint is not connected
Sep 24 23:54:58 vultr.guest rclone[31859]: Fatal error: failed to mount FUSE fs: fusermount: exit status 1
Sep 24 23:54:58 vultr.guest systemd[1]: gdrive.service: Main process exited, code=exited, status=1/FAILURE
Sep 24 23:54:58 vultr.guest systemd[1]: gdrive.service: Failed with result 'exit-code'.

But after attempting to restart it, I got distracted for half an hour, and it seems that a second attempt to start it worked:

root@vultr:~# systemctl start gdrive
root@vultr:~# systemctl status gdrive
● gdrive.service - GDrive Daemon
   Loaded: loaded (/etc/systemd/system/gdrive.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-09-25 00:31:58 UTC; 2s ago
 Main PID: 378 (rclone)
    Tasks: 8 (limit: 1110)
   CGroup: /system.slice/gdrive.service
           └─378 /usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000 --size-only --dir-cache-time=2m --vfs-read-chunk-size=64M --vfs-cache-max-age 

Sep 25 00:31:58 vultr.guest systemd[1]: Started GDrive Daemon.

To Reproduce

When running, kill rclone:

root@vultr:~# systemctl status gdrive
● gdrive.service - GDrive Daemon
   Loaded: loaded (/etc/systemd/system/gdrive.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-09-25 00:31:58 UTC; 3min 20s ago
 Main PID: 378 (rclone)
    Tasks: 8 (limit: 1110)
   CGroup: /system.slice/gdrive.service
           └─378 /usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000 --size-only --dir-cache-time=2m --vfs-read-chunk-size=64M --vfs-cache-max-age 

Sep 25 00:31:58 vultr.guest systemd[1]: Started GDrive Daemon.
root@vultr:~# kill -9 378
root@vultr:~# systemctl status gdrive
● gdrive.service - GDrive Daemon
   Loaded: loaded (/etc/systemd/system/gdrive.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2018-09-25 00:35:28 UTC; 3s ago
  Process: 378 ExecStart=/usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000 --size-only --dir-cache-time=2m --vfs-read-chunk-size=64M --vfs-cache-
 Main PID: 378 (code=killed, signal=KILL)

Sep 25 00:31:58 vultr.guest systemd[1]: Started GDrive Daemon.
Sep 25 00:35:28 vultr.guest systemd[1]: gdrive.service: Main process exited, code=killed, status=9/KILL
Sep 25 00:35:28 vultr.guest systemd[1]: gdrive.service: Failed with result 'signal'.
root@vultr:~# ls -la /mnt/gdrive 
ls: cannot access '/mnt/gdrive': Transport endpoint is not connected

Additional attempts to restart it are not working, though. Maybe I need to wait longer?

Expected behavior

What is affected by this bug?

When does this occur?

Not sure what is killing rclone, but the failure to clean up seems to happen any time rclone is killed.

Please tell us about your environment:

Logs

root@vultr:~# pglog
24/09/2018 06:25:50 INFO - Start of Script: Docker Install
24/09/2018 06:25:50 INFO - END of Script: Docker Install
24/09/2018 06:25:50 INFO - Start of Script: Docker Start Assist
24/09/2018 06:25:50 INFO - END of Script: Docker Start Assist
24/09/2018 06:25:50 INFO - Start of Script: PG Installer: WatchTower Install
24/09/2018 06:25:50 INFO - END of Script: PG Installer: WatchTower Install
24/09/2018 06:25:50 INFO - Start of Script: PG Installer: MOTD Install
24/09/2018 06:25:50 INFO - END of Script: PG Installer: MOTD Install
24/09/2018 06:25:50 INFO - Start of Script: PG Installer: RClone Install
24/09/2018 06:25:50 INFO - END of Script: PG Installer: RClone Install
24/09/2018 06:25:50 INFO - Start of Script: PG Installer: Cleaner Functions
24/09/2018 06:25:50 INFO - END of Script: PG Installer: Cleaner Functions
24/09/2018 06:25:50 INFO - Start of Script: PG Installer: Set PG Edition
24/09/2018 06:25:50 
24/09/2018 06:25:50 INFO - END of Script: PG Installer: Set PG Edition
24/09/2018 06:25:50 INFO - END of Script: PG Installer: Set PG Edition
24/09/2018 06:25:50 INFO - Deploying GDrive Interface Menu
24/09/2018 06:26:01 INFO - Ansible Role portainer Started
24/09/2018 06:26:06 INFO - Ansible Role portainer Started
24/09/2018 06:26:13 INFO - Ansible Role portainer Completed
24/09/2018 06:28:14 INFO - Selected: Deploy a Mount System
24/09/2018 06:28:18 INFO - Selected: PG Move - PG Drive
24/09/2018 06:28:25 INFO - DEPLOYED PG Drive
24/09/2018 06:29:21 INFO - Looping: PG Move System Select Interface
24/09/2018 06:29:26 INFO - Looping: PG Move System Select Interface
24/09/2018 06:29:26 INFO - Exiting: PG Move System Select Interface
24/09/2018 06:29:26 INFO - Looping: Transport System Select Interface
24/09/2018 06:29:28 INFO - Looping: Transport System Select Interface
24/09/2018 06:29:28 INFO - Exiting: Transport System Select Interface
24/09/2018 06:29:28 INFO - Looping: Main GDrive Interface Menu
24/09/2018 06:29:33 INFO - Looping: Main GDrive Interface Menu
24/09/2018 06:29:33 INFO - Selected: Exiting PlexGuide
24/09/2018 06:29:33 
24/09/2018 06:29:33 INFO - Start of Script: YML Generation
24/09/2018 06:29:33 INFO - YML List Generated @ /var/plexguide/yml.list
24/09/2018 06:29:33 INFO - END of Script: YML Generation
24/09/2018 13:00:35 INFO - Start of Script: YML Generation
24/09/2018 13:00:35 INFO - YML List Generated @ /var/plexguide/yml.list
24/09/2018 13:00:35 INFO - END of Script: YML Generation
24/09/2018 13:00:35 INFO - Start of Script: YML Generation
24/09/2018 13:00:35 INFO - Deployed PGChecker.bin to GDrive - PGChecker.Bin
24/09/2018 13:00:35 INFO - YML List Generated @ /var/plexguide/yml.list
24/09/2018 13:00:35 INFO - YML List Generated @ /var/plexguide/yml.list
24/09/2018 13:00:35 INFO - YML List Generated @ /var/plexguide/yml.list
24/09/2018 13:00:35 INFO - END of Script: YML Generation
24/09/2018 13:00:35 INFO - END of Script: YML Generation
24/09/2018 13:00:35 INFO - END of Script: YML Generation
24/09/2018 13:00:35 INFO - YML List Generated @ /var/plexguide/yml.list
24/09/2018 13:00:36 INFO - END of Script: YML Generation
24/09/2018 13:00:36 INFO - Start of Script: PG Installer: Pre-Install Tasks
24/09/2018 13:00:36 INFO - END of Script: PG Installer: Pre-Install Tasks
24/09/2018 13:00:36 INFO - Start of Script: Ansible - Install
24/09/2018 13:00:36 INFO - END of Script: Ansible - Install
24/09/2018 13:00:36 INFO - Start of Script: PG Installer: Alias Install
24/09/2018 13:00:36 INFO - END of Script: PG Installer: Alias Install
24/09/2018 13:00:36 INFO - Start of Script: PG Installer: Server ID
24/09/2018 13:00:36 INFO - END of Script: PG Installer: Server ID
24/09/2018 13:00:36 INFO - Start of Script: PG Installer: Dependency Commands
24/09/2018 13:00:36 INFO - END of Script: PG Installer: Dependency Commands
24/09/2018 13:00:36 INFO - Start of Script: Folders & Directories - Install
24/09/2018 13:00:36 INFO - END of Script: Folders & Directories - Install
24/09/2018 13:00:36 INFO - Start of Script: Docker Install
24/09/2018 13:00:36 INFO - END of Script: Docker Install
24/09/2018 13:00:36 INFO - Start of Script: Docker Start Assist
24/09/2018 13:00:36 INFO - END of Script: Docker Start Assist
24/09/2018 13:00:36 INFO - Start of Script: PG Installer: WatchTower Install
24/09/2018 13:00:37 INFO - END of Script: PG Installer: WatchTower Install
24/09/2018 13:00:37 INFO - Start of Script: PG Installer: MOTD Install
24/09/2018 13:00:37 INFO - END of Script: PG Installer: MOTD Install
24/09/2018 13:00:37 INFO - Start of Script: PG Installer: RClone Install
24/09/2018 13:00:37 INFO - END of Script: PG Installer: RClone Install
24/09/2018 13:00:37 INFO - Start of Script: PG Installer: Cleaner Functions
24/09/2018 13:00:37 INFO - END of Script: PG Installer: Cleaner Functions
24/09/2018 13:00:37 INFO - Start of Script: PG Installer: Set PG Edition
24/09/2018 13:00:37 
24/09/2018 13:00:37 INFO - END of Script: PG Installer: Set PG Edition
24/09/2018 13:00:37 INFO - END of Script: PG Installer: Set PG Edition
24/09/2018 13:00:37 INFO - Deploying GDrive Interface Menu
24/09/2018 13:01:03 INFO - Ansible Role portainer Started
24/09/2018 13:01:08 INFO - Selected: Deploy a Mount System
24/09/2018 13:01:16 INFO - Ansible Role portainer Started
24/09/2018 13:01:19 INFO - Selected: PG Move - PG Drive
24/09/2018 13:01:31 INFO - DEPLOYED PG Drive
24/09/2018 13:01:36 INFO - Ansible Role portainer Completed
24/09/2018 13:02:27 INFO - Looping: PG Move System Select Interface
24/09/2018 13:02:49 INFO - Looping: PG Move System Select Interface
24/09/2018 13:02:49 INFO - Exiting: PG Move System Select Interface
24/09/2018 13:02:49 INFO - Looping: Transport System Select Interface
24/09/2018 13:03:16 INFO - Looping: Transport System Select Interface
24/09/2018 13:03:16 INFO - Exiting: Transport System Select Interface
24/09/2018 13:03:16 INFO - Looping: Main GDrive Interface Menu
24/09/2018 13:03:28 INFO - Looping: Main GDrive Interface Menu
24/09/2018 13:03:28 INFO - Selected: Exiting PlexGuide
24/09/2018 13:03:28 INFO - Program Exited Properly
24/09/2018 13:03:28 INFO - Start of Script: YML Generation
24/09/2018 13:03:28 INFO - YML List Generated @ /var/plexguide/yml.list
24/09/2018 13:03:28 INFO - END of Script: YML Generation

Additional context

On my old server, I had a bash script that trapped TERM and EXIT to run fusermount -uz, and that seemed to work reliably. Perhaps we need to run a similar script from ExecStart instead of invoking rclone directly, so cleanup happens more reliably?
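An alternative that stays inside systemd itself: ExecStartPre can lazily unmount any stale mountpoint before rclone starts, and ExecStopPost runs even when the main process dies from a signal. A sketch only (the unit name and mount target are taken from this issue; everything else is an untested assumption, not the shipped PlexGuide unit):

```ini
# /etc/systemd/system/gdrive.service -- sketch, not the actual PlexGuide unit
[Unit]
Description=GDrive Daemon
After=network-online.target

[Service]
Type=simple
# Clear any stale FUSE mount left behind by a previous kill.
# The leading "-" tells systemd to ignore a non-zero exit here.
ExecStartPre=-/bin/fusermount -uz /mnt/gdrive
ExecStart=/usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000
# ExecStopPost runs even when ExecStart is killed, so the mountpoint
# gets cleaned up regardless of how rclone went down.
ExecStopPost=-/bin/fusermount -uz /mnt/gdrive
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```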

Here's my script:

#!/usr/bin/env bash

set -ex

HEALTHCHECK_ID="..."
RCLONE_REMOTE="GdriveMedia"

MOUNT_DIR="$HOME/mnt/$RCLONE_REMOTE"

# Unmount and remove mount point directory.
cleanup() {
    # Mount is up.
    if mount | grep -q "$MOUNT_DIR"; then
        echo "$(date) Unmounting '$MOUNT_DIR'." >> ~/run.log
        if [[ "$OSTYPE" == darwin* ]]; then
            diskutil unmount "$MOUNT_DIR"
        else
            fusermount -quz "$MOUNT_DIR"
        fi
    fi

    echo "$(date) Stopping apps that need '$MOUNT_DIR' to be mounted." >> ~/run.log
    pkill -f Headphones.py || true
    pkill -f NzbDrone.exe || true
#   pkill -f -9 Ombi.exe || true
    pkill -f 'Plex Media Server' || true
    pkill -f Radarr.exe || true
#   pkill -f nzbget || true
    sleep 5

    echo "$(date) Removing mount point directory '$MOUNT_DIR'." >> ~/run.log
    rmdir "$MOUNT_DIR"
}

# Mount point directory already exists. We expect the mount to be up.
if [[ -d "$MOUNT_DIR" ]]; then
    # Files found. Mount is up. Nothing to do.
    if [[ -n "$(ls -A "$MOUNT_DIR")" ]]; then
        echo "$(date) Mount '$MOUNT_DIR' is already up. Nothing to do." >> ~/run.log
        if [[ -n "$HEALTHCHECK_ID" ]]; then
            curl -fsS --retry 3 "https://hchk.io/$HEALTHCHECK_ID" &>/dev/null || true
        fi
        exit 0
    # No files found. Mount is down. Cleanup before re-mounting.
    # This assumption fails when there are legitimately no files in the mount.
    else
        echo "$(date) Mount '$MOUNT_DIR' appears to be down. Cleaning up." >> ~/run.log
        cleanup
    fi
fi

# Cleanup on interrupt or exit.
trap cleanup TERM EXIT

# Kill Plex Media Server *after* rclone mount is up.
(sleep 5; echo "$(date) Restarting Plex Media Server" >> ~/run.log; pkill -f 'Plex Media Server') &

# Mount rclone remote in foreground, to block cleanup until interrupt or exit.
echo "$(date) Mounting '$MOUNT_DIR'." >> ~/run.log
mkdir -p "$MOUNT_DIR"
ionice -c2 -n7 nice -n19 rclone mount --cache-writes --stats=1s --verbose --vfs-cache-mode writes "${RCLONE_REMOTE}Cache:" "$MOUNT_DIR"
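The trap-based cleanup the script relies on can be shown in isolation; on a normal exit (or on SIGTERM), bash runs the registered handler before the shell quits:

```shell
#!/usr/bin/env bash
# Minimal sketch of the trap TERM EXIT pattern used in the script above.
set -e

cleanup() {
    echo "cleanup ran"
}

# Register the handler for SIGTERM and for normal shell exit.
trap cleanup TERM EXIT

echo "doing work"
# When the script ends here, the EXIT trap fires and prints "cleanup ran".
```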
mrmachine commented 6 years ago

Update: I think the issue did not just resolve itself when I got distracted for half an hour. I think I manually executed fusermount -uz /mnt/gdrive (and forgot). I can reproduce:

root@vultr:~# systemctl status gdrive
● gdrive.service - GDrive Daemon
   Loaded: loaded (/etc/systemd/system/gdrive.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-09-25 00:51:11 UTC; 1min 37s ago
 Main PID: 1098 (rclone)
    Tasks: 9 (limit: 1110)
   CGroup: /system.slice/gdrive.service
           └─1098 /usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000 --size-only --dir-cache-time=2m --vfs-read-chunk-size=64M --vfs-cache-max-age

Sep 25 00:51:11 vultr.guest systemd[1]: Started GDrive Daemon.
root@vultr:~# kill -9 1098
root@vultr:~# systemctl status gdrive
● gdrive.service - GDrive Daemon
   Loaded: loaded (/etc/systemd/system/gdrive.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2018-09-25 00:52:56 UTC; 3s ago
  Process: 1098 ExecStart=/usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000 --size-only --dir-cache-time=2m --vfs-read-chunk-size=64M --vfs-cache
 Main PID: 1098 (code=killed, signal=KILL)

Sep 25 00:51:11 vultr.guest systemd[1]: Started GDrive Daemon.
Sep 25 00:52:56 vultr.guest systemd[1]: gdrive.service: Main process exited, code=killed, status=9/KILL
Sep 25 00:52:56 vultr.guest systemd[1]: gdrive.service: Failed with result 'signal'.
root@vultr:~# ls -la /mnt/gdrive 
ls: cannot access '/mnt/gdrive': Transport endpoint is not connected
root@vultr:~# fusermount -uz /mnt/gdrive
root@vultr:~# ls -la /mnt/gdrive 
total 12
drwxr-xr-x  3 root root 4096 Sep 23 13:10 .
drwxr-xr-x 19 root root 4096 Sep 24 04:28 ..
drwxr-xr-x  2 root root 4096 Sep 23 13:10 plexguide
root@vultr:~# systemctl status gdrive
● gdrive.service - GDrive Daemon
   Loaded: loaded (/etc/systemd/system/gdrive.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Tue 2018-09-25 00:52:56 UTC; 29s ago
  Process: 1098 ExecStart=/usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000 --size-only --dir-cache-time=2m --vfs-read-chunk-size=64M --vfs-cache
 Main PID: 1098 (code=killed, signal=KILL)

Sep 25 00:51:11 vultr.guest systemd[1]: Started GDrive Daemon.
Sep 25 00:52:56 vultr.guest systemd[1]: gdrive.service: Main process exited, code=killed, status=9/KILL
Sep 25 00:52:56 vultr.guest systemd[1]: gdrive.service: Failed with result 'signal'.
root@vultr:~# systemctl start gdrive
root@vultr:~# systemctl status gdrive
● gdrive.service - GDrive Daemon
   Loaded: loaded (/etc/systemd/system/gdrive.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-09-25 00:53:29 UTC; 2s ago
 Main PID: 1198 (rclone)
    Tasks: 8 (limit: 1110)
   CGroup: /system.slice/gdrive.service
           └─1198 /usr/bin/rclone --allow-non-empty --allow-other mount gdrive: /mnt/gdrive --uid=1000 --gid=1000 --size-only --dir-cache-time=2m --vfs-read-chunk-size=64M --vfs-cache-max-age

Sep 25 00:53:29 vultr.guest systemd[1]: Started GDrive Daemon.
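The manual recovery above (a lazy fusermount -uz, then systemctl start) could also be scripted. A sketch only, not part of PlexGuide; the service name matches this issue, and the error string is the one a dead FUSE mount reports:

```shell
#!/usr/bin/env bash
# Sketch: detect a dead FUSE mountpoint and recover it.

recover_stale_mount() {
    local dir="$1"
    # A mount whose rclone process was killed reports this error on access.
    if ls "$dir" 2>&1 | grep -q "Transport endpoint is not connected"; then
        fusermount -uz "$dir"       # lazy unmount clears the dead mount
        systemctl start gdrive      # then the service can start cleanly
        echo "recovered"
    else
        echo "ok"
    fi
}

recover_stale_mount /tmp    # a healthy directory prints "ok"
```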
Admin9705 commented 6 years ago

Yeah, I ran into this on a test. Will look to see what can be done.

Admin9705 commented 5 years ago

Did some rewrites and have not seen any new reports. Please reopen if encountered.