Share on external USB drive on RPi: folder content unavailable after a while

hollie commented 3 years ago

Share appears empty after running the container for a while

Note: this is a FYI notification of an issue that I am experiencing with my specific setup. The goal is to document what I see and hopefully to be able to resolve the issue. I am quite sure that docker-timemachine is not the cause of the issue but the fact the issue pops up is quite annoying when using the docker-timemachine specifically because it causes the backup process to fail. Which is not the idea of course 😄

I am running the latest mbentley/timemachine:smb-armv7l on a RPi4. I mount a USB hard drive with different partitions for every client under the /media folder on the RPi.

Then I have the docker image configured to serve each partition as a separate network share to different clients that I want to backup. I use this configuration to ensure that the different backup images don't compete with each other for space.

The hard drive is configured to go to sleep after a period of inactivity on the host using hd-idle.

What happens is that after some time of the container running the timemachine services on a single client starts complaining that it cannot complete the backup process because the 'network backup drive does not support the required service'. (That last sentence is translated so the wording could be off. The exact phrase is 'de netwerkreservekopieschijf ondersteunt de vereiste voorzieningen niet').

Indeed, when I try to mount the network share from a client machine (either the machine that produced the error, or a different machine) the network share appears to be empty. When I try to create a file on the share it fails.

When I ssh into the server (the host) and check the mounted hard drive folder I can see the expected content. When I enter the docker image using docker-compose exec timemachine /bin/ash and navigate to the network share under opt I see the folder is empty which matches with what SMD serves to the client. So it is logical that SMB in the container does not serve the files in the share anymore.

The weird thing is that this only happens for a single share and that machines that backup up to other shares continue to work as expected. So I end up with a timemachine server that allows one client to backup and another one not.

My current thinking is that this is related to the hard drive going to sleep and docker not being able to recover a functional mapping of the hard drive mount point when the drive needs to wakeup from sleep.

I'll now first try to disable the HD sleep to see if this is a first workaround to keep the backup process on the multiple machines working as expected.

Finally I can add: restarting the docker container results in a working backup process on all clients for some time, and then after some time backup fails for another client with the same behaviour as described above.

So as a recap: my current line of thinking is that the hd-idle messes up with docker being able to mount the USB share point after running the container for a while. Probably this fails when the drive is spinning up, and then other machines that try to backup while the drive is still spinning are able to access the folder contents. If anybody has similar experiences or a solution I am happy to learn about it.

To Reproduce Steps to reproduce the behavior:

mount multiple partitions on a USB drive
configure hd-idle to sleep the drive
configure multiple machines to backup to the server
let it run for a few days
when one of the machines reports a failure to backup, login to the docker host and verify that the USB mount point does indeed still work as expected. Then enter the docker container and verify that inside the container the folder appears to be empty.

Expected behavior I expect the backup service to continue working without issues.

How you're launching your container

docker-compose.yml

services:
  timemachine:
    hostname: timemachine
    mac_address: "redacted"
    networks:
      timemachine:
        ipv4_address: redacted
    environment:
      - CUSTOM_SMB_CONF=true
      - CUSTOM_USER=false
      - DEBUG_LEVEL=1
      - EXTERNAL_CONF=
      - HIDE_SHARES=no
      - MIMIC_MODEL=TimeCapsule8,119
      - TM_USERNAME=timemachine
      - TM_GROUPNAME=timemachine
      - TM_UID=1000
      - TM_GID=1000
      - PASSWORD=redacted
      - SET_PERMISSIONS=false
      - SHARE_NAME=TimeMachine
      - SMB_INHERIT_PERMISSIONS=no
      - SMB_NFS_ACES=yes
      - SMB_METADATA=stream
      - SMB_PORT=445
      - SMB_VFS_OBJECTS=acl_xattr fruit streams_xattr
      - VOLUME_SIZE_LIMIT=0
      - WORKGROUP=WORKGROUP
      - TZ=Europe/Brussels
    restart: unless-stopped
    volumes:
      - /media:/opt
      - ./timemachine/conf/smb.conf:/etc/samba/smb.conf
      - timemachine-var-lib-samba:/var/lib/samba
      - timemachine-var-cache-samba:/var/cache/samba
      - timemachine-run-samba:/run/samba
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
    container_name: timemachine
    image: mbentley/timemachine:smb-armv7l

volumes:
  timemachine-var-lib-samba:
  timemachine-var-cache-samba:
  timemachine-run-samba:

networks:
  timemachine:

smb.conf

[global]
   access based share enum = no
   hide unreadable = no
   inherit permissions = no
   load printers = no
   log file = /var/log/samba/log.%m
   logging = file
   max log size = 1000
   security = user
   server min protocol = SMB2
   server role = standalone server
   smb ports = 445
   workgroup = WORKGROUP
   vfs objects = acl_xattr fruit streams_xattr
   fruit:aapl = yes
   fruit:nfs_aces = yes
   fruit:model = TimeCapsule8,119
   fruit:metadata = stream
   fruit:veto_appledouble = no
   fruit:posix_rename = yes
   fruit:wipe_intentionally_left_blank_rfork = yes
   fruit:delete_empty_adfiles = yes

[mars]
   path = /opt/mars
   inherit permissions = no
   read only = no
   valid users = timemachine
   vfs objects = acl_xattr fruit streams_xattr
   fruit:time machine = yes
   fruit:time machine max size =

[venus]
   path = /opt/venus
   inherit permissions = no
   read only = no
   valid users = timemachine
   vfs objects = acl_xattr fruit streams_xattr
   fruit:time machine = yes
   fruit:time machine max size =

Then the various partitions on the hard drive are mounted in /etc/fstab:

proc            /proc           proc    defaults          0       0
LABEL=mars   /media/mars   ext4 defaults,nofail,auto 0 2
LABEL=venus  /media/venus  ext4 defaults,nofail,auto 0 2

Container Logs

Nothing out of the ordinary errors that also appear for the mount points that are working:

timemachine    | error in mds_init_ctx for: /opt/mars
timemachine    | _mdssvc_open: Couldn't create policy handle for mars
timemachine    | Failed to fetch record!
timemachine    | pcap cache not loaded

But those appear for the other mount points that are working too from time to time.

hollie commented 3 years ago

mbentley commented 3 years ago

I would say that checking /var/log/message and the output of dmesg could also lead to any additional clues as to what is going on.

hollie commented 3 years ago

Hey Matt,

you have a point, a relevant entry from /var/log/messages from the time after the last successful backup worked was:

Apr  1 22:30:10 pi03 kernel: [ 5439.278011] sd 0:0:0:0: [sda] tag#6 uas_eh_abort_handler 0 uas-tag 12 inflight: OUT
Apr  1 22:30:10 pi03 kernel: [ 5439.278033] sd 0:0:0:0: [sda] tag#6 CDB: opcode=0x8a 8a 00 00 00 00 00 c4 34 80 00 00 00 01 88 00 00
Apr  1 22:30:10 pi03 kernel: [ 5439.278405] sd 0:0:0:0: [sda] tag#5 uas_eh_abort_handler 0 uas-tag 11 inflight: OUT
Apr  1 22:30:10 pi03 kernel: [ 5439.278422] sd 0:0:0:0: [sda] tag#5 CDB: opcode=0x8a 8a 00 00 00 00 00 c5 23 ef e0 00 00 00 30 00 00
Apr  1 22:30:12 pi03 kernel: [ 5441.118027] sd 0:0:0:0: [sda] tag#9 uas_eh_abort_handler 0 uas-tag 1 inflight: OUT
Apr  1 22:30:12 pi03 kernel: [ 5441.118047] sd 0:0:0:0: [sda] tag#9 CDB: opcode=0x8a 8a 00 00 00 00 00 c4 34 7d 88 00 00 02 78 00 00
Apr  1 22:30:12 pi03 kernel: [ 5441.158050] scsi host0: uas_eh_device_reset_handler start

I'll report back if I can let my setup run for a few days without issues without idling the drive. Then I can continue testing with the hd-idle re-enabled.

Hollie.

hollie commented 3 years ago

Just as an update: after removing the hd-idle script the drive keeps spinning (of course) and no longer results in the folder being unaccessible from within the docker container. So in the end the issue was not related to the timemachine docker image itself but due to an effect that the mounted drive becomes unavailable to docker as a whole after spinning up and down between active and standby state.

The drive is a LaCie d2 Professional USB 3.1-C 8TB running the latest firmware according to the firmware updater tool (04/2021).

I hope this helps somebody else who is experiencing the same issue.

mbentley / docker-timemachine

Share on external USB drive on RPi: folder content unavailable after a while #86