mbentley / docker-timemachine

Docker image to run Samba (compatible Time Machine for macOS)
Apache License 2.0

Performance observation #106

Closed bugsyb closed 10 months ago

bugsyb commented 2 years ago

Just to share a performance-related observation.

Two scenarios, backing up the same Docker host. The difference is in the containers (in fact, still two variables): a) the first is the default docker-timemachine image; b) the second is based on arm64v8/ubuntu:latest with samba and avahi-daemon then installed. [network usage graphs]

It is not entirely clear to me why there is such a difference in performance, beyond the fact that it is noticeable. I'm still in the testing phase of getting Time Machine working in terms of migrating an old backup, but I noticed this behavior, and if someone wants to test further, it might be interesting. From a repeatability perspective, both scenarios are similar - a SuperDuper copy of HDD-based Time Machine backups to the same host, just a different container image and slightly different config - though nothing outside the vfs element looks suspect to me. The catia module, even if enabled, shouldn't have a perf impact. In both tests the same set of files is read and stored - so pretty repeatable.

I was surprised and thought you guys might want to know about the observations made. Alpine is said to be lighter and as such faster. I'm not comparing apples to apples here, though it's still pretty close.

[global]
access based share enum = no
hide unreadable = no
inherit permissions = no
load printers = no
log file = /var/log/samba/log.%m
logging = file
max log size = 1000
security = user
server min protocol = SMB2
server role = standalone server
smb ports = 445
workgroup = WORKGROUP
vfs objects = acl_xattr fruit streams_xattr
fruit:aapl = yes
fruit:nfs_aces = yes
fruit:model = TimeCapsule8,119
fruit:metadata = stream
fruit:veto_appledouble = no
fruit:posix_rename = yes
fruit:wipe_intentionally_left_blank_rfork = yes
fruit:delete_empty_adfiles = yes
fruit:zero_file_id = yes

Ubuntu +
server string = %h server (Samba, Ubuntu)
panic action = /usr/share/samba/panic-action %d
obey pam restrictions = yes
unix password sync = yes
passwd program = /usr/bin/passwd %u
passwd chat = *Enter\snew\s*\spassword:* %n\n *Retype\snew\s*\spassword:* %n\n *password\supdated\ssuccessfully* .
pam password change = yes
map to guest = bad user
usershare allow guests = yes

[TimeCapsule-myuser]
path = /host/timecapsule
inherit permissions = no
read only = no
valid users = myuser
vfs objects = acl_xattr fruit streams_xattr
fruit:time machine = yes
fruit:time machine max size = 1 T
fruit:zero_file_id = yes

+Ubuntu
comment = MyUserCapsule
browseable = yes
guest ok = no
writable = yes
valid users = myuser
vfs objects = catia fruit streams_xattr
fruit:metadata = stream
fruit:model = MacSamba
fruit:posix_rename = yes
fruit:veto_appledouble = no
fruit:wipe_intentionally_left_blank_rfork = yes
fruit:delete_empty_adfiles = yes
mbentley commented 2 years ago

I've never really dug too deeply into why performance tends to be... disappointing with Time Machine over the network, but I had basically just accepted it, and since my backups happen at night it doesn't end up being a big deal for me personally.

I'm not exactly following what is different outside of using Ubuntu vs. Alpine, as I'm not sure what diff syntax that's using, but I'm always more than happy to try out settings that might make a difference performance-wise.

ZetaPhoenix commented 2 years ago

I have also noticed that backups to this container have been much slower compared to backing up to my NAS and had been wanting to see if there were tweaks that could be made as well.

For example, a backup to this Docker target gave me this: [throughput screenshot]

My host is x86, and I looked at usage while a backup was in progress and did not see anything that stood out (CPU was mostly idle, peaking at maybe 10% with the other containers included).

mbentley commented 2 years ago

I was curious whether it might be something about running Samba in general, so I mounted a Samba share from my native host OS and one from the Alpine-based timemachine container and did a basic file copy from a Linux VM. I actually saw better performance from the container for a basic file copy - take that with a grain of salt, as the backing disks differ (one is a spinning-rust array, the other NVMe flash) - but on both of them, performance was nowhere near as slow as what I see when backing up through Time Machine.
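
For anyone who wants to repeat that rough comparison, a minimal sketch (share name, mount point, and credentials are hypothetical):

    # mount the share from a Linux VM, then time a large sequential write
    mount -t cifs //server/timemachine /mnt/tm -o username=myuser
    dd if=/dev/zero of=/mnt/tm/testfile bs=1M count=1024 conv=fsync status=progress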

I would love to see a Samba config from a NAS that has native time machine support to compare the arguments to see if there is anything that can be done to squeeze out some performance if anyone can retrieve one.

ZetaPhoenix commented 2 years ago

I can try and dump my Synology. Is there an easy command to do so or something I can do from my Mac?

mbentley commented 2 years ago

According to this thread, it might just be the default /etc/samba/smb.conf so a cat /etc/samba/smb.conf should do it, assuming the path hasn't changed since 2016.

ZetaPhoenix commented 2 years ago

This is the result.

$ cat /etc/samba/smb.conf
# Copyright (c) 2000-2019 Synology Inc. All rights reserved.
#
#
#                          ______     _______
#                        (  __  \   (  ___  )
#                        | (  \  )  | (   ) |
#                        | |   ) |  | |   | |
#                        | |   | |  | |   | |
#                        | |   ) |  | |   | |
#                        | (__/  )  | (___) |
#                        (______/   (_______)
#
#                   _          _______   _________
#                  ( (    /|  (  ___  )  \__   __/
#                  |  \  ( |  | (   ) |     ) (
#                  |   \ | |  | |   | |     | |
#                  | (\ \) |  | |   | |     | |
#                  | | \   |  | |   | |     | |
#                  | )  \  |  | (___) |     | |
#                  |/    )_)  (_______)     )_(
#
#   _______    _______    ______    _________   _______
#  (       )  (  ___  )  (  __  \   \__   __/  (  ____ \  |\     /|
#  | () () |  | (   ) |  | (  \  )     ) (     | (    \/  ( \   / )
#  | || || |  | |   | |  | |   ) |     | |     | (__       \ (_) /
#  | |(_)| |  | |   | |  | |   | |     | |     |  __)       \   /
#  | |   | |  | |   | |  | |   ) |     | |     | (           ) (
#  | )   ( |  | (___) |  | (__/  )  ___) (___  | )           | |
#  |/     \|  (_______)  (______/   \_______/  |/            \_/
#
#
# IMPORTANT: Synology will not provide technical support for any issues
#            caused by unauthorized modification to the configuration.

[global]
    printcap name=cups
    winbind enum groups=yes
    include=/var/tmp/nginx/smb.netbios.aliases.conf
    encrypt passwords=yes
    min protocol=NT1
    security=user
    local master=no
    realm=*
    syno sync dctime=no
    passdb backend=smbpasswd
    printing=cups
    max protocol=SMB3
    winbind enum users=yes
    load printers=yes
    workgroup=WORKGROUP

Also if this helps:

$ samba -V
Version 4.10.18
mbentley commented 2 years ago

Hmm, doesn't seem like that's the config with time machine enabled. They must have it stored and referenced elsewhere as I doubt it is in the include file mentioned.

ZetaPhoenix commented 2 years ago

Yea, no such file:

$ cat /var/tmp/nginx/smb.netbios.aliases.conf
cat: /var/tmp/nginx/smb.netbios.aliases.conf: No such file or directory
bugsyb commented 2 years ago

@ZetaPhoenix - it might be that this is just a "header" with global settings which is then concatenated into the final Samba config file by a script or something.

Additionally, I've left TM running against a system which does almost nothing else (it spins up one small container 3 times a day - so nothing that would interfere with hourly TM activity).

See below my observations - number of I/O requests, B/s, and I/O time. In short - given the sparse bundle, plus an additional impact at my end since the FS (ext4) sits on cryptfs backed by spinning rust aka HDD, I/O time is high while CPU usage is low (load goes up due to I/O time). The mmcblk device is the firmware storage; the container stores data on it (I know - not the best idea - I wanted to make sure I split the writes). [I/O graphs]

All in all, my take is that it comes down to HDD head-movement time: the sparsebundle is very dispersed, and the FS-level index is potentially scattered across multiple files to which access is slow, i.e. because there are so many files in the folder.

For the moment, the best option would be to use the FS cache, and that should happen by default via the Linux buffer/cache management, but somehow it doesn't seem to be happening here. I doubt TM reads so much random data that not enough of it can be kept in memory. The system in question runs literally only this container, with a total of 4GB RAM, so it should fit easily. The size of an hourly backup is ~300MB as shown by TM (still quite a lot, as nothing was happening on the Mac during those hours).
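
A quick, hedged way to check whether the page cache is actually being exercised during a backup (standard Linux tools, nothing specific to this image):

    # watch the 'cache' column and 'wa' (I/O wait) while a TM backup runs
    vmstat 5
    # one-shot view of memory currently used for buffers/cache
    free -m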

[buffers/cache graph]

Another approach is pushing this onto NVMe or another non-spinning-rust type of disk.

Do any of you have ZFS with a ZIL/SLOG, configured so that frequently accessed data sits on flash and the rest on HDD? It could give some promising results.
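
For reference, a minimal sketch of that layout (pool and device names are hypothetical):

    # add a fast device for sync writes (SLOG) and one as a read cache (L2ARC)
    zpool add tank log /dev/nvme0n1p1
    zpool add tank cache /dev/nvme0n1p2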

btw, in my case I use noatime,nodiratime,data=ordered (though the "ordered" part doesn't help much on an encrypted block device anyway, if I'm not mistaken).
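
In /etc/fstab form that looks something like this (device and mount point are hypothetical):

    /dev/mapper/crypt-tm  /host/timecapsule  ext4  noatime,nodiratime,data=ordered  0  2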

mbentley commented 2 years ago

Mine is ZFS, but without a ZIL/SLOG; it is on NVMe and I have a crap load of RAM (64 GB) allocated to ARC. I have yet to need a ZIL/SLOG, as it would just sit idle in my setup - nothing comes close to heavy reads/writes on my NVMe mirror. Reading several forum threads, it seems like people end up at the conclusion that the bottleneck is something in Time Machine itself, whether it is how macOS deals with priority or some function that is just extremely slow on the macOS side. People don't seem to think it is a configuration or resource issue, but nobody seems to really know.

ZetaPhoenix commented 2 years ago

I'm not sure how FS and media are directly related to the end speed in this case.

The Synology NAS uses ext4 as well and writes to two HDDs in RAID1, whereas the container writes ext4 to a single HDD. The CPU in the x86 box also has a significant clock-speed and RAM advantage over the NAS, so I think this is a config/setup issue.

If we are looking at bandwidth while a backup is happening to the Docker container: [bandwidth screenshot]

I know the HDD in the system can go much faster when connected via USB, so I do not think disk I/O is the limiter either.

bugsyb commented 2 years ago

The reasoning behind my pointing at "head movement time" was the increase in disk I/O time, which reflects how long a request had to wait before it was performed. That wait is directly related to data which could not be written, either due to a throughput limit (not the case here, given the small amounts of data written) or due to the time required to write it, which on spinning rust means head movement, given that CPU utilization is low. This is the only bottleneck I could think of here - a high number of small pieces of data written to different locations on the disk. @ZetaPhoenix, answering your point: "I'm not sure how FS and media are directly related to the end speed in this case."

A piece of Time Machine data travels more or less this path between the write being requested and the data ending up on physical storage (I might be missing some elements due to simplification):

1. Data to be written goes to the mounted sparsebundle (a virtual filesystem).
2. The sparsebundle translates the write into allocations across potentially multiple bands (small pieces of data scattered across multiple files, visible at the FS level on the Mac via the SMB mount).
3. Read/write requests for those small pieces within multiple files go via SMB to the Samba server.
4. The Samba server issues R/W requests to the specific band files. Here might be a catch: if it reads the directory contents every time before reading a file, instead of keeping a handle in memory, that can take a lot of time - try an ls -l, or better a full-attributes ls, on the bands directory, as the Mac might be doing the equivalent. It takes time...
5. Samba runs in a container (not sure if that impacts kernel buffers/cache - hopefully not - I haven't read enough on it).
6. The Docker host performs the R/W on the files through the mounted volume (a small delay, safe to ignore).
7. The R/W hits the specific band file (or maybe even that folder read first, which is slow due to the huge number of files there - there could be a folder-caching element).
8. Ext4 issues the R/W request to the block device (cryptFS).
9. CryptFS translates the R/W request (no significant CPU impact).
10. CryptFS block allocation may be, and most often will be, non-linear.
11. The HDD moves its head to fulfill the R/W request - that takes milliseconds per block, and with a lot of random movement, the wait time before writing even small pieces of data increases.

Ext4 does not necessarily have enough awareness of the data to allocate blocks in a way that avoids fragmentation, and even one layer below, cryptFS - doing its proper job - will leave the data scattered and fragmented anyway. Add the extra layer of a virtual FS (the sparsebundle), whose own data is highly fragmented and scattered across multiple files, and the result is a mess and even lower performance.

There is also a Mac-side setting, often mentioned in TM-related threads, to remove the throttle for the initial backup; the throttle is enabled by default. Apple did this so TM runs in the background with limited impact/visibility of its activity to the user.
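
The command usually cited in those threads is the low-priority I/O throttle sysctl (shown here as commonly reported; re-enable it once the initial backup completes):

    # disable macOS low-priority I/O throttling (set back to 1 afterwards)
    sudo sysctl debug.lowpri_throttle_enabled=0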

This, though, doesn't explain the I/O time increase on the host side, which can only be due to how the data is stored and the time taken to store it on the Samba server side.

I can't run tests at the moment, but given the chance I'd be keen to see: a) I/O time and B/s for an NVMe-based, straight-to-device write, i.e. on some simple system; b) RAM-backed storage - though I don't know how to limit TM to back up only a small folder, say a couple of GB, onto tmpfs in RAM - that would bring I/O time close to 0; c) performance/throughput of an analogous backup to a real Time Capsule.

What might be happening is that TM asks for "sync" writes, hence the time taken. I'm not sure whether the SMB protocol allows it, similar to NFS. But - speculating now - forcing the Samba server to operate in async mode and helping frequently accessed R/W data stay in the Linux kernel's buffers/cache should improve performance significantly.
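
A minimal sketch of the Samba-side options that would test this theory (these are real smb.conf options; whether they are acceptable for backup data is exactly the durability trade-off discussed here):

    [global]
        # ignore client requests to sync writes immediately (data-loss risk on power cut)
        strict sync = no
        # never force a sync after every write
        sync always = no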

All in all, it doesn't look like it's about CPU ticks or how Samba is written (CPU utilization is relatively low), but about how often a blocking R/W operation is performed that then waits for the underlying physical device to confirm the data is written.

One idea for a test could be... SquashFS the TM folder on the Samba host and then mount it with an RW overlay that writes to RAM - that could be pretty interesting. Again, loads of testing could be done to narrow it down, but the biggest problem for me is... I won't be in a position to fix the code if anything is found in Samba - I'm not a Samba contributor and haven't coded in quite some time, so I'd risk introducing issues into the Samba code. The hope is... collectively we can come to some conclusions and then maybe reach out with questions to the Samba developers - maybe someone smarter than me can help then.
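
For what it's worth, a hedged sketch of that PoC (paths are hypothetical; anything written during the test lives in RAM and is lost on unmount):

    # build a read-only image of the TM share
    mksquashfs /host/timecapsule /tmp/timecapsule.squashfs
    mkdir -p /mnt/tm-ro /mnt/tm-ram /mnt/tm
    mount -t squashfs /tmp/timecapsule.squashfs /mnt/tm-ro
    mount -t tmpfs tmpfs /mnt/tm-ram          # upper/work dirs live in RAM
    mkdir -p /mnt/tm-ram/upper /mnt/tm-ram/work
    mount -t overlay overlay -o lowerdir=/mnt/tm-ro,upperdir=/mnt/tm-ram/upper,workdir=/mnt/tm-ram/work /mnt/tm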

ZetaPhoenix commented 2 years ago

I can try and capture the bandwidth from the Mac the next time I back up to the Synology NAS if that helps.

Just as a note, the HDD is only secondary storage and used only by the timemachine container. The x86 host boots off an NVMe SSD, and that FS is where the main containers live. I just pass the HDD as the data mount point.

ZetaPhoenix commented 2 years ago

Here is some traffic backing up to the NAS: [bandwidth screenshot]

bugsyb commented 2 years ago

@ZetaPhoenix, do you have access to disk I/O wait time too? The majority of the delay is most probably due to these small writes to the HDD, hence having the container and the rest of the NAS on NVMe wouldn't help.

My gut feeling is that maybe TM sends "sync" requests via SMB - hence the slowness? By default Samba works in async mode, but the client side can request a sync type of write. Combined with TM being a backup solution, I have a pretty strong feeling this might be the case. A small Samba debugging session would be required to see the type of requests being sent.

Here are some details: https://www.systutorials.com/is-samba-sync-or-async-for-writes/
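
A hedged way to check for that from the server side (assumes strace is available on the host or in the container):

    # watch whether smbd issues fsync/fdatasync while a backup runs
    strace -f -e trace=fsync,fdatasync -p "$(pgrep -o smbd)"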

ZetaPhoenix commented 2 years ago

I'm not sure how to measure that. If there is an easy way in macOS, let me know.

bugsyb commented 2 years ago

@ZetaPhoenix - it would need to be done on the NAS side, because that's where I believe the slowness is. One idea for a test: if you have enough space, push the storage onto NVMe and see the performance then. It should increase significantly, as the bottleneck of HDD head movement would be removed.
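
On the Linux/NAS side, the standard tool for this is iostat from the sysstat package (column names vary slightly between versions):

    # per-device utilization, write wait times, and average request sizes
    iostat -x 5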

Looking at the Samba logs and doing some analysis - here is what they show (a sketch of how counts like these can be extracted follows the list):

  1. During 2-3 TM sessions a large number of files was accessed, some of them many times:

    925 sparsebundle/bands/233,
    927 sparsebundle/bands/1e0,
    1246 sparsebundle/bands/109,
    1308 sparsebundle/bands/1d5e9,
    1510 sparsebundle/bands/1f6,
    1580 sparsebundle/bands/1c,
    1675 sparsebundle/bands/1d5ac,
    2150 sparsebundle/bands/22c,
    3272 sparsebundle/bands/239,
    3440 sparsebundle/bands/236,
    3928 sparsebundle/bands/237,
    4646 sparsebundle/bands/238,

    In total 421 files (1TB limit set, almost full - 2 years' worth of backups).

  2. Looking at the offsets of the accessed files - access is dispersed:

    103 sparsebundle/bands/48, length=8192 offset=1486848 wrote=8192
    103 sparsebundle/bands/88, length=8192 offset=1486848 wrote=8192
    126 sparsebundle/bands/c8, length=8192 offset=1486848 wrote=8192
    127 sparsebundle/bands/28, length=8192 offset=1486848 wrote=8192
    139 sparsebundle/bands/238, length=8192 offset=7229440 wrote=8192
    151 sparsebundle/bands/28, length=8192 offset=1511424 wrote=8192
    174 sparsebundle/bands/1c, length=4096 offset=5636096 wrote=4096
    182 sparsebundle/bands/229, length=8192 offset=5115904 wrote=8192
    218 sparsebundle/bands/1a8, length=8192 offset=1486848 wrote=8192
    234 sparsebundle/bands/188, length=8192 offset=1511424 wrote=8192
    240 sparsebundle/bands/28, length=16384 offset=1486848 wrote=16384
    286 sparsebundle/bands/1e8, length=8192 offset=1486848 wrote=8192
    320 sparsebundle/bands/237, length=8192 offset=856064 wrote=8192
    413 sparsebundle/bands/188, length=8192 offset=1486848 wrote=8192
    470 sparsebundle/bands/19, length=4096 offset=20480 wrote=4096
    847 sparsebundle/bands/1c, length=4096 offset=6729728 wrote=4096
  3. The amounts of data written are relatively small, as suspected earlier - confirmed to be even smaller: literally 4k/8k blocks at various offsets within a file, though surprisingly often at the same offset. This tells us that the same location is written multiple times; by deduction, I'd guess it is sparsebundle FS metadata, especially as the most frequently accessed bands have relatively low band numbers.

  4. There are also some larger portions of data being written:

    sparsebundle/bands/1d, length=999424 offset=2519040 wrote=999424
    sparsebundle/bands/25, length=999424 offset=6017024 wrote=999424
    sparsebundle/bands/22, length=995328 offset=0 wrote=995328
    sparsebundle/bands/1d5d9, length=995328 offset=7393280 wrote=995328
    sparsebundle/bands/1d345, length=995328 offset=6909952 wrote=995328

    Some of the biggest ones were:

    98304
    98304
    98304
    983040
    991232
    995328
    995328
    995328
    999424
    999424
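
As mentioned above, a hedged sketch of how per-band access counts like those in item 1 can be pulled from the smbd logs (assuming the log level is high enough that each write line carries the band path, as in the excerpts):

    # count accesses per band file, least to most frequent
    grep -o 'sparsebundle/bands/[0-9a-f]*' /var/log/samba/log.* | sort | uniq -c | sort -n | tail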

This all confirms the theory of why it is so slow.

The challenge, though, is how to make it faster.

A single TM backup in this case is ~360MB, and the system has multiple GBs of RAM sitting idle - so use that; the risk is acceptable given that backup power is provided, though the write to disk must still happen eventually or data will be corrupted. The question is how to speed it up.

The other problem might be... that the sparsebundle could break easily (maybe - I don't know that), and then all this investigation is for nothing, since at the end of the day it is about being able to recover. That said, the sparsebundle can't be that fragile, as a network comms break can always happen.

So the key is: how to make sure it all operates via RAM and gets written down to the HDD in the background, in a non-blocking (async) fashion.

Ideas? For starters, I'd prefer to avoid crazy overlay setups, which might work for a PoC but would be a no-go for production. Hence the priority is to identify whether there are any Samba or kernel settings to enforce buffer/cache usage and async operation.
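
On the kernel side, the usual knobs are the dirty-page writeback sysctls - a hedged starting point only (values are illustrative, not tuned, and they trade durability for latency):

    # allow more dirty data to accumulate in RAM before writers block
    sysctl -w vm.dirty_ratio=40
    sysctl -w vm.dirty_background_ratio=10
    # let dirty pages sit for up to 60s before forced writeback
    sysctl -w vm.dirty_expire_centisecs=6000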

mbentley commented 10 months ago

Closing due to age, let me know if there is anything I can help with.