Closed - kapitainsky closed this 1 year ago
Not very scientific, but it gives a good feel for what is going on. The same dataset:
- yadf v1.0.0 - command: `yadf`
- fclones v0.27.2 - command: `fclones group .`
It is a slightly dirty animated GIF capture - to restart both in sync, refresh the browser window - https://github.com/pkolaczk/fclones/issues/158#issuecomment-1239619487 - they did not start at exactly the same time and slowly drift out of sync.
It is clear that both programs use all cores initially for grouping (lengths for fclones (yadf skips that step), then prefix and suffix for both, followed by crunching content hashes). What is also interesting is that even in the initial phase yadf uses all available CPU power, while fclones takes it easy.
I think I should stop using it - hahah - as I only create issues:)
Many-core machines are becoming the norm - in the Apple world a basic MacBook Air has 10 cores and a Mac Studio, as of today, 20 - and this will obviously only grow in the future, so it is a good idea to use what is available well, no matter whether it is 2 cores or 100.
> I think I should stop using it - hahah - as I only create issues:)
It's been great! Keep going!
> I think I should stop using it - hahah - as I only create issues:)
Be careful what you wish for. But I need a good deduplicator, so I will be pestering you for the time being.
fclones has the biggest potential to be close to perfect
Before, I thought jdupes was the best (it is still great) - fantastic attention to detail, and any person trying to learn C should study its code - but they ignored the macOS world, think that SSDs are only used in some high-end hardware, and believe (there is no other word) that byte-by-byte file comparison is the best - so now my bet is on fclones:) It has a good base design - configurable parallelism with clever heuristics - which makes the defaults easier. And attention to detail similar to jdupes. The rest is about filling gaps.
Good news, I think. I have tried the `-t 8` option and it looks much better.
Overall CPU utilisation is lower than for yadf, but now fclones finishes in the same time as yadf (75 s in my test). With defaults it takes fclones 150 s to finish - 2x slower on the spot.
So the problem is with the thread heuristics.
Implementing https://github.com/pkolaczk/fclones/issues/159 would help diagnose it, and would let users make a more informed decision before going into tuning.
IMHO it should show the number of available cores and the thread counts used - either based on the decision made by fclones for the detected device(s), or on user-provided options.
I think I have an answer.
The default results above use the unknown-device thread defaults (4,1) - which is consistent with a quick test I did timing `fclones group . -t 1`, `fclones group . -t 4` and `fclones group . -t 8`.
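The (4,1) fallback mentioned above can be illustrated with a hypothetical sketch - the numbers come from this thread, but the enum, function name, and the SSD/HDD values are assumptions for illustration, not the real fclones code:

```rust
// Illustrative sketch (not the actual fclones heuristics): fclones picks
// thread-pool sizes per detected device type. When the device cannot be
// identified, it falls back to conservative defaults such as (4, 1), which
// on an 8-core machine with a fast SSD leaves half the cores idle.
#[derive(PartialEq, Debug)]
enum DiskType {
    Ssd,
    Hdd,
    Unknown,
}

/// Returns a hypothetical pair of thread-pool sizes for a device type.
fn thread_defaults(disk: &DiskType, cores: usize) -> (usize, usize) {
    match disk {
        DiskType::Ssd => (cores, cores), // SSDs handle parallel access well
        DiskType::Hdd => (1, 1),         // HDDs prefer low parallelism
        DiskType::Unknown => (4, 1),     // conservative fallback from this thread
    }
}

fn main() {
    // An unidentified device gets the conservative (4, 1) fallback:
    assert_eq!(thread_defaults(&DiskType::Unknown, 8), (4, 1));
    // A recognised SSD would use all available cores:
    assert_eq!(thread_defaults(&DiskType::Ssd, 8), (8, 8));
}
```

This is why a misdetected device type (as on macOS here) can silently halve throughput even though all hardware is capable.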
The problem is probably here:
https://github.com/pkolaczk/fclones/blob/7edbc68d136c6accaa08a1b4bb3482e8265acbb8/src/device.rs#L210
as on macOS devices are named like this: `/dev/diskXsY`, e.g. `/dev/disk1s5`.
EDIT: ok - this code is only for Linux, but it should also be applied to macOS, as going by names is wrong in the case of this OS. It is "mentally" much closer to Linux than to Windows, but has its quirks.
That regex is a hack to get the device name from the partition name and is called on Linux only. On other systems it uses the raw device name it obtained from enumerating the device list. But maybe some of that device listing logic is wrong or some data is missing there, and then it can't find the device for each file.
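A hypothetical illustration of why such a name-based rule misfires on macOS - the stripping rule below is an assumption for demonstration, not the actual fclones regex:

```rust
/// Hypothetical sketch (not the actual fclones code): a Linux-style rule that
/// derives the parent block device by stripping the trailing partition digits.
fn parent_device(partition: &str) -> String {
    partition
        .trim_end_matches(|c: char| c.is_ascii_digit())
        .to_string()
}

fn main() {
    // Works for typical Linux partition names:
    assert_eq!(parent_device("/dev/sda1"), "/dev/sda");
    // But macOS partitions look like /dev/diskXsY, so the same rule yields
    // "/dev/disk1s" instead of the whole disk "/dev/disk1":
    assert_eq!(parent_device("/dev/disk1s5"), "/dev/disk1s");
}
```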
Yes - it does not help that I do not know Rust:) But I think the issue is exactly in the device-listing logic. I did some crude debugging and I can see that on macOS the first device is always unknown - the rest are correct. Do you use the first device from the list as "/"?
Yes, the first device is the default one that is always unknown - this is just a fallback if no devices are retrieved from the system. But then there is the `mount_points` vector, which should contain the real mount points with device indexes and should point to some real device.
ok... so this might be the issue, because on macOS mount points are not always what they seem to be - APFS uses the concept of firmlinks, where one "partition" magically becomes one with another. It only applies to the operating-system disk - the internal one. External disks work pretty much the same as on Linux.
I think I could figure out what is going on if fclones had some debug output - e.g. which device it assumed the folder passed to `fclones group` is on.
Yeah, I'll add some
If you could add the full list of enumerated devices, with as much info as fclones is using, that would help. I looked at the sysinfo crate, and for macOS they do a bit of hacking - which might lead to fclones getting funny info.
An APFS macOS disk has multiple volumes/"partitions":
- data for /Users (and any other user-writable folders) is stored in a "partition" called "Macintosh HD - Data", mounted at /System/Volumes/Data
- this is the mount point and partition name that fclones discovers via sysinfo and correctly marks as SSD
- but from the user's perspective (and fclones') the Users folder has the path /Users/ - so for fclones, any data there is on an unknown device
- this is the macOS filesystem way - multiple "partitions" are fused together via firmlinks into one filesystem
Later, when I have a moment, I will try to think what would be the safest way to tackle it - especially since fclones can also be used on older Macs where things are different.
I think I got to the bottom of it. Testing my first Rust code now:) Will do a PR tomorrow.
Here are the results of my debugging. The `DiskDevices` structure on macOS contains the following:
```
device_name              file_system  mount_point                    type
"VM"                     apfs         /System/Volumes/VM             SSD
"Preboot"                apfs         /System/Volumes/Preboot        SSD
"Update"                 apfs         /System/Volumes/Update         SSD
"Macintosh HD - Data"    apfs         /System/Volumes/Data           SSD
"extSamsung-SSD"         apfs         /Volumes/extSamsung-SSD        SSD
"extSamsung-SSD-Temp"    apfs         /Volumes/extSamsung-SSD-Temp   SSD
"Lacie01-Data"           apfs         /Volumes/Lacie01-Data          HDD
"Lacie01-TM01"           apfs         /Volumes/Lacie01-TM01          HDD
"Untitled 1"             exfat        /Volumes/USB stick             HDD
```
In this case, a mix of system and external devices.
We do not have to worry about VM, Preboot or Update - they are only used by the system. Nor about the external devices - I only listed them to check whether their type is recognised correctly.
The problem is with:
```
device_name = "Macintosh HD - Data"
file_system = "apfs"
mount_point = "/System/Volumes/Data"
type = SSD
```
This is the device where all the user data lives... however users never see nor use this path, and neither does fclones. APFS uses firmlinks (in Apple's own words: "Bi-directional wormhole in path traversal. Firmlinks are used on the system volume to point to the user data on the data volume."). From the user's perspective, the data volume's folders are part of the root filesystem. - https://www.swiftforensics.com/2019/10/macos-1015-volumes-firmlink-magic.html
Very APFS-specific - but it is what makes fclones unable to recognise the correct device. When trying to deduplicate, e.g., the folder /Users/kptsky/FilesForDedup, there is no matching device path in `DiskDevices`, and as a result the 'unknown' device is used.
Solution? For APFS we have to help it and point 'Macintosh HD - Data' to the root folder.
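The fix can be sketched roughly like this - a simplified illustration where the struct and function names are made up, not the actual PR code:

```rust
/// Simplified sketch of the idea behind the fix (not the actual PR code):
/// on macOS, remap the APFS system data volume's mount point to "/" so that
/// paths like /Users/... resolve to the real SSD device instead of "unknown".
struct MountedDevice {
    name: String,
    mount_point: String,
}

fn remap_apfs_data_volume(devices: &mut [MountedDevice]) {
    for d in devices.iter_mut() {
        // The data volume holds /Users and other user-writable folders,
        // but is mounted at /System/Volumes/Data behind firmlinks.
        if d.mount_point == "/System/Volumes/Data" {
            d.mount_point = "/".to_string();
        }
    }
}

fn main() {
    let mut devices = vec![MountedDevice {
        name: "Macintosh HD - Data".to_string(),
        mount_point: "/System/Volumes/Data".to_string(),
    }];
    remap_apfs_data_volume(&mut devices);
    // After remapping, a lookup for /Users/... matches this device:
    assert_eq!(devices[0].mount_point, "/");
    assert_eq!(devices[0].name, "Macintosh HD - Data");
}
```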
Sending a PR with my proposed solution.
Thank you for merging it. Now it flies on macOS with defaults.
Thank you for all the hard work on this. Awesome contribution!
After thinking about this issue more, I came to the conclusion that the approach taken by fclones of identifying disks by file paths is fundamentally broken. I believe a much better way would be to use the device identifiers we already have in `FileId` / `FileInfo` and use them to find the actual device, instead of messing with mount points. That information should be far more reliable, as it comes directly from the OS. Unfortunately, I haven't found a good, portable way to map those ids to the device names returned by sysinfo. I opened a feature request in sysinfo; let's see what they respond.
As it is now, it is good enough IMHO. I have looked at the sysinfo source code for the macOS part and it is full of hacks as well. Cross-platform solutions are never easy:)
When trying to find out why yadf is faster than fclones, I noticed much better CPU utilisation in yadf.
macOS 12.5.1, CPU: 2.3 GHz 8-core Intel Core i9, SSD: internal NVMe
and, not surprisingly, yadf usually finishes the duplicate search 2x faster than fclones.
Actually, I have an idea. I looked at htop because yadf is faster in every test I tried... yadf is using all my 8 cores vs fclones only using 4, which explains why it is about 2x faster.
With fclones I can see that all 8 cores are used during the initial stages - but then only 4 when content hashes are calculated.
Originally posted by @kapitainsky in https://github.com/pkolaczk/fclones/issues/153#issuecomment-1239549184