pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

on 8 cores CPU fclones seems not to use all cores #158

Closed kapitainsky closed 1 year ago

kapitainsky commented 1 year ago

When trying to find out why yadf is faster than fclones I have noticed much better CPU utilisation in yadf.

macOS 12.5.1, CPU: 2.3 GHz 8-core Intel Core i9, SSD: internal NVMe

And not surprisingly, yadf usually finishes the duplicate search 2x faster than fclones.

Why - no idea. Maybe it scales better with the number of processes? You used a 4-core CPU, the yadf tests used a 6-core one and I used an 8-core one. Also, I use macOS while the two other tests were run on Ubuntu. We all most likely used different SSDs. Too many factors to say for sure why:)

Also, I tested the latest versions. In your tests yadf was v0.13, whereas now it is v1.0.

Actually I have an idea. I looked at htop because yadf is faster in any test I tried... yadf is using all my 8 cores vs fclones using only 4. Which explains why it is about 2x faster.

With fclones I can see that all 8 cores are used during the initial stages - but then only 4 when content hashes are calculated.

Originally posted by @kapitainsky in https://github.com/pkolaczk/fclones/issues/153#issuecomment-1239549184

kapitainsky commented 1 year ago

Not very scientific, but it gives a good feel for what is going on. The same dataset:

yadf v1.0.0 - command to run: yadf

[animated GIF: CPU usage while running yadf]

fclones v0.27.2 - command to run: fclones group .

[animated GIF: CPU usage while running fclones]

These are slightly rough animated GIF captures. To restart both in sync, refresh the browser window - https://github.com/pkolaczk/fclones/issues/158#issuecomment-1239619487 - they do not run for exactly the same time and slowly get out of sync.

It is clear that both programs use all cores initially for grouping (lengths for fclones (yadf does not do it), then prefix and suffix for both, followed by crunching content hashes). What is also interesting is that even in the initial phase yadf uses all available CPU power - fclones takes it easy.

kapitainsky commented 1 year ago

I think I should stop using it - hahah - as I only create issues:)

kapitainsky commented 1 year ago

More cores are becoming the norm - in the Apple world a basic MacBook Air has 10 cores and a Mac Studio, as of today, 20 - and it will obviously only grow in the future, so it is a good idea to make good use of what is available - no matter if it is 2 or 100.

pkolaczk commented 1 year ago

> I think I should stop using it - hahah - as I only create issues:)

It's been great! Keep going!

kapitainsky commented 1 year ago

> I think I should stop using it - hahah - as I only create issues:)

Be careful what you wish for. But I need a good deduplicator, so I will be pestering you for the time being.

kapitainsky commented 1 year ago

fclones has the biggest potential to be close to perfect

kapitainsky commented 1 year ago

Before, I thought that jdupes was the best (it is still great) - fantastic attention to detail, and any person trying to learn C should study its code - but they ignored the macOS world, think that SSDs are only used in some high-end hardware, and believe (there is no other word) that byte-by-byte file comparison is the best - so now the bet is on fclones:) It has a good base design - configurable parallelism with clever heuristics - which makes the defaults easier. And attention to detail similar to jdupes. The rest is about filling gaps.

kapitainsky commented 1 year ago

I think I have good news. I have tried the -t 8 option and then it looks much better.

Overall CPU utilisation is lower than for yadf but now fclones finishes in the same time as yadf (75s in my test). With defaults it takes 150s for fclones to finish - 2 times slower on the spot.

So the problem is with the thread heuristics.

kapitainsky commented 1 year ago

Implementing https://github.com/pkolaczk/fclones/issues/159 would help to diagnose it and would let users make a more informed decision before going into tuning.

IMHO it should show the available cores and the thread values used - either based on the decision made by fclones for the detected device(s) or on user-provided options.
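A minimal Rust sketch of the kind of diagnostic line requested here. The `ThreadSource` enum and both function names are hypothetical and not part of fclones; only `std::thread::available_parallelism` is a real standard-library call, and it reports the parallelism available to the process, which may differ from the physical core count.

```rust
use std::thread;

/// Where the thread count came from (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ThreadSource {
    UserOption,
    DeviceDefault,
}

/// Parallelism reported by the OS, falling back to 1 if it cannot be determined.
pub fn available_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Formats a one-line diagnostic: detected cores plus the thread count in use.
pub fn describe_threads(cores: usize, threads: usize, source: ThreadSource) -> String {
    let origin = match source {
        ThreadSource::UserOption => "user option",
        ThreadSource::DeviceDefault => "device default",
    };
    format!("cores available: {cores}, threads used: {threads} ({origin})")
}
```

With the numbers from this thread, `describe_threads(8, 4, ThreadSource::DeviceDefault)` would make the mismatch between 8 cores and 4 threads visible at a glance.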

kapitainsky commented 1 year ago

I think I have an answer.

  1. The default results above use the unknown-device thread defaults (4,1) - which is consistent with a quick test I did timing fclones group . -t 1, fclones group . -t 4 and fclones group . -t 8

  2. The problem is probably here:

https://github.com/pkolaczk/fclones/blob/7edbc68d136c6accaa08a1b4bb3482e8265acbb8/src/device.rs#L210

as on macOS devices are named like this: /dev/diskXsY e.g., /dev/disk1s5

EDIT: OK - this code is only for Linux, but it should also be used for macOS, as going by names is wrong in the case of this OS. It is "mentally" much closer to Linux than to Windows, but it has its quirks.
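For illustration, a small Rust sketch of what mapping a macOS partition name of the form /dev/diskXsY back to its whole-disk name could look like. `whole_disk_name` is a hypothetical helper, not the regex from device.rs, and it uses only the standard library; names that do not match the pattern are returned unchanged.

```rust
/// Maps "/dev/disk1s5" to "/dev/disk1". Anything that does not look like
/// diskX followed by an sY slice suffix is returned unchanged.
/// Hypothetical helper for illustration; not taken from fclones.
pub fn whole_disk_name(partition: &str) -> &str {
    // Start of the last path component, e.g. "disk1s5".
    let base_start = partition.rfind('/').map(|i| i + 1).unwrap_or(0);
    let rest = match partition[base_start..].strip_prefix("disk") {
        Some(rest) => rest,
        None => return partition,
    };
    // X: one or more ASCII digits for the disk number.
    let disk_digits = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    if disk_digits == 0 {
        return partition;
    }
    // sY: an 's' followed by at least one digit marks a slice suffix.
    match rest[disk_digits..].strip_prefix('s') {
        Some(slice) if slice.starts_with(|c: char| c.is_ascii_digit()) => {
            &partition[..base_start + "disk".len() + disk_digits]
        }
        _ => partition,
    }
}
```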

pkolaczk commented 1 year ago

That regex is a hack to get the device name from the partition name and is called on Linux only. On other systems it uses the raw device name it obtained from enumerating the device list. But maybe some of that device listing logic is wrong or some data is missing there, and then it can't find the device for each file.

kapitainsky commented 1 year ago

Yes, and it does not help that I do not know Rust:) But I think the issue is exactly with the device listing logic. I did some crude debugging and I can see that on macOS the first device is always unknown - then it is correct. Do you use the first device from the list as "/"?

pkolaczk commented 1 year ago

Yes, the first device is the default one that is always unknown - this is just a fallback if no devices are retrieved from the system. But then there is the mount_points vector that should contain the real mount points with a device index, and each should point to some real device.
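A simplified Rust sketch of the layout described here: a device list whose index 0 is the unknown fallback, plus a list of mount points that each carry a device index. The type and method names are assumptions made for illustration and do not mirror the actual fclones structures; the lookup uses a longest-matching-mount-point rule, which is also an assumption.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DiskKind {
    Unknown,
    Ssd,
    Hdd,
}

/// Hypothetical stand-in for the device table discussed above.
pub struct Devices {
    /// Index 0 is always the `Unknown` fallback device.
    kinds: Vec<DiskKind>,
    /// (mount point, index into `kinds`).
    mount_points: Vec<(PathBuf, usize)>,
}

impl Devices {
    pub fn new() -> Self {
        Devices { kinds: vec![DiskKind::Unknown], mount_points: Vec::new() }
    }

    /// Registers a device and the mount point that refers to it.
    pub fn add(&mut self, mount_point: &str, kind: DiskKind) {
        let index = self.kinds.len();
        self.kinds.push(kind);
        self.mount_points.push((PathBuf::from(mount_point), index));
    }

    /// Picks the device whose mount point is the longest prefix of `path`,
    /// falling back to index 0 (unknown) when no mount point matches.
    pub fn kind_for(&self, path: &Path) -> DiskKind {
        let index = self
            .mount_points
            .iter()
            .filter(|(mount, _)| path.starts_with(mount))
            .max_by_key(|(mount, _)| mount.components().count())
            .map(|(_, index)| *index)
            .unwrap_or(0);
        self.kinds[index]
    }
}
```

In this model, any path that is not under a registered mount point silently ends up on the unknown device, which is why the thread defaults for unknown devices can kick in without any visible error.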

kapitainsky commented 1 year ago

OK... so this might be an issue, because on macOS mount points are not always what they seem to be - APFS uses the concept of firmlinks, where one "partition" magically becomes one with another. It only applies to the operating system disk - the internal one. External disks work pretty much the same as on Linux.

kapitainsky commented 1 year ago

I think I could figure out what is going on if fclones had some debug output

kapitainsky commented 1 year ago

e.g., which device it assumed the folder passed to fclones group is on

pkolaczk commented 1 year ago

Yeah, I'll add some

kapitainsky commented 1 year ago

If you could add the list of all enumerated devices, with as much info as fclones is using, that would help. I looked at the sysinfo crate and for macOS they do a bit of hacking - which might lead to fclones getting funny info.

kapitainsky commented 1 year ago

An APFS macOS disk has multiple volumes/"partitions".

The data of the /Users folder (and of any other user-writable folders) is stored in a "partition" called "Macintosh HD - Data", mounted at /System/Volumes/Data.

This is the mount point and partition name that fclones discovers via sysinfo and correctly marks as SSD.

But from the user's perspective (and from fclones' perspective) the Users folder has the path /Users/ - so for fclones any data there is on an unknown device.

This is the macOS file system way - multiple "partitions" are fused together via firmlinks into one filesystem.

kapitainsky commented 1 year ago

Later, when I have a moment, I will try to think about what the safest way to tackle it would be. Especially since fclones can also be used on older Macs, where things are different.

kapitainsky commented 1 year ago

I think I got to the bottom of it. Testing my first Rust code now:) Will do a PR tomorrow.

kapitainsky commented 1 year ago

Here are the results of my debugging.

DiskDevices structure on macOS contains the following:

device_name = "VM"
file_system = "apfs"
mount_point = "/System/Volumes/VM"
type = SSD

device_name = "Preboot"
file_system = "apfs"
mount_point = "/System/Volumes/Preboot"
type = SSD

device_name = "Update"
file_system = "apfs"
mount_point = "/System/Volumes/Update"
type = SSD

device_name = "Macintosh HD - Data"
file_system = "apfs"
mount_point = "/System/Volumes/Data"
type = SSD

device_name = "extSamsung-SSD"
file_system = "apfs"
mount_point = "/Volumes/extSamsung-SSD"
type = SSD

device_name = "extSamsung-SSD-Temp"
file_system = "apfs"
mount_point = "/Volumes/extSamsung-SSD-Temp"
type = SSD

device_name = "Lacie01-Data"
file_system = "apfs"
mount_point = "/Volumes/Lacie01-Data"
type = HDD

device_name = "Lacie01-TM01"
file_system = "apfs"
mount_point = "/Volumes/Lacie01-TM01"
type = HDD

device_name = "Untitled 1"
file_system = "exfat"
mount_point = "/Volumes/USB stick"
type = HDD

In this case it is a mix of system and external devices.

We do not have to worry about VM, Preboot or Update - they are only used by the system. Nor about the external devices - I only checked whether their type is recognised correctly.

The problem is with:

device_name = "Macintosh HD - Data"
file_system = "apfs"
mount_point = "/System/Volumes/Data"
type = SSD

This is the device where all the user's data is... however, users never see or use this path, and neither does fclones. APFS uses firmlinks (in Apple's own words: "Bi-directional wormhole in path traversal. Firmlinks are used on the system volume to point to the user data on the data volume."). From the user's perspective the data device folders are part of the root filesystem. - https://www.swiftforensics.com/2019/10/macos-1015-volumes-firmlink-magic.html

Very APFS-specific - but it is what makes fclones unable to recognise the correct device. E.g. when trying to deduplicate the folder /Users/kptsky/FilesForDedup, there is no such device path in DiskDevices and as a result the 'unknown' device is used.

Solution? For APFS we have to help and point 'Macintosh HD - Data' to the root folder.

Sending a PR with my proposed solution.
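A rough Rust sketch of that idea, assuming the fix amounts to treating the APFS data volume as if it were mounted at the root folder. This is not the code from the PR; `effective_mount_point` and the constant are hypothetical names, and the mount point string is the one from the debug listing above.

```rust
use std::path::PathBuf;

/// Mount point of the APFS data volume, as reported in the debug listing.
const APFS_DATA_MOUNT: &str = "/System/Volumes/Data";

/// Returns the mount point to use for path matching. The APFS data volume
/// is mapped to "/" because firmlinks make its folders appear under the
/// root filesystem; every other mount point is kept as reported.
/// Hypothetical helper for illustration; not the actual fclones fix.
pub fn effective_mount_point(file_system: &str, mount_point: &str) -> PathBuf {
    if file_system.eq_ignore_ascii_case("apfs") && mount_point == APFS_DATA_MOUNT {
        PathBuf::from("/")
    } else {
        PathBuf::from(mount_point)
    }
}
```

After this mapping, a path such as /Users/kptsky/FilesForDedup has the remapped mount point as a prefix, so a prefix-based lookup can resolve it to the SSD entry instead of the unknown device.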

kapitainsky commented 1 year ago

Thank you for merging it. Now it flies on macOS with defaults.

pkolaczk commented 1 year ago

Thank you for all the hard work on this. Awesome contribution!

pkolaczk commented 1 year ago

After thinking about this issue more, I came to the conclusion that the approach taken by fclones of identifying disks by file paths is fundamentally broken. I believe a much better way would be to use the device identifiers we already have in the FileId / FileInfo and use them to find the actual device, instead of messing with mount points. That information should be far more reliable, as it comes from the OS directly. Unfortunately, I haven't found a good, portable way to map those ids to the device names returned by sysinfo. I opened a feature request in sysinfo; let's see what they respond.
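To illustrate what "device identifiers from the OS" means on Unix-like systems, here is a minimal sketch that reads the device id (st_dev) from file metadata using only the standard library. The function names are hypothetical, the code is Unix-only, and it does not solve the open problem mentioned above of mapping such ids to the device names returned by sysinfo.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Returns the OS-level device id (st_dev) of the filesystem holding `path`.
/// Unix-only; hypothetical helper, not the fclones FileId implementation.
pub fn device_id(path: &Path) -> io::Result<u64> {
    Ok(fs::metadata(path)?.dev())
}

/// True if both paths report the same device id, regardless of which
/// mount point or firmlink was used to reach them.
pub fn same_device(a: &Path, b: &Path) -> io::Result<bool> {
    Ok(device_id(a)? == device_id(b)?)
}
```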

kapitainsky commented 1 year ago

As it is now, it is good enough IMHO. I have looked at the sysinfo source code for the macOS part and it is full of hacks as well. Cross-platform solutions are never easy:)