cbird is a command-line program for finding duplicate images and videos that cannot be found by general methods such as file hashing. Content-Based Image Recognition (CBIR) is used, which examines the pixels of files to get comparable features and "perceptual" hash codes.
The main features are:
Compile it yourself using my detailed notes.
Add execute permission and run
chmod +x cbird-0.7.0-x86_64.AppImage
./cbird-0.7.0-x86_64.AppImage -install # optional install helper
cbird [...]
apt install libfuse2
apt install libopengl0
yum install libglvnd-opengl
cbird -platform wayland-egl [...]
cbird/cbird-mac [...]
Optional: create shortcuts for cbird
Set-ExecutionPolicy RemoteSigned
New-Item -Type File $PROFILE -Force
OpenWith $PROFILE
Set-Alias -Name cbird -Value C:\cbird\cbird.exe
function cbird-pics {cbird -use $HOME\Pictures $args}
`cbird -help` is very detailed.
Index the files in `<path>`, caching into `<path>/_index`
cbird -use <path> -update
cbird -update
cbird -dups -show
cbird -similar -show
cbird -p.dht 1 -similar -show
This is lacking documentation at the moment. But for now...
`-show` if there is a selection or results.

Common formats are supported, as well as many obscure formats. The available formats will ultimately vary based on the configuration of Qt and FFmpeg.
`cbird -about` lists the image and video extensions. Note that video extensions are not checked against FFmpeg at runtime, so they could be unavailable.
Additionally, zip files are supported for images.
To get the most formats you will need to compile FFmpeg and Qt with the necessary options. Additional image formats are also available with kimageformats.
Links are ignored by default. To follow links, use the index option `-i.links 1`.
If the search path contains links, they are only considered when scanning for changes (`-update`); otherwise there is no special treatment. For example, deleting a link is the same as any other deletion operation.
Duplicate inodes are not followed by default. If there are duplicate inodes in the tree, the first inode in breadth-first traversal is indexed. To follow all inodes, for example to find duplicate hard links, use `-i.dups 1`.
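As a rough illustration of "the first inode in breadth-first traversal", here is a minimal Python sketch. It is not cbird's code: the function name `walk_unique_inodes`, the sorting of entries by name, and the use of `(st_dev, st_ino)` as the identity key are all assumptions made for the example.

```python
import os
from collections import deque


def walk_unique_inodes(root):
    """Yield file paths breadth-first, skipping inodes that were already seen.

    Hypothetical helper for illustration only; cbird's real traversal
    order within a directory is not documented here.
    """
    seen = set()           # (device, inode) pairs already yielded
    queue = deque([root])  # directories waiting to be scanned
    while queue:
        directory = queue.popleft()
        with os.scandir(directory) as it:
            entries = sorted(it, key=lambda e: e.name)  # assumed ordering
        for entry in entries:
            if entry.is_symlink():
                continue  # links are ignored by default
            if entry.is_dir(follow_symlinks=False):
                queue.append(entry.path)  # visited after this level
            elif entry.is_file(follow_symlinks=False):
                st = os.lstat(entry.path)
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue  # hard link to a file that was already yielded
                seen.add(key)
                yield entry.path
```

With a hard link at the top level and its twin in a subdirectory, only the top-level path is yielded, because the top level is scanned first.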
The index stores paths relative to the indexed/root path, which keeps the index stable if the parent directory changes. However, if a path contains links, or is a link itself, it is stored as-is, which may be less stable than storing the link target. To store the resolved links instead, use `-i.resolve 1`. This is only possible if the link target is a child of the index root.
Note that cbird does not prevent broken links from occurring; the link check is temporary during the index update.
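The rule that links can only be resolved when the target is a child of the index root can be sketched in Python. This is an assumption-laden illustration, not cbird's implementation: `stored_path` and its `resolve` parameter are hypothetical names standing in for the behavior described above.

```python
import os


def stored_path(path, index_root, resolve=False):
    """Return the relative path that would be stored for `path`.

    Hypothetical helper: with resolve=True the link target is stored,
    but only if it still lies under the index root.
    """
    if resolve:
        root = os.path.realpath(index_root)
        target = os.path.realpath(path)
        # The resolved target must be a child of the index root.
        if os.path.commonpath([root, target]) == root:
            return os.path.relpath(target, root)
    # Otherwise the path is kept as-is, relative to the root.
    return os.path.relpath(os.path.abspath(path), os.path.abspath(index_root))
```

A link that points outside the root falls through to the as-is branch, since its target has no path relative to the index.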
The "weed" feature allows fast deletion of deleted files that reappear in the future. A weed record is a pair of file hashes, one is the weed/deleted file, the other is the original/retained file. When the weed shows up again, it can be deleted without inspection (`-nuke-weeds`).
`-p.mm 1` or `-p.eg 1` to force pairs

There is nothing to prevent deletion of the original/retained file, so the weed record can become invalidated. If the original is no longer present, the association can be unset with the "Forget Weed" command.
cbird -weeds -show # show all weeds
cbird -nuke-weeds # delete all weeds
cbird -similar -with isWeed true # isolate weeds in search results
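To make the idea of a weed record concrete, here is a small Python sketch of a table that maps the hash of a deleted file to the hash of the file that was kept. The class `WeedList`, its method names, and the choice of MD5 as the file hash are assumptions for illustration; cbird's actual storage is not described here.

```python
import hashlib


def file_hash(data: bytes) -> str:
    # Assumption: any stable content hash works for the sketch; MD5 is used here.
    return hashlib.md5(data).hexdigest()


class WeedList:
    """Hypothetical in-memory table of weed records."""

    def __init__(self):
        self._weeds = {}  # weed hash -> original hash

    def add(self, weed_data: bytes, original_data: bytes) -> None:
        # Record the pair: the deleted file and the file that was retained.
        self._weeds[file_hash(weed_data)] = file_hash(original_data)

    def is_weed(self, data: bytes) -> bool:
        # A file whose hash was recorded as a weed can be deleted on sight.
        return file_hash(data) in self._weeds

    def forget(self, weed_data: bytes) -> None:
        # Drop the association, e.g. when the original no longer exists.
        self._weeds.pop(file_hash(weed_data), None)
```

Because only hashes are stored, a reappearing weed is recognized by content alone, wherever it shows up in the tree.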
There are a few environment variables for power users.
- `CBIRD_SETTINGS_FILE` overrides the path to the settings file (`cbird -about` shows the default)
- `CBIRD_TRASH_DIR` overrides the path to trash folder, do not use the system trash bin
- `CBIRD_CONSOLE_WIDTH` set character width of terminal console (default auto-detect)
- `CBIRD_COLOR_CONSOLE` use colored output even if console says no (default auto-detect)
- `CBIRD_FORCE_COLORS` use colored output even if console is not detected
- `CBIRD_LOG_TIMESTAMP` add time delta to log messages
- `CBIRD_NO_BUNDLED_PROGS` do not use bundled programs like ffmpeg in the appimage/binary distribution
- `QT_IMAGE_ALLOC_LIMIT_MB` maximum memory allocation for image files (default 256)
- `QT_SCALE_FACTOR` global scale factor for UI
- `TMPDIR` override default directory for temporary files; used for opening zip file contents
- `CBIRD_MAXIMIZE_HACK` set if window manager/qt is not restoring maximized windows (default auto-detect)

Check the development notes for known bugs and feature ideas.
Report bugs or request features on GitHub.
There are several algorithms; some are better than others depending on the situation.
- `-p.alg dct`: Uses one 64-bit hash per image, similar to pHash. Very fast, good for rescaled images and lightly cropped images.
- `-p.alg fdct`: Uses DCT hashes centered on scale/rotation invariant features, up to 400 per image. Good for heavily cropped images, much faster than ORB.
- `-p.alg orb`: Uses 256-bit scale/rotation invariant feature descriptors, up to 400 per image. Good for rotated and cropped images, but slow.
- `-p.alg color`: Uses histogram of up to 32 colors (256-byte) per image. Sometimes works when all else fails. This is the only algorithm that finds reflected images; the others require `-p.refl` and must rehash the reflected image (very slow).
- `-p.alg video`: Uses DCT hashes of video frames. Frames are preprocessed to remove letterboxing. Can also find video thumbnails in the source video since they have the same hash type.
- `-p.tm 1`: Filters results with a high resolution secondary matcher that finds the exact overlap of an image pair. This is most useful to drop poor matches from fdct and orb. Since it requires decompressing the source/destination image it is extremely slow. It can help to reduce the maximum number of matches per image with `-p.mm #`.
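One likely reason a color histogram can find reflected images is that it ignores where pixels are and only counts them. The toy Python sketch below shows that property on a tiny image of integer "colors". It is only an illustration of histogram invariance under mirroring, and the helper names are hypothetical; it says nothing about how cbird builds its 32-color histograms.

```python
from collections import Counter


def color_histogram(image):
    """Count how often each pixel value occurs; positions are discarded."""
    return Counter(pixel for row in image for pixel in row)


def mirror(image):
    """Reflect the image horizontally by reversing every row."""
    return [list(reversed(row)) for row in image]


image = [
    [1, 2, 3],
    [4, 5, 6],
]

# Mirroring moves pixels around but does not change the counts.
same = color_histogram(image) == color_histogram(mirror(image))
```

A position-sensitive hash would differ between the two images, which is consistent with the other algorithms needing to rehash the reflected image.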
Indexing happens when `-update` is used. It can take a while the first time; however, subsequent updates only consider changes.
Unused algorithms can be disabled to speed up indexing. If you have large images, you may as well enable all algorithms because image decompression dominates the process.
Arguments | Note | Time (seconds) |
---|---|---|
-update | all enabled | 46 |
-i.algos 0 -update | md5 only | 2 |
-i.algos 1 -update | +dct | 41 |
-i.algos 3 -update | +dct features | 44 |
-i.algos 7 -update | +orb features | 44 |
-i.algos 15 -update | +color hist | 46 |
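The values 0, 1, 3, 7, 15 in the table suggest that `-i.algos` is a bitmask where each algorithm adds one bit. The Python sketch below decodes a mask under that assumption. The bit assignments are inferred from the table only, and the table does not show which bit (if any) controls video hashing, so treat the mapping as a guess rather than a reference.

```python
# Assumed bit layout, inferred from the progression 1, 3, 7, 15 in the table.
ALGO_BITS = {
    1: "dct",
    2: "dct features",
    4: "orb features",
    8: "color histogram",
}


def enabled_algos(mask):
    """List the algorithms a given mask would enable, under the assumed layout."""
    names = ["md5"]  # the table shows md5 even with mask 0
    names += [name for bit, name in ALGO_BITS.items() if mask & bit]
    return names


for mask in (0, 1, 3, 7, 15):
    print(mask, enabled_algos(mask))
```

Read this way, `-i.algos 3` is "dct plus dct features", which matches the "+dct features" note in the table.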
Search speed varies with algorithm. The OpenCV search tree for ORB is quite slow compared to others. It is better suited for `-similar-to` to search a smaller subset suspected to have duplicates.
Arguments | Note | Time (milliseconds) |
---|---|---|
-similar | dct | 54 |
-p.alg fdct -similar | dct features | 200 |
-p.alg orb -similar | orb features | 9000 |
-p.alg color -similar | histograms | 450 |
Indexing large sets of smaller images benefits from disabling algorithms.
Arguments | Note | Rate (Img/s) | Time (min:sec) |
---|---|---|---|
-i.algos 0 -update | md5 only | 861 | 9:41 |
-i.algos 1 -update | +dct | 683 | 12:11 |
-i.algos 3 -update | +dct features | 377 | 22:04 |
-i.algos 7 -update | +orb features | 348 | 23:56 |
-i.algos 15 -update | +colors | 227 | 36:39 |
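The size of the test set is not stated, but it can be estimated from the table by multiplying rate by elapsed time. The Python sketch below does that arithmetic, assuming the Time column reads as minutes:seconds. Each row works out to roughly 500,000 images, which is an estimate derived from the table, not a documented figure.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# (rate in images per second, elapsed time) copied from the table.
rows = {
    "md5 only": (861, "9:41"),
    "+dct": (683, "12:11"),
    "+dct features": (377, "22:04"),
    "+orb features": (348, "23:56"),
    "+colors": (227, "36:39"),
}

# Estimated image count per row: rate multiplied by elapsed seconds.
estimates = {note: rate * to_seconds(t) for note, (rate, t) in rows.items()}

for note, count in estimates.items():
    print(f"{note}: about {count} images")
```

The estimates agree to within about 0.3 percent across rows, which supports reading the column as minutes:seconds.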
For N^2 search (`-similar`) only DCT hash is normally practical.
Arguments | Note | Time (s) |
---|---|---|
-p.dht 1 -similar | dct, threshold 1 | 5.5 |
-p.dht 2 -similar | dct, threshold 2 | 5.6 |
-p.dht 3 -similar | dct, threshold 3 | 5.9 |
-p.dht 4 -similar | dct, threshold 4 | 7.1 |
-p.dht 5 -similar | dct, threshold 5 | 8.9 |
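For readers unfamiliar with hash thresholds, the Python sketch below shows how two 64-bit hashes are commonly compared in pHash-style matching: count the differing bits (Hamming distance) and accept the pair if the count is within a threshold. Whether `-p.dht` is exactly this distance, and whether the comparison is inclusive, are assumptions; the function names are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def hamming64(a, b):
    """Number of bit positions where two 64-bit hashes differ."""
    return bin((a ^ b) & MASK64).count("1")


def is_match(a, b, threshold):
    # Assumption: a pair matches when the distance is at most the threshold.
    return hamming64(a, b) <= threshold


h1 = 0xFFFF0000FFFF0000
h2 = 0xFFFF0000FFFF0001  # differs from h1 in one bit
h3 = 0xFFFF0000FFFF000F  # differs from h1 in four bits

print(hamming64(h1, h2), is_match(h1, h2, 1))
print(hamming64(h1, h3), is_match(h1, h3, 1))
```

A larger threshold admits more candidate pairs, which is consistent with the search time growing as the threshold increases in the table.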
For K*N (K needle images, N haystack images) the slower algorithms can be practical even for large datasets. For a quick test we can select and search for the first 10 items:
cbird -p.alg fdct -select-type i -head 10 -similar-to @
Arguments | Note | Time (s) |
---|---|---|
-p.alg dct -p.dht 2 | dct, threshold 2 | 1.3 |
-p.alg fdct -p.dht 7 | dct-features, threshold 7 | 1.5 |
-p.alg orb | orb-features | 84.4[^1] |
-p.alg color | colors | dnf[^2] |
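The difference between N^2 and K*N search can be illustrated by counting comparisons for a naive brute-force search. The Python sketch below does this for a hypothetical set of 1,000 images and 10 needles; the numbers are made up for the example, and real searches using a search tree would do far fewer comparisons than this upper bound.

```python
def pairwise_comparisons(n):
    """Brute-force comparisons to check every image against every other one."""
    return n * (n - 1) // 2


def needle_comparisons(k, n):
    """Brute-force comparisons to check k needles against n haystack images."""
    return k * n


n_images = 1000  # hypothetical haystack size
k_needles = 10   # hypothetical number of needles

print("all pairs:", pairwise_comparisons(n_images))
print("needles only:", needle_comparisons(k_needles, n_images))
```

The needle search grows linearly with the haystack, which is why a slower algorithm can still be usable when K is small.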
- `-p.mt` which produces a result for every needle until threshold is exceeded
- `-i.links`, `-i.dups`, `-i.resolve` for other cases.
- `-group-by`
[^1]: OpenCV search tree only partially cached on disk, slow to start
[^2]: Color search lacks a search tree, not suitable for large sets