hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/

Video deduplication support #160

Open HASJ opened 8 years ago

AeliusSaionji commented 8 years ago

I think that's what he does for gif already, so I'd say yes.

imtbl commented 4 years ago

Hey, issues here have been reenabled and will be used for bug reports/feature requests in the future.

We are currently clearing out all the existing issues from 2017 and prior. If your bug report or feature request is still relevant, please create a new issue (or use one of the other channels; see https://hydrusnetwork.github.io/hydrus/).

bbappserver commented 4 years ago

@CuddleBear92 Hydrus can't pdiff image sequences, this is still an issue.

imtbl commented 4 years ago

@bbappserver Reopened.

pozieuto commented 4 years ago

I think there is a way that exact duplicates could be detected very easily: hashing the raw audio and video bitstreams. This would detect files with identical audio and video streams that are just in different containers. It would also detect files that share at least one audio or video stream while still having other streams that differ or are absent. (I have had cases of this, usually when a video had audio added to it later.)
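FFmpeg can compute such stream-level hashes without a full decode; here's a minimal sketch, assuming FFmpeg is on PATH and that stream-copying into its hash muxer behaves as expected:

import subprocess

def stream_hash(path, stream_spec):
    # Stream-copy a single stream (e.g. '0:v:0' or '0:a:0') into FFmpeg's
    # hash muxer, so the compressed bitstream is hashed without re-encoding.
    out = subprocess.check_output([
        'ffmpeg', '-v', 'quiet', '-i', path,
        '-map', stream_spec, '-c', 'copy',
        '-f', 'hash', '-hash', 'sha256', '-'])
    return out.strip().decode()

# The same video stream remuxed into a different container should match:
# stream_hash('clip.mp4', '0:v:0') == stream_hash('clip.mkv', '0:v:0')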

CuddleBear92 commented 4 years ago

Well, the current HyDev plan, IIRC, is to pull "interesting frames" from the video and compare them. I'm not sure exactly what that means, but it's not just i-frames.

Pulling i-frames is prob also a decent option.

Going for video and frames first makes the most sense as not everything has audio.

imtbl commented 4 years ago

I think there is a way that exact duplicates could be detected very easily: hashing the raw audio and video bitstreams.

As you said, this would only work in a very limited way for exact duplicates; but we also want to be able to detect if, e.g., a 10 second GIF is part of a 3 minute music video encoded in H.264.

Implementing a generalized solution that should work for animated image sequences of any type and length (like the one HyDev is already planning, as Cuddle mentioned) imo makes the most sense here.

See also https://link.springer.com/article/10.1186/s13640-019-0442-7 (similar concept to what HyDev is planning, I think; though in a distributed approach).

ShadowJonathan commented 4 years ago

Well, the current HyDev plan, IIRC, is to pull "interesting frames" from the video and compare them.

Keyframes?

imtbl commented 4 years ago

@ShadowJonathan Partially, yes. But not all interesting frames might actually be keyframes in the video.

CuddleBear92 commented 4 years ago

I'm gonna comment on the word "keyframe", because keyframes are only used in production itself. You're probably thinking of i-frames in the encoding, the frames that carry the full-quality image.

But as long as he pulls the fully reconstructed frame in the end, it doesn't really matter what he pulls or where, beyond it trying to be interesting... well, it doesn't even need to do that. It should just cover enough of the timeline.

DonaldTsang commented 4 years ago

I might be able to give some clues.

  1. There is this thing called "scene detection" which dissects videos into scenes; it would then be easier to store hashes of scene frames rather than the hashes of all frames, reducing the amount of space needed (see the sketch after this list). https://github.com/Breakthrough/PySceneDetect
  2. Perceptual Video Hashing is also a thing; think pHash but for whole videos. That is still in the paper phase, though.
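For idea 1, a minimal sketch with PySceneDetect's high-level API (assuming v0.6+); a deduper would then hash one representative frame per scene instead of every frame:

from scenedetect import detect, ContentDetector

# detect() runs content-aware scene detection and returns a list of
# (start, end) FrameTimecode pairs, one per scene.
scenes = detect('video.mp4', ContentDetector())

for start, end in scenes:
    # A deduper would store a perceptual hash of, say, the scene's middle
    # frame here, keyed by the scene's duration for later weighting.
    print(start.get_seconds(), end.get_seconds())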
bbappserver commented 4 years ago

Keyframes from the codec are probably a decent idea, but some experimenting could be needed to see how accurate they actually are in practice.

Background

There are basically three types of frame in a modern compressed sequence: i-frames, which encode a complete picture on their own; p-frames, which encode changes relative to previous frames; and b-frames, which encode changes relative to both previous and following frames.

If two video sequences are coded differently, it is quite possible that their selections of i-frames diverge considerably, especially if the videos only partially overlap or the quality of one sequence is significantly diminished.

If you fully decode both streams and replay them with your own change threshold, the likelihood that you both decide the same keyframes are important goes up considerably, since you're just doing a forward scan through the fully reconstructed sequence. You still pick out a small number of keyframes, but you don't drop any for compression efficiency and replace them with b-frames, because you actually care about their signatures; in compression, as long as you can reconstruct the sequence, you're happy.
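A minimal sketch of that forward scan with PyAV and NumPy; the change threshold here is an arbitrary assumption:

import av
import numpy as np

def own_keyframes(path, threshold=25.0):
    # Fully decode the stream and pick our own keyframes wherever the mean
    # absolute pixel change since the last kept frame is large, independent
    # of where the encoder happened to place its i-frames.
    last = None
    with av.open(path) as container:
        for frame in container.decode(video=0):
            gray = frame.to_ndarray(format='gray').astype(np.float32)
            if last is None or np.abs(gray - last).mean() > threshold:
                last = gray
                yield frame.to_image()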

Another problem is that some video coding formats have no or only a limited concept of i-frames, and if you have an old stream you aren't willing to transcode and/or can't repackage, then relying on i-frames will fail.

Illustrating the problem with using i-frames

For a degenerate example that demonstrates the principle, consider a video sequence made of a single i-frame and 500 p-frames. This video is very slow to seek, but can fully reconstruct the sequence at any frame. However, there is only a single i-frame available for identification.

Suppose the first sequence is a TV show episode and there exists a second sequence which is the same episode but with a fade from black frontloaded, so its first i-frame is entirely black. Suppose it then chooses its next i-frame to be somewhere in the middle of the opening sequence, 1 second after where the first sequence's i-frame falls.

As far as a perceptual hashing algorithm is concerned, these sequences are completely different.

Example: Time shifted videos

This is a 22 second clip of the start of Big Buck Bunny and an 8 second clip of that clip starting after the fade from black; this is what H.264 thinks is important:

ffmpeg -skip_frame nokey -vsync 0 -i bigbuckclipped.mp4 -f image2 thumbnails-%02d.jpeg

[thumbnail grids: the 22 s clip yields 6 i-frames; the 8 s clip yields only 1]

I think more experimenting is needed with more synthetic videos to see how closely these align: resizing, different codecs, alternate color spaces (RGB vs. YUV). Obviously, for this case there is a panning shot and part of the bottom is clipped in the 8 second clip; this might be fine, as the feature transform of a dhash would probably consider that not too dissimilar.

It does, however, illustrate that a slight shift in timing considerably alters which frames were considered important for the purposes of compression.

But the million dollar question is: if we throw in a clip of a video that has been run through the wringer alongside a high quality source version of that video, how well do they match up, and what is a good threshold for considering them similar?

Maybe as long as two sequences have a good number of jump cuts they will always line up.

Example 2: But Muh AVIs

Did you know indexes are optional in some file formats? https://docs.microsoft.com/en-us/windows/win32/directshow/avi-riff-file-reference#avi-index-entries If you ask for i-frames from some older codecs, you could get zero. In that case you have to calculate your own keyframes anyway. If the codec supports i-frames but the container doesn't, it should still work, but that isn't very well documented online, so this needs to be verified.

Pyav equivalent to get keyframe images in memory

It would be inconvenient to grab all the keyframes as JPEGs and then read them back, but with PyAV you can use libavcodec to stream them. https://pyav.org/docs/stable/cookbook/basics.html#saving-keyframes


import av
import av.datasets

# path = av.datasets.curated('pexels/time-lapse-video-of-night-sky-857195.mp4')
path = 'bigbuck.mp4'

def get_keyframes(path):
    with av.open(path) as container:
        # Signal that we only want to look at keyframes.
        stream = container.streams.video[0]
        stream.codec_context.skip_frame = 'NONKEY'

        for frame in container.decode(stream):
            # Use `frame.pts` for identification; a frame index won't make
            # much sense with `skip_frame` enabled.
            img = frame.to_image()
            yield img
            # img.save(
            #     'night-sky.{:04d}.jpg'.format(frame.pts),
            #     quality=80,
            # )
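To build on the cookbook snippet, a short usage sketch that turns the yielded keyframes into 64-bit perceptual hashes, assuming the third-party imagehash package:

import imagehash

# One pHash per keyframe; imagehash objects subtract to a Hamming distance,
# so two signatures can be compared frame-by-frame later.
signature = [imagehash.phash(img) for img in get_keyframes('bigbuck.mp4')]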
DonaldTsang commented 4 years ago

@bbappserver what about deriving non-i-frames that are "important" and finding a way to simplify them (à la Dandere2x), since many frames from the same scene would look similar? If the p- and b-frames have high variation, that could indicate significant change, and all we need to do is approximately reconstruct them.

In case we need to compare quality post-deduplication, there should be a separate algorithm; doing deduplication and quality checking at the same time is not a good idea.

CatPlanet commented 3 years ago

I'm afraid that for consistent results we have to drop the idea of using the i-frames included in the video file, as i-frames are convenient stops set up by a specific encoder with specific settings. Video files (and GIFs) could be re-encoded in memory using the same encoder so you can grab "real" i-frames, but that might take some time and memory. Another (derivative) idea is splitting the video from i-frame to i-frame and merging the content between them into a kind of "average image", then using this output as the "meaningful frame" along with metadata about how long that clip originally was, used as a weight. That could be used to detect video fragments in different files.

DonaldTsang commented 3 years ago

@CatPlanet that is close to my idea, but if a video has low variation then the number of i-frames would be smaller. And variations can add up really fast.

HASJ commented 3 years ago

Take a frame from the video at 15%, 30%, and/or 45%. It may or may not be a keyframe. Shrink that frame to thumbnail size. Repeat the process for all videos, then compare each video's frame set to every other frame set (Video1-15% compares with Video2-15%, and so on). If there's at least a 98% chance of the thumbnails being the same, you can chalk it up as a duplicate. Show the user the videos that are duped and allow them to select which dupe to keep (since dupes can be detected even with different video sizes/bit rates), giving preference to the largest file size/longest duration/highest resolution/highest bit rate.
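A minimal sketch of this sampling scheme, assuming ffprobe/ffmpeg on PATH and the imagehash package; mapping the 98% figure onto a bit-level threshold is my own assumption:

import json
import os
import subprocess
import tempfile

import imagehash
from PIL import Image

def duration_seconds(path):
    # ffprobe reports container metadata as JSON.
    out = subprocess.check_output([
        'ffprobe', '-v', 'quiet', '-print_format', 'json',
        '-show_format', path])
    return float(json.loads(out)['format']['duration'])

def sample_hashes(path, points=(0.15, 0.30, 0.45)):
    total = duration_seconds(path)
    hashes = []
    with tempfile.TemporaryDirectory() as tmp:
        for i, p in enumerate(points):
            frame = os.path.join(tmp, f'{i}.png')
            # Grab one frame at the given timestamp, shrunk to thumbnail size.
            subprocess.check_call([
                'ffmpeg', '-v', 'quiet', '-ss', str(total * p), '-i', path,
                '-frames:v', '1', '-vf', 'scale=64:-1', frame])
            hashes.append(imagehash.phash(Image.open(frame)))
    return hashes

# Two videos are duplicate candidates when corresponding samples are close,
# e.g. every pairwise Hamming distance is below ~6 of 64 bits.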

DonaldTsang commented 3 years ago

@HASJ Sometimes the frames can still be too similar at those percentages, so there can be ways of reducing the hash footprint further.

DuendeInexistente commented 2 years ago

Something that could help a lot for quick detection is acoustic IDs. MusicBrainz Picard uses them to check whether arbitrary sound files are music from a release. I don't know exactly how it works (it could be just hashing the soundwave), but it works even across disparate bitrates and lengths, and it takes a fraction of a second per minute of audio, so it'd be a good first check to make. If the audio matches between two files, move that pair ahead in the queue and check the video frames earlier.
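Picard's matching is built on AcoustID/Chromaprint fingerprints; a minimal sketch of that first-pass check, assuming the pyacoustid bindings are installed:

import acoustid

# fingerprint_file() decodes the audio and returns (duration, fingerprint);
# the fingerprint is robust to bitrate and container differences.
dur_a, fp_a = acoustid.fingerprint_file('file_a.mp4')
dur_b, fp_b = acoustid.fingerprint_file('file_b.webm')

# An exact match strongly suggests the same underlying audio; a real
# deduper would fuzzy-match the fingerprints instead of comparing bytes.
if fp_a == fp_b:
    print('audio matches; move this pair ahead in the frame-check queue')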

micnorian14 commented 2 years ago

Smoothbrain here: you process the first 30 frames of any video (or the first 24 of a 24 FPS one) and hash them.

Processing looks like this: if >50% of those frames are >50% similar, it's a hit and gets sent to the duplicates tab as usual. If not, it's not a hit and may require a lower similarity threshold. Tweak those numbers around and I think it would tackle most low-quality re-encodes in the wild.
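A tiny sketch of that hit test, assuming 64-bit perceptual hashes stored as integers:

def similarity(h1, h2):
    # Fraction of matching bits between two 64-bit perceptual hashes.
    return 1.0 - bin(h1 ^ h2).count('1') / 64.0

def is_hit(hashes_a, hashes_b, frame_thresh=0.5, count_thresh=0.5):
    # A hit when more than half the frame pairs are more than half similar.
    pairs = list(zip(hashes_a, hashes_b))
    similar = sum(1 for a, b in pairs if similarity(a, b) > frame_thresh)
    return similar / len(pairs) > count_thresh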

PS - I am somewhat concerned with bloated GIF versions of webms, but that's becoming less of an issue these days. I'd rather spend the processing power bulk-converting them to webm and letting the dedupe find them.

thebrokenfacade commented 2 years ago

Not too sure if this helps any, but I did find a program that does most of what is wanted for video deduplication. https://www.video-comparer.com/product-features.php

It handles several video formats, can find slightly scaled/rotated/cropped/etc. videos, and can find similar video segments within larger video files (timeline shifting). I think the software is closed source, but maybe someone can still figure out how it works.

Hopefully this helps inspire someone who actually knows what they are doing to take another look at implementing video deduplication.

RunningDroid commented 1 year ago

I think there is a way that exact duplicates could be detected very easily: hashing the raw audio and video bitstreams.

Note that FFMPEG supports hashing every frame of the raw audio & video from a file:

$ ffmpeg -i Input.webm -f framehash -
#format: frame checksums
#version: 2
#hash: SHA256
#software: Lavf58.76.100
#tb 0: 21/500
#media_type 0: video
#codec_id 0: rawvideo
#dimensions 0: 1280x720
#sar 0: 1/1
#tb 1: 1/48000
#media_type 1: audio
#codec_id 1: pcm_s16le
#sample_rate 1: 48000
#channel_layout 1: 3
#channel_layout_name 1: stereo
#stream#, dts,        pts, duration,     size, hash
0,          0,          0,        1,  1382400, 1e592ff35bd439b0449e53ec95da36a40be6a268211fda6611f4d4b50a148809
1,        336,        336,      648,     2592, f92d02a0d523b389ee1b41478a90e96e45f18bb2149dbe8656ea788f8d25dbca
1,        984,        984,      960,     3840, 3ae09eec8a20962a863fe8542ad119b04b7ebf96c58d00a5bd87d828106ce7e6
1,       1944,       1944,      960,     3840, e4e01f8cdea1291164b9ef694773b161461ddda8a69418a46840716f227df2bf
0,          1,          1,        1,  1382400, 5a464cea49575a1ab3254dd8944c3c3d78ac1860498a51adc6fdbdd19a8aab45
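To compare two files' framehash output programmatically, a quick sketch (the helper names are hypothetical, not part of FFmpeg):

import subprocess

def framehash_set(path):
    out = subprocess.check_output(
        ['ffmpeg', '-v', 'quiet', '-i', path, '-f', 'framehash', '-'])
    # Keep only the hash column of non-comment lines.
    return {line.rsplit(b',', 1)[-1].strip()
            for line in out.splitlines()
            if line and not line.startswith(b'#')}

def overlap(path_a, path_b):
    a, b = framehash_set(path_a), framehash_set(path_b)
    return len(a & b) / max(len(a | b), 1)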
DuendeInexistente commented 1 year ago

I think there is a way that exact duplicates could be detected very easily: hashing the raw audio and video bitstreams.

Note that FFMPEG supports hashing every frame of the raw audio & video from a file:

[...]

That only helps with identically encoded identical frames, which I can't see happening outside identical files the current primary dedupe catches automatically.

RunningDroid commented 1 year ago

I think there is a way that exact duplicates could be detected very easily: hashing the raw audio and video bitstreams.

Note that FFMPEG supports hashing every frame of the raw audio & video from a file: [...]

That only helps with identically encoded identical frames, which I can't see happening outside identical files the current primary dedupe catches automatically.

It's worse than that, I tested this method & when I compared example1.mp4 to example1.webm (the same file, converted for this test) I got 1.8% similarity in frame hashes. When I compared example1.mp4 to example2.mp4 I got 2.5% similarity.

FFMPEG's MPEG7 video signature support is also a dead end: the file comparison is worthless, and if you try to compare files longer than a minute or so it'll run out of RAM & die.

private02E4 commented 1 year ago

I've more or less solved this problem with a very shoddy C# program. My process is as follows:

  1. Extract every Nth frame (I do every 50th for videos and every 10th for GIFs) with FFmpeg: ffmpeg -i VIDEO_PATH -vf "select=not(mod(n\,EVERY_N_FRAMES))" -vsync vfr -q:v 1 .\%05d.jpg

  2. Hash the extracted frames with DCT (pHash) and store them in a database (I'm just dumping these to JSON). Each hash is 64 bits.

  3. Perform a diffing calculation across all videos (O(NUM_FILES^2 × NUM_FRAMES^2); slow if not multithreaded).

For each comparison of two files, I take min(FRAMECOUNT_A, FRAMECOUNT_B) random samples from both files' frame hashes and run this calculation. I ignore all black frames by comparing against a pre-hashed black image; if that score is > 0.9, I skip the hash.

If two frames have a similarity above a certain threshold, I add the score to a running sum. After all frame hashes are compared, I divide the running sum by the number of hashes included in the sum.

  4. Take all of the scores calculated in the last step, sort them in descending order, and filter out any low scores based on a threshold. Then I just set up a quick UI that shows both videos side by side so I can choose what to do with high-similarity files.
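A sketch of the comparison scoring in steps 3 and 4, assuming the 64-bit pHashes are stored as integers (threshold values are placeholders, not the tool's actual settings):

import random

def frame_similarity(h1, h2):
    # Fraction of matching bits between two 64-bit pHashes.
    return 1.0 - bin(h1 ^ h2).count('1') / 64.0

def video_score(hashes_a, hashes_b, frame_thresh=0.85, black_hash=0):
    # Sample min(len A, len B) hashes from one side, as described above.
    n = min(len(hashes_a), len(hashes_b))
    total, counted = 0.0, 0
    for ha in random.sample(hashes_a, n):
        if frame_similarity(ha, black_hash) > 0.9:
            continue  # ignore (near-)black frames
        # Best match of this frame against all of the other video's frames.
        best = max(frame_similarity(ha, hb) for hb in hashes_b)
        if best > frame_thresh:
            total += best
            counted += 1
    return total / counted if counted else 0.0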

--

My tool is held together with popsicle sticks and duct tape, so I'm not going to bother sharing it, but in terms of accuracy it works pretty well. I don't know much about Hydrus' DB schema, so I've just been having the tool delete the duplicate files and using Hydrus' maintenance to remove the entries after I'm done. In the future I might try to have it set a file relationship or copy over the tags...

I don't have much experience with pipes, but I think Hydrus could just use FFmpeg to pipe out the frames and then use the data it receives to calculate the hash. That would remove the need for temp files. Then it would just be a matter of setting up/tweaking a scoring system to determine if two videos are a match.

prof-m commented 1 year ago

@private02E4 Interesting, thanks for sharing! Do you compare all of the frame samples from video A against all of the frame samples from video B, to account for slightly different timings? How successful have you found it for comparing different file formats against one another, or different encodings within the same file format?

private02E4 commented 1 year ago

@prof-m So, I have it set to take every 50th frame from a video and I generate those hashes. Then for each frame hash in Video A, I calculate the similarity against all frames in Video B and take the highest comparison score to add to the running sum for that frame.

Encoding and quality differences don't seem to negatively affect this. Neither does, say, Video A being 1 minute long and Video B being a 10 second clip of A. It will still detect that. I think the only time it wouldn't work is if there were a scene cut every ~30 frames - that could be an issue, but then you could just hash more frames to compensate.

For Video B being a clip of Video A, this works because I have the running sum divided by the number of frames in the comparison. Video A could have 1000 frames, but since video B only has 100 and only the closest-matching frames are added to the score, the running sum is divided by 100. It will still be able to detect that they contain the same content at some point in the file.

I just ran it on a database of about 20k videos. The frame hashing took a few hours because of the sheer quantity of files and overhead of calling ffmpeg tens of thousands of times, but the hash comparisons only took about 20 minutes when I multithreaded it. I'm sure there's plenty of room for optimization too.

One last thing to note: after hashing all of the frames, the JSON database for those 20k videos was only about 45 MB, and that's with the overhead of it being UTF-8 encoded. If the database schema is optimized well, I don't see it causing a huge problem with inflation of the sqlite file size.

prof-m commented 1 year ago

@private02E4 very cool, thanks for sharing! It's a simple but evidently really powerful idea you've executed here. Given Hydrus' existing sqlite and deduplication strategies, it does sound like it'd fit right into Hydrus well. Hopefully dev sees this and thinks it's neat!

micnorian14 commented 1 year ago

As this issue (read: feature request) has been ongoing since 2016, I've resorted to third party apps in the interim. They include VideoDuplicateFinder by 0x90d and GridPlayer by vzhd1701, available on their respective GitHub pages. These apps significantly improved my process of manually detecting and deleting dupes by hand. You could just filter results in Hydrus by tags down to a single page (e.g. artist, video, animated, hassound, etc.) and then process those.

I'll be honest: I have no idea how to handle the iceberg that is extracting/detecting/hashing/comparing/presenting duplicate videos for processing within Hydrus, and I understand it will probably never happen. I believe a very basic detection method (first-frame extract and hash) would be a beneficial baby step in the right direction.

prof-m commented 1 year ago

@private02E4 I had some extra time recently and decided to start looking into implementing the strategy you outlined in Hydrus itself. In doing so, I found some stuff I thought you might find interesting, so I figured I'd share it here.

If my free time and interest continues (which I make no promises on), I'll probably keep working on implementing what you suggested in Python - either as an external tool that uses the Hydrus API, or as a direct fork of Hydrus itself.

Zweibach commented 1 year ago

If you look under the set file relationships section, you'll see that you can choose whether to use Hydrus' content merge settings or provide your own from outside Hydrus when setting relationships through the API.

appleappleapplenanner commented 1 year ago

Hey guys, I've been working on my own video deduplicator the last few days. Check it out!

It perceptually hashes every video, stores the hash in a local database, then compares all of them to each other to check for similarity. If they're similar, they're marked through the Hydrus API as potential duplicates so you can go through them in duplicates processing. I think someone mentioned something similar to this in this thread and it works.

I'm using a perceptual hasher from Facebook called vpdq, which is used for detecting videos that were deemed harmful, so the accuracy is extremely good. I have only seen a few false positives, which were alternates.

It works very well on my small library, but some other testers would be great.

micnorian14 commented 1 year ago

Hey guys, I've been working on my own video deduplicator the last few days. Check it out!

It perceptually hashes every video, stores the hash in a local database, then compares all of them to each other to check for similarity. If they're similar, they're marked through the Hydrus API as potential duplicates so you can go through them in duplicates processing. I think someone mentioned something similar to this in this thread and it works.

I'm using a perceptual hasher from Facebook called vpdq, which is used for detecting videos that were deemed harmful, so the accuracy is extremely good. I have only seen a few false positives, which were alternates.

It works very well on my small library, but some other testers would be great.

~~I'm giving it a try, unfortunately my library is huge. If I'm reading this right, the ETA for my library to hash would take over a week. That's probably due in part to my slow hardware. I've fiddled around with the api access trying to whitelist "system:filesize < 200kb" or blacklist "system:filesize > 200kb" to limit the program to a sample size. That way it doesn't try to process gigabytes of webm/mp4 files. Unfortunately I get errors about permissions when I do that. Implementing customizable tags into the program (or getting it to work with the built in api access tags) would allow people with large libraries to iterate with smaller more specific batches, rather than their entire library at once. It might already be possible and I couldn't figure it out. Doh. This project of yours does look promising! I can imagine this becoming main!~~

I was not aware I could append arguments like --query="system:filesize < 10MB"

appleappleapplenanner commented 1 year ago

I'm giving it a try, unfortunately my library is huge. If I'm reading this right, the ETA for my library to hash would take over a week. That's probably due in part to slow hardware. I've fiddled around with the api access in trying to only permit "system:filesize < 200kb" to limit the process to a sample size. That way it doesn't try to process gigabytes of webm/mp4 files. Unfortunately I get an error about permissions when I do that. Implementing a whitelist/blacklist into the program (or getting it to work with hydrus's built in one for api access) would allow people with large libraries to iterate with smaller batches. It might already be possible and I couldn't figure it out. Doh. This project of yours does look promising!

You can do all of the functions I can think of that Hydrus can do with regards to filtering directly with the --query command. For instance, you can limit the number of files with --query="system:limit 1000". I will add a quick notice about that for people who want to try it out on the wiki. You can also have multiple queries by adding multiple --query commands to further filter the files you want, e.g. --query="system:filesize > 10MB" --query="system:limit 1000" --query="system:archive" --query="system:import time < 0 years 0 months 0 days 1 hour"

I'm not sure what you mean by blacklist and whitelist. If you do the filter "system:filesize < 5KB", it will blacklist all files greater than 5KB and whitelist all files < 5KB. You can also directly blacklist specific files with "system:hash is not ..." and, if you want to directly compare two or more files, you can do "system:hash is ...". I'm thinking of ways to speed up the import, but they all reduce the flexibility people have with filtering. See the full list of system predicates here.

The errors about permissions are concerning and I strongly suggest you open an issue with more information so I can look at it. Perhaps your Client API key's permissions are too limited, because with all permissions I'm able to do searches just fine.

Additionally, at any time you can cancel the perceptual hashing step with CTRL+C or pass the --skip-hashing option and start searching for duplicates with what you have already hashed. But currently the queries do not affect the searching of duplicates, so that's something I'm looking into. Also, files are not perceptually hashed more than once unless you use --overwrite, so skipping files already in the database as a way of resuming progress should work fine.

I just started working on an update to speed up duplicate searching so hopefully that will help.

I appreciate the support and for you trying it out in this early state.

Please create an issue on the project page if you have further discussion. I don't want to clutter this issue with project specific things.

floogulinc commented 1 year ago

I'm not sure what you mean by blacklist and whitelist.

They're talking about the tag restrictions you can set for an API key but they don't seem to understand how that works.

The errors about permissions are concerning and I strongly suggest you open an issue with more information so I can look at it. Perhaps your Client API key's permissions are too limited, because with all permissions I'm able to do searches just fine.

That's because they tried to set restrictions for the API key, but when you do that the query needs to match those restrictions, and if you try to do metadata lookups with hashes, or not directly after the matching query, there will also be errors.

appleappleapplenanner commented 1 year ago

I'm not sure what you mean by blacklist and whitelist.

They're talking about the tag restrictions you can set for an API key but they don't seem to understand how that works.

The errors about permissions are concerning and I strongly suggest you open an issue with more information so I can look at it. Perhaps your Client API key's permissions are too limited, because with all permissions I'm able to do searches just fine.

That's because they tried to set restrictions for the API key, but when you do that the query needs to match those restrictions, and if you try to do metadata lookups with hashes, or not directly after the matching query, there will also be errors.

Oh... I didn't even think about people using those. I should probably write something about that in the FAQ.

micnorian14 commented 1 year ago

The only other Client API program I've had experience with before was the Companion App. I was not aware of any arguments for the program, like passing on queries etc. The argument --query="system:duration < 1s" does exactly what I wanted. Thank you. This saves me from opening an issue titled "Include extra documentation for the smooth-brained among us".

EDIT: Well, it appears to... just work. This is amazing! "Native" video deduplication within Hydrus! The processing ETA just says "Heat Death of the Universe", but in smaller batches it is absolutely working, just as I always envisioned. Thank you @appleappleapplenanner!