hydrusvideodeduplicator / hydrus-video-deduplicator

Video Deduplicator for the Hydrus Network
https://hydrusvideodeduplicator.github.io/hydrus-video-deduplicator/
MIT License

Suggestion: Supplemental Usage info and helpful tips for newer users #13

Closed micnorian14 closed 1 year ago

micnorian14 commented 1 year ago

After some fiddling around I did eventually get things to work as intended. At first, running the app as suggested with no query attached will attempt to read every file matching "system:filetype is video" in your entire library. For users with more than 1,000 webm/mp4 files this results in a queue several days long. I found that adding an argument like --query="system:duration < 2s" was especially helpful for initial prototyping and, later, for narrowing searches down. I suggest including this by default or at least mentioning it on the usage page.
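A minimal sketch of that narrowing step, written as a small Python helper that only builds the command line. The module name `hydrusvideodeduplicator` passed to `-m` and the helper `build_dedupe_command` are assumptions made for illustration; only the `--query` argument and the `system:duration < 2s` predicate come from the comment above, so check `--help` for the real invocation.

```python
import shlex
import sys


def build_dedupe_command(query: str) -> list[str]:
    """Return an argv list that runs the deduplicator on one Hydrus search."""
    # Assumption: the tool is started as a module named "hydrusvideodeduplicator".
    # Only --query is taken from this thread; other options are deliberately omitted.
    return [sys.executable, "-m", "hydrusvideodeduplicator", f"--query={query}"]


if __name__ == "__main__":
    command = build_dedupe_command("system:duration < 2s")
    # shlex.join quotes the query so it pastes into a shell as a single argument.
    print(shlex.join(command))
```

Starting with a query that matches only a handful of short files keeps the first run brief, which makes it easier to confirm the setup works before widening the search.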

I'd also like to suggest a few tips since few (if any) users have used the duplicate processing function for video files before now. Most of this is common knowledge but including it for context and clarity seems appropriate.

1) Most duplicate videos scraped from various boorus will be re-encoded or compressed copies (usually in webm format) that were themselves taken from other websites, which likely compressed them for streaming purposes. This process is lossy at best, resulting in progressively lower quality copies of the same file being distributed across these websites.
2) A source-quality file will almost always be a much larger and/or higher bitrate mp4 or webm.
3) When sorting files in the duplicate processor, be wary of same-quality alts or a creator's updated/corrected release (usually dated the following month).
4) Many 3rd party edits will simply add or replace audio.
5) Most booru style websites compress videos to webm, so finding the mp4 version of a video likely means it has not been compressed, indicating it is a source-quality file. This has been true in every instance I've found, but there may be websites that still re-encode videos in mp4.
6) Occasionally some chad will take a source-quality file and interpolate it to a higher framerate with better compression settings/codec. This results in a source-quality video but with a perceived higher framerate. These versions can be smaller in filesize by as much as 30%!
7) Most authors distribute higher quality versions of their content to paid supporters, usually omitting their watermark. Files without watermarks are almost always higher quality as a result.
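Tips 2, 5 and 7 can be read as a rough ranking heuristic. The sketch below is only an illustration: the `VideoCopy` fields, the function names, and the priority order (watermark first, then container, then bitrate) are all assumptions, not anything the deduplicator or Hydrus provides. Tip 6 is a known exception, since a well-encoded copy can be smaller and still be the better file, so the result is a hint for manual review and not a verdict.

```python
from dataclasses import dataclass


@dataclass
class VideoCopy:
    name: str
    container: str       # e.g. "mp4" or "webm"
    size_bytes: int
    duration_s: float
    has_watermark: bool

    @property
    def bitrate_kbps(self) -> float:
        # Average bitrate: total bits divided by duration, in kilobits per second.
        return self.size_bytes * 8 / self.duration_s / 1000


def source_likelihood(copy: VideoCopy) -> tuple:
    """Sort key: a higher tuple means a more likely source-quality copy."""
    return (
        not copy.has_watermark,    # tip 7: no watermark suggests a supporter release
        copy.container == "mp4",   # tip 5: mp4 is less likely to be a booru re-encode
        copy.bitrate_kbps,         # tip 2: higher bitrate suggests less compression
    )


def rank_copies(copies: list[VideoCopy]) -> list[VideoCopy]:
    # Best candidate first; the priority order is an arbitrary choice for this sketch.
    return sorted(copies, key=source_likelihood, reverse=True)


if __name__ == "__main__":
    copies = [
        VideoCopy("booru.webm", "webm", 4_000_000, 30.0, True),
        VideoCopy("public.mp4", "mp4", 20_000_000, 30.0, True),
        VideoCopy("supporter.mp4", "mp4", 18_000_000, 30.0, False),
    ]
    for copy in rank_copies(copies):
        print(f"{copy.name}: {copy.bitrate_kbps:.0f} kbps")
```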

appleappleapplenanner commented 1 year ago

Thanks for the suggestions.

The defaults should remain as simple as possible. By adding more constraining defaults, I would need to add a parameter to remove defaults which makes people have to read more documentation, read more of the --help, etc. I want things to be as simple, and more importantly as straightforward as possible. There is no magic here; if you have 100k video files you will have to hash all of them if you want to sort them all for duplicates. How you choose to do that is and should be up to you.

I would like to improve the chunking while processing, more specifically the perceptual hashing. But those improvements would probably be silent to the user. There's no getting around hashing the videos you want to hash. But, if there are issues with things like running out of memory, crashes, etc that would be a legitimate concern and should be reported to me with as much information as possible.

I will add some tips about video comparison, but really it's just if one looks better than the other then archive that. If you have a shit ton of files and you aren't trying to archive them then it's as simple as that. People who care more about what videos they want to save probably have already invested the time into learning those things. The Hydrus duplicates page already has far more information than I would like to write. I'll add a link to it in Usage or the FAQ.

What fiddling around did you have to do to get the program working if you remember? I want the documentation to be as clear as possible since people already have to go through the hassle of installing WSL. I will probably end up adding a script to install WSL and the program for users who just want it to work.

GoAwayNow commented 1 year ago

> I want the documentation to be as clear as possible since people already have to go through the hassle of installing WSL. I will probably end up adding a script to install WSL and the program for users who just want it to work.

Yeah, that might be helpful. Nothing I've tried to do has allowed me to run this under WSL. I absolutely cannot seem to get a functioning version of Python above 3.9 on Ubuntu 20.04.

appleappleapplenanner commented 1 year ago

> Yeah, that might be helpful. Nothing I've tried to do has allowed me to run this under WSL. I absolutely cannot seem to get a functioning version of Python above 3.9 on Ubuntu 20.04.

Try this guide. Ubuntu should come with Python 3.10 by default though, so I'm guessing you installed your distro a while ago?

I can't stand Ubuntu's package management (and therefore 99% of Ubuntu) so I probably can't help you much past that.

micnorian14 commented 1 year ago

In regards to WSL

appleappleapplenanner commented 1 year ago

> In regards to WSL
>
> * You need to disable NTFS compression on the folder C:\Users\username\AppData\Local\Temp. Otherwise networking may not initialize. Dunno if this bug was fixed or not.
>
> * If you don't use WSL very often and get errors trying to install dependencies, it's because you haven't run sudo apt update && sudo apt upgrade in a while.
>
> * The SSL error upon launch was due to the program preferring HTTPS over HTTP (which it should). Enable HTTPS in the Hydrus client API settings.

Interesting. I've always read that it's not good to use NTFS compression in general for a number of reasons, so I've never tried it, ESPECIALLY on entire drives. I don't trust it. Check out this project to compress specific directories if you're interested, it's really good!

I'll tell users to update and upgrade before installation like every other program does.

SSL is already on the Wiki under Usage, but I just added a second bullet clarifying.
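To illustrate the HTTPS point, here is a small sketch that rewrites an `http://` Client API address to `https://` before it is handed to a tool that prefers HTTPS. The helper name `ensure_https` is hypothetical and the address is only an example. Rewriting the URL does not help unless HTTPS is also enabled in the Hydrus Client API settings, as described above.

```python
from urllib.parse import urlsplit


def ensure_https(api_url: str) -> str:
    """Return api_url with an http:// scheme switched to https://."""
    parts = urlsplit(api_url)
    if parts.scheme == "http":
        # Only the scheme changes; host, port and path are kept as given.
        return parts._replace(scheme="https").geturl()
    return api_url


if __name__ == "__main__":
    # Example address only; use whatever host and port your client listens on.
    print(ensure_https("http://localhost:45869"))
```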

GoAwayNow commented 1 year ago

> Try this guide.

I did try deadsnakes. Got 3.10 installed, but couldn't get it to work. Couldn't get pip to install to it. I wasn't aware of the -full packages they offer, though.

> I'm guessing you installed your distro a while ago?

A few years, actually.

> I'll tell users to update and upgrade before installation like every other program does.

I did that, too. It didn't advance Python versions for me. I ran: sudo apt update && sudo apt upgrade -y. It's possible I did this in the wrong order at first, but I ended up doing it a second time later out of desperation, and I'm sure that when I uninstalled the distro entirely I was still stuck on 3.8.
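For anyone debugging a similar setup, a quick preflight check can show which interpreter is actually running and whether pip is attached to it. This is a sketch under assumptions: the minimum of 3.10 is inferred from this thread rather than taken from the project's documentation, and the function names are hypothetical.

```python
import importlib.util
import sys

# Assumption inferred from this thread: the program needs Python newer than 3.9.
MINIMUM_VERSION = (3, 10)


def meets_minimum(version=sys.version_info) -> bool:
    """True if the given (major, minor, ...) version is at least MINIMUM_VERSION."""
    return tuple(version[:2]) >= MINIMUM_VERSION


def has_pip() -> bool:
    """True if the pip module can be found by the interpreter running this script."""
    return importlib.util.find_spec("pip") is not None


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("version ok:", meets_minimum())
    print("pip available:", has_pip())
```

Running the script with a specific interpreter, for example a separately installed python3.10, reports on that interpreter and not on the system default, which helps tell "3.10 is not installed" apart from "3.10 is installed but has no pip".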