hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/

Associated urls prevent downloading alternative images (by default) #1438

Closed: brachna closed this issue 10 months ago

brachna commented 12 months ago

Hydrus version

v543

Qt major version

Qt 6

Operating system

Windows 10

Install method

Extract

Install and OS comments

No response

Bug description and reproduction

Warning: furry NSFW example.

  1. download https://blacked.booru.org/index.php?page=post&s=view&id=65569
  2. now download https://e621.net/posts/4270369?q=smewed
  3. the original (2nd) image won't download; Hydrus thinks it has already been downloaded and shows the 1st image as proof

It seems the associated URL check runs on the assumption that an associated source URL points to the exact same file. But that's wrong. There are a lot of edits, colorings, and translations that cite the original unedited image as their source; that's just how the internet is. Not to mention that the source image can be a different quality as well. The only way to download the original image now is to go to Import Options->File Tab and set "Check URLs..." to "do not check". The tool-tip warns me in all caps not to do that, but wth else am I supposed to do? I'll have to do that for all my saved subscriptions too.

The only way to safeguard against this further is to uncheck "associate (and trust) additional source urls" in the default file import options, or to go through every single file page parser and delete the associated/source URL extraction entirely. That would make subscriptions work as they should and spare me the paranoia when I see "already downloaded" messages. Actually, it won't: only "do not check" will save me from this, because a ton of images already have associated source URLs and will block original/alternative image downloads whenever their URLs are encountered.

I haven't looked into the internals, but I assume URLs are stored and checked as a plain list, something like `urls = ["https://blacked.booru.org/index.php?page=post&s=view&id=65569", "https://e621.net/posts/4270369"]`. What if instead it was `[{"url": "https://blacked.booru.org/index.php?page=post&s=view&id=65569", "checked": True}, {"url": "https://e621.net/posts/4270369", "checked": False}]`? Then Hydrus would check a URL anyway if its "checked" flag is False, download the image (regardless of whether it's a dupe or not), and set the flag to True afterwards.
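To sketch what I mean in pseudo-Python (completely made-up names, just to show the idea, not Hydrus's actual internals):

```python
# Sketch of the proposal: store a 'checked' flag alongside each associated URL,
# so "trusted" source URLs are still fetched at least once before they can veto
# a download. Names and structure are hypothetical.

def should_skip_download(url_records, url):
    """Only skip a download for a URL that has actually been visited before."""
    for record in url_records:
        if record["url"] == url and record["checked"]:
            return True  # we really downloaded from this URL already
    return False  # unknown URL, or only associated as an untested source

def mark_checked(url_records, url):
    """After a real download from `url`, flag it as verified."""
    for record in url_records:
        if record["url"] == url:
            record["checked"] = True
            return
    url_records.append({"url": url, "checked": True})

# The e621 post was only ever *associated* via the blacked.booru source field,
# so it would still be downloaded when a subscription actually hits it.
urls = [
    {"url": "https://blacked.booru.org/index.php?page=post&s=view&id=65569", "checked": True},
    {"url": "https://e621.net/posts/4270369", "checked": False},
]
assert not should_skip_download(urls, "https://e621.net/posts/4270369")
mark_checked(urls, "https://e621.net/posts/4270369")
assert should_skip_download(urls, "https://e621.net/posts/4270369")
```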

The default import option right now is a beginner trap. This was a very unpleasant discovery for me, as I have already downloaded a lot of images and set up subscriptions. The tool-tip for "associate (and trust) additional source urls" says to uncheck it if I believe the site supplies bad source urls. That would be every single site in existence (edits, the uploader copy-pasted the wrong url, the image has a different resolution, etc.).

Log output

No response

hydrusnetwork commented 12 months ago

Thank you for your report.

Unfortunately, yes, hydrus is not yet clever enough to remember which URLs it has actually visited. I think your idea of storing a 'checked' value is a good one, and when I get to the next big URL storage overhaul, I think I will add it. I am not sure when that will be.

Note though that this problem is not so severe. In general, the instances of source being incorrect tend to fall into one of three categories:

  1. An artist profile link, a tumblr blog or twitter link, just to their account homepage.
  2. An altered version of a file pointing to the original master.
  3. A single actually incorrect link.

The good news is we usually only have to worry about Case 3. As you say, the internet is full of weird connections, and I have added some logic to the pre-downloader code that tries to notice untrustworthy URLs. If a URL appears to refer to multiple files (e.g. let's say you downloaded multiple WIP or costume variants on blacked, and they all pointed to the same master on e621), that e621 URL is discarded in future checks (this also nullifies Case 1). If a URL points to a file that has another URL at the same domain (e.g. you downloaded the same file from two places and they disagree on its original source on site x, such as if you searched e621 and found the true source of the 65569 file on there), it is similarly dropped. These hacks eliminate most of the Case 2 situations, perhaps not in the first pass, but more so in subsequent subscription runs on the same or multiple sites. The full picture of correct sources adds up, hydrus infers that there is a knot or bump in its URL knowledge, and it ultimately finds the missing file.
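Roughly, in made-up pseudo-Python (illustrative names and data structures, not the actual client code), those two heuristics look like this:

```python
# Sketch of the two "untrustworthy source URL" heuristics described above.
from urllib.parse import urlparse

def trustworthy_source_urls(file_hash, candidate_urls, url_to_hashes, hash_to_urls):
    """Return only the candidate source URLs allowed to influence skip-decisions.

    url_to_hashes: known URL -> set of file hashes it has been seen attached to
    hash_to_urls:  file hash -> set of URLs already associated with that file
    """
    kept = []
    already_known = hash_to_urls.get(file_hash, set())
    for url in candidate_urls:
        # Heuristic 1: a source URL already associated with multiple different
        # files (e.g. several WIP/costume variants all pointing to one e621
        # master) cannot identify any one file, so ignore it.
        if len(url_to_hashes.get(url, set())) > 1:
            continue
        # Heuristic 2: if this file already has a different URL on the same
        # domain, the two sources disagree, so drop the new one as well.
        domain = urlparse(url).netloc
        if any(urlparse(u).netloc == domain and u != url for u in already_known):
            continue
        kept.append(url)
    return kept

# e.g. two blacked.booru variants both citing the same e621 master means the
# e621 URL no longer counts as proof of "already downloaded".
url_to_hashes = {"https://e621.net/posts/4270369": {"hash_a", "hash_b"}}
hash_to_urls = {"hash_c": {"https://blacked.booru.org/index.php?page=post&s=view&id=65569"}}
print(trustworthy_source_urls("hash_c", ["https://e621.net/posts/4270369"], url_to_hashes, hash_to_urls))
# -> []
```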

What we absolutely miss, without hope, is Case 3. This matters most for rare files and smaller boorus: perhaps someone hand-pasted a bad source and there is no additional data, or perhaps the hydrus user only downloads from one or two booru sources so the URL store is limited, and then hydrus can miss a file permanently. Your situation above, where users might have only uploaded a couple of sporadic alternates to different sites and then linked them spontaneously, may well be an example. I regret it.

The benefits of the current system--and why I have the options set by default to use it everywhere--are that it is simple and it saves a lot of time and bandwidth. This logic can already get complicated, so I will have to be careful about adding another layer of metadata here. I also don't want to end up scheduling six times the page downloads simply to recover what might be 0.2% of missing files. Since most hydrus users are absolutely overwhelmed with way too many files already, I don't want to get too occupied looking after strays. I have a related problem in that most sites now use CloudFlare or similar to mirror their content, and in rare situations those CDNs 'optimise' files in their mirrors, which means the hash of a file you download may not be what hydrus thought it was yesterday, or what the site said it would be. It is the same with the different image qualities you mention: while we certainly want the higher quality when possible, we probably don't want the lower ones, so perhaps I should figure out better easy-automatic duplicate decision-making first. There is a certain amount of inherent messiness here, and some solutions, if we aren't careful, will overload us with one sort or another of spam.

A good mitigation right now is to make sure you are downloading from multiple locations. If the e621 link is messed up, it won't be on another site, and holes tend to fill in over time.

brachna commented 12 months ago

I wish I had known about that option beforehand, because it's such a deal-breaker for me. Bandwidth is not a problem for me personally since I have an unlimited internet plan, but I can see it being very useful in other situations. For example, the kemono.party downloader: that one transforms the api link into the normal post URL and then associates it as a POST url. For that one I enable "associate (and trust) additional source urls"; by default I keep it OFF. The point is that when I do this I am making an informed decision and taking responsibility; I know now what it entails if I get it wrong. I would never enable this for any booru if it means those URLs will be counted and I could potentially miss files.

Anyway, I tried to remedy this by mass-deleting all URLs from all files; I would then simply regenerate them by re-downloading most of the files. At first I tried using manage->urls to mass-delete them (~250000 selected files). Hydrus did... something for 25 hours, then I gave up and closed it. A suggestion would be to add some sort of progress bar for slow operations like these (my Hydrus db is on a non-OS SSD); without one it's unclear whether it was even working or had hit some race condition or something. What I did then was open client.master.db in SQLiteStudio and clear the urls table entirely. At first everything was smooth, but then I noticed some re-downloaded images had multiple URLs from the same domain, obviously referring to different files. So I cleared the urls table in client.master.db again and also cleared the urls table in client.db. Then I deleted client.caches.db for good measure. Hydrus gave a ton of warnings on boot, but so far it seems to work just fine. Or am I missing something?

I may come off as ranty, don't get me wrong. Hydrus is a fantastic program and I thank you very much for working on it.

floogulinc commented 12 months ago

It's really not advised to go editing the DB like that without actually knowing how it works and having a backup.

brachna commented 12 months ago

Agreed, it was done mostly out of desperation. Worst case scenario I'll make a clean Hydrus installation and import client_files to it.

hydrusnetwork commented 10 months ago

Yeah, if you are going into the database, make sure you have a good backup. Then, if, when you boot, you get weird behaviour, you can roll back to the backup and try again.

There is another table that relies on the urls table, the url_map in client.db. Is that what you cleared in your second attempt? If so, I think you are now good.
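For reference, the by-hand clear boils down to roughly this (a sketch only: the table names are the ones we have been discussing in this thread, the real schema has more going on, and nobody should run anything like it without the client shut down and a verified backup):

```python
# Hypothetical sketch of clearing the URL store directly, as discussed above.
# Only the table names (urls in client.master.db, url_map in client.db) come
# from this thread; everything else is an assumption. Back up first.
import sqlite3

def clear_url_tables(db_dir):
    with sqlite3.connect(f"{db_dir}/client.master.db") as master:
        master.execute("DELETE FROM urls;")     # the stored URL text itself
    with sqlite3.connect(f"{db_dir}/client.db") as client:
        client.execute("DELETE FROM url_map;")  # the file-to-URL associations
```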

Sorry for the trouble on the 25k-strong hanging dialog. Most of my dialogs are only ok up to a few hundred or a thousand files. I don't have good scaling tech or asynchronous UI updates on the rarer commands yet.

I think you have figured out a method to get your downloaders working how you want here, so I'm going to close the issue.

brachna commented 10 months ago

I decided to make a fresh db. It took a while, but better now than later. I'll keep url_map in mind, thanks.