hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/

Doesn't detect hash correctly when importing #1563

Closed Cever77 closed 2 months ago

Cever77 commented 3 months ago

Hydrus version

v577

Qt major version

Qt 6

Operating system

Windows 10

Install method

Extract

Install and OS comments

No response

Bug description and reproduction

When importing via URL (new page - download - urls), the program incorrectly detects the picture hash. It considers these two pictures to be the same: https://derpibooru.org/images/1541005 https://derpibooru.org/images/1541004

As a consequence, it doesn’t allow you to load the second image if the first one is already imported. The status shown is: "url recognised: Imported at 2024-06-03 10:34:47, which was 7 seconds ago before this check."

Previously I used version 558, there is no such problem.

Log output

No response

Zweibach commented 3 months ago

The error says nothing about hash. It says the URL is already known. Looking at the posts, they share a source URL, which Hydrus trusts. You'll have to tell it to not trust source URLs for this booru if you want to avoid this problem. This is done in the URL classes for derpibooru.

Cever77 commented 3 months ago

If I double-click on this error, a page with a hash search will open, where one of these pictures will be, depending on which one was downloaded earlier. That is, if I download one picture, the program does not allow me to download the second

Cever77 commented 3 months ago

https://github.com/hydrusnetwork/hydrus/assets/171563534/8e377e0d-0e70-43f1-af23-875754e54031

Here's a video

Pictures differ in the character's skin color, if this is not immediately noticeable

floogulinc commented 3 months ago

For one the paste button takes whatever is in your clipboard and adds it to the queue so you don't need to paste in the url box and click that button lol.

Anyway Zweibach already explained what is happening. Both derpibooru posts have the same source url from deviantart and that known url must have a url class with "post can produce multiple files" unchecked. Hydrus sees the second post has the same source URL and considers it already downloaded. You can turn off this check in the import settings for the page.

brachna commented 3 months ago

Yeah it's not a hash problem, it's additional source url crap. Go to Options -> Importing -> File import options and uncheck "Associate (and trust) additional source urls" for both. But keep in mind that some Downloaders, like the kemono one, use an api and generate their only post url as an additional source url. Also unchecking it won't help with post urls that are already in the database.

1438

floogulinc commented 3 months ago

@brachna that's not really the preferred solution since then hydrus won't store the additional known urls at all. The thing to do is change the import settings for that context to not check known urls.

I do think hydrus at least needs the option to not use the additional known urls from the parser of the current url to match existing urls.

brachna commented 3 months ago

Depends on user's needs. Right now I do the opposite: turned off by default, enable for downloaders that generate their post urls.

hydrusnetwork commented 2 months ago

Thank you for this report. This is a tricky problem to think about, and for a while I was not confident about how your exact situation would occur and how it would/should be dealt with. I have had a think and am happy to say I believe I have fixed it for next week just by updating the normal URL-checking logic.

Under 'file import options', there is a checkbox for 'during URL check, check for neighbour-spam?'. Previously, this would filter the URLs a bit and say, 'If, during pre-import predictions, a URL we have parsed seems to match with a file, does that file have any other URLs of the same type? If so, do not trust this URL-file match.' There is a separate check, always done, which is 'if this potential lookup URL is supposed to only apply to one file but actually applies to multiple, distrust it.' These rules would catch some situations, and would catch your situation if you were able to force the download and thus did already have a double-mapping of the DA URL, but you are correct that they do not catch the state where the client has only downloaded one but not the second.

I have fixed this by expanding the 'neighbour' test to say 'If, during pre-import predictions, a URL we have parsed seems to match with a file, does that file have any other URLs of the same type as any of our parsed lookup URLs? If so, do not trust this URL-file match.' This now catches the situation.

To state this clearly let's say the two derpi URLs here, for files F1 and F2, are A1 and A2, and the incorrect Deviant Art source URL is B. On the first download, hydrus assigns A1, B to F1.

Previously, on the second download it would parse A2, B, and say 'hey, B seems to be F1. does F1 have any other "B" URLs?', which would of course not be true, so it would seem that B was a trustworthy source. Now it says 'hey, B seems to be F1. does F1 have any other "A2" or "B" URLs?' and then it now says "hey, F1 has A1 already, that looks like A2. I think B must be some borked mapping mate, do not trust it." and it goes ahead with the download.
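For readers who want to see the A1/A2/B walkthrough as logic, here is a minimal Python sketch of the old and expanded neighbour tests. It is not hydrus's actual code: the function names are hypothetical, a URL's "type" is approximated by its domain rather than by a real hydrus URL class, the DeviantArt URL is a made-up placeholder, and which derpibooru post counts as A1 is chosen arbitrarily.

```python
from urllib.parse import urlparse


def url_type(url: str) -> str:
    # Assumption: stand-in for a hydrus URL class. Here the "type" is just the domain.
    return urlparse(url).netloc


def trust_match_old(lookup_url: str, matched_file_urls: set[str]) -> bool:
    # Old test: distrust only if the matched file has another URL
    # of the same type as the lookup URL itself.
    wanted = url_type(lookup_url)
    return not any(
        url_type(u) == wanted and u != lookup_url for u in matched_file_urls
    )


def trust_match_new(parsed_urls: list[str], matched_file_urls: set[str]) -> bool:
    # Expanded test: distrust if the matched file has any other URL whose type
    # matches the type of any of the URLs parsed for the current import.
    parsed = set(parsed_urls)
    wanted = {url_type(u) for u in parsed}
    return not any(
        url_type(u) in wanted and u not in parsed for u in matched_file_urls
    )


A1 = "https://derpibooru.org/images/1541005"
A2 = "https://derpibooru.org/images/1541004"
B = "https://www.deviantart.com/example-artist/art/example-1"  # hypothetical source URL

f1_urls = {A1, B}  # state after the first download: A1 and B are mapped to F1

# Second download parses A2 and B, and B appears to match F1.
old_result = trust_match_old(B, f1_urls)        # B is trusted, so the import is skipped
new_result = trust_match_new([A2, B], f1_urls)  # F1 already has A1, so B is distrusted

# Once F2 is imported with {A2, B}, looking up A2 again is still a clean match.
recheck_result = trust_match_new([A2, B], {A2, B})

print(old_result, new_result, recheck_result)
```

In this sketch the old test returns a trusted match for B (the false 'already in db'), the expanded test rejects it because F1 carries a different derpibooru URL, and a later recheck of A2 against F2 still comes back trusted.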

I'm pretty confident this does not break the normal true positive 'already in db' results.

Working:

image

And then just checking that our A1 and A2 do still give fast 'already in db' once they are imported:

image

Please try this again in v579 and let me know if you still have trouble.