hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/
Other
2.39k stars 158 forks source link

Blacklist certain subdomains from matching URL classes that otherwise allow subdomains to match #1623

Open roachcord3 opened 2 days ago

roachcord3 commented 2 days ago

Fanbox is a good example of this. They allow you to go to https://username.fanbox.cc or https://www.fanbox.cc/@username to get to the same page. Their API is hosted at https://api.fanbox.cc. This means that a URL class for fanbox.cc that matches subdomains will match api.fanbox.cc, which is not desirable, since that's not the gallery URL for an artist. This can lead to situations where it's hard to define a URL class that contains a redirect to a different subdomain of the same domain, since the class will match its own redirect URL.

So, what I'm requesting is either a subdomain blacklist, or, alternatively, allowing users to use regex in the domain part of the URL class (so that they can define something like (?!api\.)([a-z0-9][a-z0-9\-]*\.)?fanbox.cc.

floogulinc commented 2 days ago

More specific url classes should already take precedence over less specific ones. It should be the case that a url class with domain specifically api.fanbox.cc will be matched before one for fanbox.cc.

roachcord3 commented 2 days ago

@floogulinc that can be true in some cases, but not if the path components differ, which is actually the case as I'm trying to fix the latest community fanbox downloader (which is lacking gallery URLs for artists' pages and has a few other problems.) The API URLs have a required path component, while the gallery URLs don't (adding /posts is optional.) I ran into this error message about matching the URL's own redirect, and had to define the URL classes in an unnatural way, as a result.

I also think it would be beneficial for any other hypothetical site in which there are plenty of reserved subdomains for things like the API or settings pages or whatever, but they still do usernames-as-subdomains. It would be weird to have to define discrete URL classes for all the things you don't want. Of course, either way, you have to say what you don't want, but I think keeping it coupled with the main URL class makes sense both from a cleanliness perspective and for the principle of least surprise. And I guess to make it a bit harder to forget to include the URL classes if you export the downloader, too.

With that said, since there are workarounds for many cases, I wouldn't call this an urgent feature request by any means. It would just be nice to not have to define janky extra URL classes.

floogulinc commented 2 days ago

If it's not properly prioritizing the one with a defined subdomain I'd consider that a bug then

roachcord3 commented 1 day ago

@floogulinc is it? I thought required path components would make it not match.

I'll be more clear in the specifics of the problem here. See, I wanted to add a gallery URL class for a fanbox artist page. Ideally, I wanted it to be just one URL class, but I had to make two, one for the username-as-subdomain and one for the username-in-path pattern. So, here's what ended up being awkward about the username-as-subdomain URL class: I had to specify that it had a path component that matched ^(posts)?$, rather than just allowing extra path components with the option checkbox. See here:

edit_url_class edit_url_class-2 edit_conversion

This ensured it did not conflict with the community downloader's URL class for an artist page via the API (shown below,) nor with its own API redirect URL.

edit_url_class-3

Maybe it was just a skill issue on my part, but it took me longer than I'd like to finagle the part about matching its own API redirect URL. I didn't really have a problem with avoiding collision with the existing artist page via API URL class. And I assume that even if I allowed extra path components, the post page URL classes (shown below) would take precedence due to their higher specificity. (I haven't been able to test, of course, because I'm not able to define the URL classes the way I want.)

edit_url_class-4 edit_url_class-5

All told, I don't think there's a bug here, but it's possible that I'm just misunderstanding something; what do you think?

floogulinc commented 9 hours ago

I don't think this is an issue but why do you have "should subdomains also match this class" enabled for the api.fanbox.cc class?

Also the url class for posts that has the username in the url probably shouldnt have subdomains checked either...

roachcord3 commented 7 hours ago

Both of those are unchanged from upstream, but you're right, it doesn't make sense, so I'll fix that.