hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/

URL import fails on specific subdomain #1176

Closed boobayayo closed 1 year ago

boobayayo commented 2 years ago

Hydrus version

488d

Operating system

Windows 11

Install method

Third party (AUR, Docker, Chocolatey, etc. Specify in comments)

Install and OS comments

scoop package manager; same behavior on a regular extract

Bug description and reproduction

Background

Since parsing of Instagram stories does not work in Hydrus directly (JS stuff, I assume), I've built a userscript that generates a URL in-browser, containing some metadata and the media download URL (image/video) as URL parameters (2. below).

I then have a URL class matching these URLs, including the custom parameters. This URL class is linked to a parser which parses the metadata and media URL using regex replacement on the URL context variable. The media URL (3.) is then pursued, recognized as a file, downloaded, and gets the parsed tags attached to it.

This has all worked perfectly before. (last used on 2022-04-22)

Problem

Now, when using URL import, Hydrus seems to append the parsed media URL to the base Instagram story URL (4.). This behavior suggests to me that it thinks the media URL is not a valid full URL (like when parsing a relative path such as /assets/picture.png). When testing the parser manually in the parser configuration screen, this does not happen (5.).

More info

For testing, I have modified my parser to output a heavily simplified URL, and even with the entire path removed, URL import still fails with the same error (6. and 7.).

Another parser that outputs similar cdninstagram urls works flawlessly (regular instagram post parser).

Sorry for the wall of text, I might just be missing something obvious here, but I can't figure it out.

Log output

1. original story url:
https://www.instagram.com/stories/jonsetter/2860720946514418825/

2. generated url sent to hydrus:
https://www.instagram.com/stories/jonsetter/2860720946514418825?mediaUrl=https://scontent-dus1-1.cdninstagram.com/v/t51.2885-15/287715666_312179114333180_6690026352780841598_n.jpg?stp=dst-jpg_e35&cb=9ad74b5e-88ad7ee8&_nc_ht=scontent-dus1-1.cdninstagram.com&_nc_cat=108&_nc_ohc=DTuw5_uDzEIAX-PLzEP&edm=ANmP7GQBAAAA&ccb=7-5&ig_cache_key=Mjg2MDcyMDk0NjUxNDQxODgyNQ%3D%3D.2-ccb7-5&oh=00_AT_p8Exu4hxnUY0uNvyd-P6NxlgqCk0wpsBaw8HL_MR3oQ&oe=62AC740A&_nc_sid=276363&timestamp=2022-06-14T22:08:50.000Z&user=jonsetter

3. media url:
https://scontent-dus1-1.cdninstagram.com/v/t51.2885-15/287715666_312179114333180_6690026352780841598_n.jpg?stp=dst-jpg_e35&cb=9ad74b5e-88ad7ee8&_nc_ht=scontent-dus1-1.cdninstagram.com&_nc_cat=108&_nc_ohc=DTuw5_uDzEIAX-PLzEP&edm=ANmP7GQBAAAA&ccb=7-5&ig_cache_key=Mjg2MDcyMDk0NjUxNDQxODgyNQ%3D%3D.2-ccb7-5&oh=00_AT_p8Exu4hxnUY0uNvyd-P6NxlgqCk0wpsBaw8HL_MR3oQ&oe=62AC740A&_nc_sid=276363

4. Hydrus Note when using URL import:
Found a URL--https://www.instagram.com/stories/jonsetter/2860720946514418825/https%3A%2F%2Fscontent-dus1-1.cdninstagram.com%2Fv%2Ft51.2885-15%2F287715666_312179114333180_6690026352780841598_n.jpg%3Fstp%3Ddst-jpg_e35&cb=9ad74b5e-88ad7ee8&_nc_ht=scontent-dus1-1.cdninstagram.com&_nc_cat=108&_nc_ohc=DTuw5_uDzEIAX-PLzEP&edm=ANmP7GQBAAAA&ccb=7-5&ig_cache_key=Mjg2MDcyMDk0NjUxNDQxODgyNQ%3D%3D.2-ccb7-5&oh=00_AT_p8Exu4hxnUY0uNvyd-P6NxlgqCk0wpsBaw8HL_MR3oQ&oe=62AC740A&_nc_sid=276363--but could not parse it: Could not find a parser for instagram story URL Class!

5. manual parsing:
*** 1 RESULTS BEGIN ***

tag: date:2022-06-14
downloadable/pursuable url (priority 50): https://scontent-dus1-1.cdninstagram.com/v/t51.2885-15/287715666_312179114333180_6690026352780841598_n.jpg?stp=dst-jpg_e35&cb=9ad74b5e-88ad7ee8&_nc_ht=scontent-dus1-1.cdninstagram.com&_nc_cat=108&_nc_ohc=DTuw5_uDzEIAX-PLzEP&edm=ANmP7GQBAAAA&ccb=7-5&ig_cache_key=Mjg2MDcyMDk0NjUxNDQxODgyNQ%3D%3D.2-ccb7-5&oh=00_AT_p8Exu4hxnUY0uNvyd-P6NxlgqCk0wpsBaw8HL_MR3oQ&oe=62AC740A&_nc_sid=276363
tag: source:instagram
source time: 2022-06-14 22:08:50
associable/source url (priority 50): https://www.instagram.com/stories/jonsetter/2860720946514418825
tag: person:jonsetter

*** RESULTS END ***

6. simplified media url parsed for testing:
https://scontent-dus1-1.cdninstagram.com

7. Hydrus Note with simplified url:
Found a URL--https://www.instagram.com/stories/jonsetter/2860720946514418825/https%3A%2F%2Fscontent-dus1-1.cdninstagram.com--but could not parse it: Could not find a parser for instagram story URL Class!

hydrusnetwork commented 2 years ago

Thank you for this report. I am sorry for the delay. I am back from my vacation and now again trying to catch up on github issues.

I have made some improvements to some URL parsing logic in the past couple of months, so I guess, absent other changes in this downloader, this is what has broken things for you since 2022-04-22. I think I fixed some URL Class ordering, so that more complicated URL classes would be matched before simpler ones. As an example:

https://site.com/123456?display=large

is considered more complicated than

https://site.com/123456

Maybe this affects you, maybe it is something else and I am just forgetting what I have changed. I note that URL 2 is basically URL 1 but with some parameters, so maybe something is getting confused here. It might be worth checking your 'manage url class links' dialog, just to make sure your URL classes are linked up to the parsers you expect and there aren't any 'instagram parser test (do not use)' spare objects that somehow got linked up without you realising. It is also worth pasting these URLs into the test area in 'manage url classes', just to make sure they are being matched to the URL classes you expect.

The second question is the URL append in part 4. Your idea about it being /assets/picture.png sounds correct, but I'm not sure why the URL would be seen this way--it doesn't seem to have any weird characters or anything that might throw the parser off. But I do note that it is added with URL-encoded characters. Rather than https://, it is adding:

https%3A%2F%2Fscontent-dus1-1.cdninstagram.com%2Fv%2Ft51.2885-15%2F287715666_312179114333180_6690026352780841598_n.jpg%3Fstp%3Ddst-jpg_e35&cb=9ad74b5e-88ad7ee8&_nc_ht=scontent-dus1-1.cdninstagram.com&_nc_cat=108&_nc_ohc=DTuw5_uDzEIAX-PLzEP&edm=ANmP7GQBAAAA&ccb=7-5&ig_cache_key=Mjg2MDcyMDk0NjUxNDQxODgyNQ%3D%3D.2-ccb7-5&oh=00_AT_p8Exu4hxnUY0uNvyd-P6NxlgqCk0wpsBaw8HL_MR3oQ&oe=62AC740A&_nc_sid=276363

With %3A and %2Fs going on. So it feels like something is being encoded (or maybe the parsed URL is not decoded?) before it hits the last step of the Content Parser. Not sure why the manual parsing would work ok though. :/
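As an illustration of why an encoded URL would end up appended to the base, here is a small sketch using Python's standard `urllib.parse.urljoin`. Whether Hydrus resolves parsed URLs this way internally is an assumption, not something confirmed in this thread. The sketch only shows the general rule: a string whose `://` has been percent-encoded no longer has a recognisable scheme, so it is treated as a relative path.

```python
from urllib.parse import urljoin

# Base is the original story URL (1.); the other two are the simplified
# media URL from (6.), once as-is and once percent-encoded.
base = "https://www.instagram.com/stories/jonsetter/2860720946514418825/"
absolute = "https://scontent-dus1-1.cdninstagram.com"
encoded = "https%3A%2F%2Fscontent-dus1-1.cdninstagram.com"

# An absolute URL replaces the base entirely.
joined_absolute = urljoin(base, absolute)

# The encoded form has no ':' and so no scheme; it is resolved as a
# relative path segment underneath the base.
joined_encoded = urljoin(base, encoded)

print(joined_absolute)
print(joined_encoded)
```

The second result has the same shape as the URL in the Hydrus note under 7., which is consistent with the media URL being percent-encoded somewhere before the join.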

Do you think you could bundle all these objects into a png or some JSON and post them here, so I can test them on my end? If you haven't done it before, hit up network->downloaders->export downloaders and then add all your instagram objects to it, and then export to png and post here.

boobayayo commented 2 years ago

Thank you for looking into this.

My URL class links look correct. I've reapplied them just to be sure, but that has not changed anything. [image] (The regular story post URL classes, first and third entry, are not linked since I only add them as associated URLs.)

The "manage url classes" dialog also reliably identifies all relevant urls properly.

Also for completeness' sake, v490 did not change any of this behavior.

Here's all the Instagram story stuff as png (example links are nsfw btw): insta stories

edit: Oh and the regular Instagram post parser. (working fine)(needs login cookies) instagram post downloader

boobayayo commented 1 year ago

fyi I worked around this by percent-encoding the url parameter in my userscript and then decoding it in the url parser.
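A rough Python sketch of that workaround follows. The real userscript runs in the browser and the decoding happens inside the Hydrus parser, so this is only an illustration of the encode/decode round trip. The shortened media URL (`pic.jpg` with two parameters) is a made-up stand-in for the real one.

```python
from urllib.parse import quote, urlsplit, parse_qs

story = "https://www.instagram.com/stories/jonsetter/2860720946514418825"
# Hypothetical shortened media URL, standing in for the real one in (3.).
media = "https://scontent-dus1-1.cdninstagram.com/v/pic.jpg?stp=dst-jpg_e35&oe=62AC740A"

# Naive embedding: the media URL's own '?' and '&' leak into the outer query.
naive = f"{story}?mediaUrl={media}&user=jonsetter"

# Workaround: percent-encode the whole media URL, including ':', '/', '?', '&'.
encoded = f"{story}?mediaUrl={quote(media, safe='')}&user=jonsetter"

naive_params = parse_qs(urlsplit(naive).query)
encoded_params = parse_qs(urlsplit(encoded).query)

# In the naive form the media URL is cut off at its first '&' and its
# remaining parameters become parameters of the outer URL.
print(naive_params["mediaUrl"][0])
print(sorted(naive_params))

# In the encoded form, decoding the parameter recovers the media URL intact.
print(encoded_params["mediaUrl"][0])
```

Encoding the nested URL keeps the outer URL unambiguous for anything that splits or normalises query parameters, which may be why it sidesteps the problem described above.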