Generic URL Scraper

hackel commented 7 years ago

It would be nice if this implemented a generic URL scraper of some kind, so that each individual site didn't have to be coded manually. Case in point, the link to this page from AMO:

https://outgoing.prod.mozaws.net/v1/d6c54b48bd1142d3dee6387e3d3feabc610d77ab48590ae0a43e6c20d93db01e/https%3A//github.com/idlewan/link_cleaner

If there was a way to recognize the actual URL here automatically, that would be wonderful. Similarly for Google:

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwiU4JDYyZvUAhUoxFQKHf5IAEAQFggsMAE&url=https%3A%2F%2Fgithub.com%2Fidlewan%2Flink_cleaner&usg=AFQjCNHLsiLWuJifp8qBynFPaicSw0gLGw&sig2=imkIeC-CN_z-8x5NgFr4TQ

I'm not sure of the best logic to avoid breakage. It's a very complicated issue. As a start: split the URL's query string into its parameters, URL-decode each value, and test each one against the following regex. If a value matches, navigate to it instead:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

(From Appendix B of RFC 3986, https://www.ietf.org/rfc/rfc3986.txt)

Of course, the first URL I gave embeds the target in the path rather than in a query parameter, so if there's still no match after that, perhaps running the regex against the entire URL would also be appropriate.
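
For illustration, here is a minimal Python sketch of the steps described above. It is not taken from link_cleaner; the helper names `looks_like_web_url` and `extract_target` are hypothetical. Two assumptions are worth flagging. First, every group in the Appendix B regex is optional, so it matches any string; the sketch therefore also requires an `http`/`https` scheme and a non-empty authority. Second, running the regex against the whole URL would only match the outer URL, so the fallback here searches the decoded path for an embedded `http(s)://` prefix instead.

```python
import re
from urllib.parse import urlsplit, parse_qsl, unquote

# Regex from RFC 3986, Appendix B. All groups are optional, so it matches
# any string; group 2 is the scheme and group 4 is the authority.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def looks_like_web_url(candidate):
    """Assumed extra check: http(s) scheme plus a non-empty authority."""
    m = URI_RE.match(candidate)
    return bool(m and m.group(2) in ("http", "https") and m.group(4))


def extract_target(url):
    """Hypothetical helper: return an embedded destination URL, or None."""
    parts = urlsplit(url)
    # Step 1: split the query into parameters; parse_qsl also URL-decodes.
    for _name, value in parse_qsl(parts.query, keep_blank_values=True):
        if looks_like_web_url(value):
            return value
    # Step 2 (assumption): look for an embedded URL in the decoded path.
    m = re.search(r"https?://.+", unquote(parts.path))
    if m and looks_like_web_url(m.group(0)):
        return m.group(0)
    return None
```

Under these assumptions, both example URLs above resolve to https://github.com/idlewan/link_cleaner, and a URL with no embedded target yields `None`. The sketch does nothing about the breakage concern: any site that legitimately passes a URL as a parameter would also be rewritten.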

idlewan commented 7 years ago

There is no generic way of doing redirects that is used by all websites. To prevent website breakage, it's better to first check and approve each needed URL individually: how it works and which search params it uses.
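
As a rough illustration of the per-site approach, a rule table might look like the Python sketch below. This is not the link_cleaner implementation; `REDIRECT_RULES` and `clean_known_redirect` are hypothetical names, and the only entry is derived from the Google example URL earlier in this thread.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical rule table: host -> (redirect path, query parameter that
# holds the target). Each entry is meant to be checked by hand first.
REDIRECT_RULES = {
    "www.google.com": ("/url", "url"),
}


def clean_known_redirect(url):
    """Return the redirect target for an approved site, else the URL unchanged."""
    parts = urlsplit(url)
    rule = REDIRECT_RULES.get(parts.hostname or "")
    if rule is None:
        return url  # unknown site: leave it alone to avoid breakage
    path, param = rule
    if parts.path != path:
        return url  # known site, but not its redirect endpoint
    values = parse_qs(parts.query).get(param)  # parse_qs URL-decodes values
    return values[0] if values else url
```

The trade-off is the one described above: URLs from unlisted sites pass through untouched, so nothing breaks by accident, but every new redirector needs its own entry.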

ghost commented 7 years ago

It'd be nice to have one for Google outgoing URLs, though. Could we put that on the list?

idlewan commented 7 years ago

Why not. Please create another issue or pull request for that purpose.

idlewan / link_cleaner

Generic URL Scraper #9