openequella / openEQUELLA

Core openEQUELLA sources
https://openequella.github.io/
Apache License 2.0

Url checker: Marks any urls containing # characters as bad #1187

Open mrblippy opened 5 years ago

mrblippy commented 5 years ago

Describe the bug The URL checking results can be seen in the referencedurl table. Here is an example DB entry: [screenshot of a referencedurl table row]

The # character is a valid character to use in a URL, so we shouldn't be marking URLs containing it as bad.

Sounds like it could be an encoding issue or something like that.

To Reproduce Steps to reproduce the behavior:

  1. Open a contribution wizard
  2. Create a URL attachment with a URL that you know works and that contains a # character. I used https://smallbusiness.chron.com/put-symbol-url-48344.html#heading1
  3. Save and publish the item
  4. When you view the attachment details, you will see the URL has been marked as bad, with a corresponding entry in the referencedurl table

Expected behavior URLs that contain # characters should be marked as good, provided they respond with a valid HTTP response code.

Platform:

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

edalex-ian commented 4 years ago

This is fundamentally due to oEQ having its own URL validator in the URLCheckerService - isURL(): https://github.com/openequella/openEQUELLA/blob/develop/Source/Plugins/Core/com.equella.core/src/com/tle/core/url/URLCheckerService.java#L389

Why this was done has been lost to history, but I propose we simply remove it and utilise the standard java.net.URL to validate, e.g.

import java.net.URL;

String url = "https://smallbusiness.chron.com/put-symbol-url-48344.html#heading1";
new URL(url); // parses successfully; the fragment (#heading1) is fine

If the URL is not valid, a MalformedURLException is thrown, which also supplies much more meaningful messaging to be stored.

Note: the isURL() method is only called from that one place, so it is probably best to remove it and rework the call site to use the above, handling that exception rather than the current generic IllegalArgumentException.
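
For illustration, a minimal sketch of how that reworked check could look (the validate() helper and its return-a-message shape are hypothetical, just to show the idea, not the existing URLCheckerService API):

import java.net.MalformedURLException;
import java.net.URL;

public class UrlValidationSketch {

    // Hypothetical replacement for isURL(): returns null if the URL parses
    // cleanly, or the parser's error message if it is malformed.
    static String validate(String url) {
        try {
            new URL(url);
            return null;
        } catch (MalformedURLException e) {
            // e.getMessage() is specific, e.g. "unknown protocol: htp"
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A fragment (#heading1) is legal and parses without error.
        System.out.println(validate(
            "https://smallbusiness.chron.com/put-symbol-url-48344.html#heading1")); // null
        // A bogus scheme fails with a descriptive message.
        System.out.println(validate("htp://example.com")); // unknown protocol: htp
    }
}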

abidingotter commented 4 years ago

@edalex-ian it could be because the URL constructor would allow arbitrary protocols, or it wasn't good enough for some other reason. It should be straightforward to add a hash component to the regex?
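
Purely for illustration (this pattern is a hypothetical stand-in, not the actual regex in URLCheckerService), allowing a fragment would just mean an optional #... group at the end, e.g.:

import java.util.regex.Pattern;

public class FragmentRegexSketch {
    // Hypothetical pattern: http(s) scheme, host, optional path, optional #fragment.
    private static final Pattern URL_WITH_OPTIONAL_FRAGMENT =
        Pattern.compile("^https?://[\\w.-]+(?:/[\\w./%?=&-]*)?(?:#[\\w-]*)?$");

    public static void main(String[] args) {
        String url = "https://smallbusiness.chron.com/put-symbol-url-48344.html#heading1";
        System.out.println(URL_WITH_OPTIONAL_FRAGMENT.matcher(url).matches()); // true
    }
}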

edalex-ian commented 4 years ago

hmmm, that could be the reasoning. But I don't see great value in maintaining our own just for the sake of that, especially when compared to the better error reporting from URL (which even includes checking for known protocols, or we can add a check to simply ensure we're only accepting http[s]).
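
Something like this, for illustration (isAcceptableUrl() is just a hypothetical name for the reworked check):

import java.net.MalformedURLException;
import java.net.URL;

public class HttpOnlySketch {
    // Hypothetical helper: parse with java.net.URL, then restrict the scheme.
    static boolean isAcceptableUrl(String url) {
        try {
            String protocol = new URL(url).getProtocol();
            return "http".equals(protocol) || "https".equals(protocol);
        } catch (MalformedURLException e) {
            return false;
        }
    }
}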

abidingotter commented 4 years ago

I guess it should be easy enough to revert if we remember the reason :)

scorpsteals commented 4 years ago

Is there a way to disable the bad URL checker? We are getting some bad URL notifications even though the URLs work fine in the browser.