robinst / autolink-java

Java library to extract links (URLs, email addresses) from plain text; fast, small and smart
MIT License
207 stars 40 forks source link

adding the possibility for extracting www-based URLs as well #8

Closed MTDdk closed 7 years ago

MTDdk commented 7 years ago

Have you had some time to look at the updated pull request?

robinst commented 7 years ago

Sorry, thanks for pinging :). I don't think we should also accept WWW.example.org, that doesn't seem to be something someone would type that often. And if they really want that, they can add http:// in front. Would you mind removing it? The rest looks good now!

MTDdk commented 7 years ago

I have removed the check for uppercase W.

I also have a suggestion to rename UrlScanner to HttpScanner and introduce an extra link type, so we have a LinkType.HTTP, LinkType.WWW and LinkType.URL, whereas the latter includes to two former. This will not exactly break backwards compatibility, but just add extra functionality to the same methods.

What do you say?

robinst commented 7 years ago

HttpScanner wouldn't be accurate because it also scans https or ssh or other schemes. I think leaving it as now is good.

What would be gained by adding another LinkType? I can see it lead to bugs because a user could get confused when checking the type of an extracted LinkSpan. And what happens when you tell it to extract type URL and HTTP, would be redundant.

MTDdk commented 7 years ago

My thinking was to rename UrlScanner -> HttpScanner and WwwUrlScanner -> WwwScanner I think this would more appropriately tell what the classes do.

The addition to LinkType just needed to reflect this and thus have an additional type. The LinkType.URL would then be used when you wanted to extract ALL links (http and www) but not email.

However, I do not mind, if you find this confusing to the user. It was just a thought. I could create a new pull request, so you could see what I meant

robinst commented 7 years ago

The thing that might confuse users is that www.example.org is not a URL at all, see here: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax

You can already extract URL and WWW links but not emails by configuring the builder like this:

LinkExtractor linkExtractor = LinkExtractor.builder()
        .linkTypes(EnumSet.of(LinkType.URL, LinkType.WWW))
        .build();

Given that, I'm not sure what having an additional LinkType that is a combination of other link types would improve.

robinst commented 7 years ago

Merged. Thanks a lot for submitting the pull request and seeing it through :)!

I noticed that the README also needs a section for the new functionality. Would you mind submitting another PR for that?

robinst commented 7 years ago

Released in 0.6.0 now :tada:: https://github.com/robinst/autolink-java/releases/tag/autolink-0.6.0