robinst / linkify

Rust library to find links such as URLs and email addresses in plain text, handling surrounding punctuation correctly
https://robinst.github.io/linkify/
Apache License 2.0

URL parsing without protocol #7

Closed: timvisee closed this issue 2 years ago

timvisee commented 4 years ago

I'm looking for something to fetch all URLs from text, and came across this crate. This is brilliant but I'm missing a feature.

I'd like to parse URLs without a protocol prefix as well (that is, with the leading https?:// omitted). For example, the following text would yield example.com and example2.com: Some text that references example.com and (example2.com)

Now, I don't believe this conforms to the specs you've linked, but I'm sure it could be useful for others as well. Parsing these links would be disabled by default, of course, but could be enabled through a toggle or a new LinkKind type. Are you open to supporting something like this?

To provide some context: I'd like this because I want to parse URLs from Telegram messages. As you can imagine these are written like example.com most of the time. Telegram makes these links clickable, but these aren't currently found by linkify.

Thanks for the awesome crate by the way, the current interface is perfect!

robinst commented 4 years ago

Thanks! Yeah I've had requests for this before (not sure if here or for the Java version).

So my main concern with this is probably the problem of too many false positives. One of the ways to reduce that would be shipping a list of common top-level domains, but I don't really wanna have to do that. There's also too many of those nowadays :).

But maybe it could just find any text that matches that format, so foo.anything, and leave it up to the consumer to check if the domain exists or check against a list.
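A minimal sketch of what "find any text that matches that format" could look like, using only the standard library. This is not linkify's scanner: the function names `find_domain_candidates` and `looks_like_domain` are made up for illustration, and splitting on whitespace is a simplifying assumption.

```rust
/// Hypothetical helper: returns every whitespace-delimited token that has
/// the shape `label.label[.label...]` once surrounding punctuation is trimmed.
fn find_domain_candidates(text: &str) -> Vec<&str> {
    text.split_whitespace()
        // Strip leading/trailing punctuation such as "(", ")" or a final ".".
        .map(|tok| tok.trim_matches(|c: char| !c.is_ascii_alphanumeric()))
        .filter(|tok| looks_like_domain(tok))
        .collect()
}

/// Hypothetical shape check: at least two non-empty labels made of ASCII
/// letters, digits, or '-', with no label starting or ending in '-'.
fn looks_like_domain(s: &str) -> bool {
    let labels: Vec<&str> = s.split('.').collect();
    labels.len() >= 2
        && labels.iter().all(|l| {
            !l.is_empty()
                && l.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
                && !l.starts_with('-')
                && !l.ends_with('-')
        })
}
```

A purely shape-based check like this also accepts tokens such as 1.2, which is the false-positive concern raised above; deciding whether a candidate is a real domain is left to the caller.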

Does that sound good? Would you be willing to try to work on that? It shouldn't be too hard.

timvisee commented 4 years ago

> One of the ways to reduce that would be shipping a list of common top-level domains, but I don't really wanna have to do that. There's also too many of those nowadays :).

That sounds like a never-ending keep-my-tld-list-up-to-date problem. Just leave it to the end user.

> But maybe it could just find any text that matches that format, so foo.anything, and leave it up to the consumer to check if the domain exists or check against a list. Does that sound good?

Yes, for my use case that would be perfect.

> Would you be willing to try to work on that? It shouldn't be too hard.

I might give it a try; it's just that I have limited time available for it.

I did take a quick look a few days back, and it looked like you were using the @ and : characters to locate either an email address or a URL in a blob of text. If that is correct, I'm wondering what the right approach for a URL without a protocol would be. Maybe just the . character, although that doesn't distinguish URLs from email addresses.

How about API design, what do you think? A new LinkKind or a parameter to toggle this behavior somewhere?

timvisee commented 4 years ago

> How about API design, what do you think? A new LinkKind or a parameter to toggle this behavior somewhere?

Currently attempting a PR, seems doable without taking too much time.

I'm currently going for a no_proto parameter in LinkFinder and UrlScanner. It defaults to false and has a setter on LinkFinder.
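As a rough illustration of the toggle described here, the sketch below shows an opt-in flag that defaults to false and is changed through a setter. `SketchFinder` and its `accepts` method are hypothetical stand-ins, not linkify's LinkFinder or UrlScanner; only the `no_proto` name comes from the comment above.

```rust
/// Hypothetical stand-in for a finder configuration (not linkify's type).
#[derive(Debug, Clone)]
struct SketchFinder {
    /// When true, candidates without a `scheme://` prefix are accepted too.
    no_proto: bool,
}

impl SketchFinder {
    /// The flag is off by default, so existing behavior is unchanged.
    fn new() -> Self {
        SketchFinder { no_proto: false }
    }

    /// Setter returning `&mut Self` so calls can be chained.
    fn no_proto(&mut self, enabled: bool) -> &mut Self {
        self.no_proto = enabled;
        self
    }

    /// Illustrative decision only: a candidate with a scheme always passes,
    /// one without a scheme passes only when `no_proto` is enabled.
    fn accepts(&self, candidate: &str) -> bool {
        candidate.contains("://") || self.no_proto
    }
}
```

With this shape, callers who never touch the setter keep getting scheme-only matches.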

robinst commented 4 years ago

Yeah, new LinkKind I think. And yeah, the trigger character would have to be a dot (.). There are some interesting conflicts in that case, e.g.:

foo.bar@example.org

In that case, . would trigger first and we'd have to reject it as a domain, but then still allow @ to detect it as an email. For that to work, we need to make sure things like @ or + are not allowed after a domain.
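The rejection rule can be sketched roughly as follows. This is an assumption-laden illustration, not linkify's implementation: `domain_at_dot` is a hypothetical function, and also rejecting a candidate that is directly preceded by @ or + is an extra assumption, added so that the example.org part of the email is not reported as a separate domain.

```rust
/// Hypothetical helper: given the byte index of a '.', expand left and right
/// over domain characters and return the candidate, unless it is directly
/// bordered by '@' or '+' (which suggests it is part of an email address).
fn domain_at_dot(text: &str, dot: usize) -> Option<&str> {
    let bytes = text.as_bytes();
    if bytes.get(dot) != Some(&b'.') {
        return None;
    }
    let is_dom = |b: u8| b.is_ascii_alphanumeric() || b == b'-' || b == b'.';

    // Expand to the full run of domain characters around the trigger.
    let mut start = dot;
    while start > 0 && is_dom(bytes[start - 1]) {
        start -= 1;
    }
    let mut end = dot + 1;
    while end < bytes.len() && is_dom(bytes[end]) {
        end += 1;
    }
    // Trailing dots are treated as sentence punctuation, not part of the domain.
    while end > start && bytes[end - 1] == b'.' {
        end -= 1;
    }

    // Reject when '@' or '+' touches the candidate on either side.
    let before = if start > 0 { Some(bytes[start - 1]) } else { None };
    let after = bytes.get(end).copied();
    let conflict = |b: Option<u8>| matches!(b, Some(b'@') | Some(b'+'));
    if conflict(before) || conflict(after) {
        return None;
    }

    // Require at least two non-empty labels.
    let candidate = &text[start..end];
    let mut labels = candidate.split('.');
    if candidate.contains('.') && labels.all(|l| !l.is_empty()) {
        Some(candidate)
    } else {
        None
    }
}
```

Returning None at the dot leaves the text unconsumed, so a later @ trigger would still be free to detect the whole span as an email address.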

Cldfire commented 3 years ago

> One of the ways to reduce that would be shipping a list of common top-level domains, but I don't really wanna have to do that. There's also too many of those nowadays :).

The public suffix list could perhaps be helpful here (https://publicsuffix.org/, https://github.com/addr-rs/psl)
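To illustrate how a consumer might filter candidates against a suffix list, here is a standard-library-only sketch. The function name `has_known_suffix` and the tiny hard-coded list are assumptions for the example; it does not use the psl crate, and the real public suffix list also has wildcard and exception rules that this ignores.

```rust
/// Hypothetical consumer-side check: true if some dot-separated tail of
/// `domain` appears in `suffixes` (compared case-insensitively).
fn has_known_suffix(domain: &str, suffixes: &[&str]) -> bool {
    let lower = domain.to_ascii_lowercase();
    // For "example.co.uk" this tries "co.uk" and then "uk".
    lower.match_indices('.').any(|(i, _)| {
        let tail = &lower[i + 1..];
        suffixes.iter().any(|s| *s == tail)
    })
}
```

A caller could run this over candidates such as foo.anything and keep only those whose suffix is recognized.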