robinst / linkify

Rust library to find links such as URLs and email addresses in plain text, handling surrounding punctuation correctly
https://robinst.github.io/linkify/
Apache License 2.0

String with URL and email ignores finder.url_must_have_scheme? #44

Closed thomasgloe closed 2 years ago

thomasgloe commented 2 years ago

Hi,

I tested the example from the docs and it works great:

use linkify::LinkFinder;

let input = "Look, no scheme: example.org/foo";
let mut finder = LinkFinder::new();

// true by default
finder.url_must_have_scheme(false);

let links: Vec<_> = finder.links(input).collect();
assert_eq!(links[0].as_str(), "example.org/foo");

However, when the input string is changed to "Look, no scheme: example.org/foo email@foo.com", the URL example.org/foo is no longer detected. The demo website https://robinst.github.io/linkify/ shows the same behavior with that input string.

Is this the expected outcome or a bug? Is there an additional switch to detect the URL even if the string also contains an email address?

Related to https://github.com/robinst/linkify/pull/8

thomasgloe commented 2 years ago

As a workaround, it is possible to split the string into spans and run the finder on each span separately:

use linkify::LinkFinder;

let input = "Look, no scheme: example.org/foo Email: email@foo.com";
let mut finder = LinkFinder::new();

finder.url_must_have_scheme(false);

let spans: Vec<_> = finder.spans(input).collect();
for span in spans {
  let links: Vec<_> = finder.links(span.as_str()).collect();
  for link in links {
    println!(" - link: {}", link.as_str());
  }
}

mre commented 2 years ago

Not sure if it helps but can you also check against https://github.com/robinst/linkify/pull/43?

thomasgloe commented 2 years ago

Hm, I've switched to a simple regex approach, because I observed additional issues in my test case and even the workaround above did not fix all of them. The URLs in my data are not too complicated.
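
The regex itself is not shown in this thread. For readers wondering what a comparably simple approach could look like, here is a rough, std-only sketch. It is not the commenter's actual code and not how linkify works. It assumes links are separated by whitespace, and the names Kind, find_simple_links, classify and looks_like_host are made up for illustration.

```rust
/// Kind of candidate found by the naive scanner (hypothetical type, not part of linkify).
#[derive(Debug, PartialEq)]
enum Kind {
    Url,
    Email,
}

/// Split the input on whitespace, trim surrounding punctuation from each
/// token, and keep the tokens that look like a scheme-less URL or an email.
fn find_simple_links(input: &str) -> Vec<(Kind, &str)> {
    input
        .split_whitespace()
        // Strip leading/trailing punctuation such as "(", ")", "," or ":",
        // but keep "/" so that a trailing path separator survives.
        .map(|tok| tok.trim_matches(|c: char| !c.is_alphanumeric() && c != '/'))
        .filter_map(|tok| classify(tok).map(|kind| (kind, tok)))
        .collect()
}

/// Classify a single trimmed token, or return None if it is not a link.
fn classify(tok: &str) -> Option<Kind> {
    if let Some((local, domain)) = tok.split_once('@') {
        // Email: non-empty local part plus a domain that looks like a host.
        if !local.is_empty() && looks_like_host(domain) {
            return Some(Kind::Email);
        }
        return None;
    }
    // Scheme-less URL: a host, optionally followed by a path.
    let host = tok.split('/').next().unwrap_or("");
    if looks_like_host(host) {
        Some(Kind::Url)
    } else {
        None
    }
}

/// Very rough host check: at least two dot-separated labels made of ASCII
/// letters, digits or "-", with a purely alphabetic last label.
fn looks_like_host(s: &str) -> bool {
    let labels: Vec<&str> = s.split('.').collect();
    labels.len() >= 2
        && labels.iter().all(|l| {
            !l.is_empty() && l.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
        })
        && labels
            .last()
            .map_or(false, |tld| tld.chars().all(|c| c.is_ascii_alphabetic()))
}
```

Calling find_simple_links on the two inputs from this issue yields one URL and one email each. The trade-off is precision: an abbreviation such as "e.g." would be reported as a URL, and a link glued to other text without whitespace is missed, which is the kind of case a dedicated library is meant to handle.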

robinst commented 2 years ago

Yep, #43 fixes that problem too. I'll add it as a test case.

@thomasgloe can you provide the additional problematic cases that you've found here?

thomasgloe commented 2 years ago

Code example:

use linkify::{Link, LinkFinder};

fn find_links(input: &str) -> Vec<Link> {
    let mut finder = LinkFinder::new();
    finder.url_must_have_scheme(false);

    let mut links = Vec::new();
    let spans: Vec<_> = finder.spans(input).collect();
    for span in spans {
        // added second finder, to test if this makes any difference - it does not.
        let mut finder2 = LinkFinder::new();
        finder2.url_must_have_scheme(false);
        let mut tlinks: Vec<_> = finder2.links(span.as_str()).collect();
        links.append(&mut tlinks);
    }

    links
}

fn main() {
    // multiline input string
    let input = "Web:
www.foobar.co
E-Mail:
      bar@foobar.co (bla bla bla)";

    let links = find_links(input);
    for link in links {
        println!(" - link: {}", link.as_str());
    }
}

results in:

 - link: Web:
www.foobar.co
 - link: bar@foobar.co

But I would expect:

 - link: www.foobar.co
 - link: bar@foobar.co

thomasgloe commented 2 years ago

Indeed, I've checked with linkify = { git = "https://github.com/robinst/linkify", branch = "check-domains" } and the problematic case above seems to work.

robinst commented 2 years ago

Good to hear! To be honest, the implementation of the url_must_have_scheme(false) mode had a few problems before. With the branch, its logic is now unified with the others and much cleaner.

I'm releasing the change soon.

robinst commented 2 years ago

Alright, released as 0.9.0 🎉: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11