robinst / linkify

Rust library to find links such as URLs and email addresses in plain text, handling surrounding punctuation correctly
https://robinst.github.io/linkify/
Apache License 2.0
ASCII link inside unicode text without space #49

Closed serega closed 1 year ago

serega commented 1 year ago

Perhaps this is an edge case, but I have the following test case, which currently fails.

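    use linkify::LinkFinder;

    #[test]
    pub fn test_ascii_link_in_unicode() {
        // Chinese text immediately followed by an ASCII domain, with no separating space.
        let doc = "秋叶值波有88块送到您的账号了请您登录领取地址app6699.tv";
        let mut finder = LinkFinder::new();
        // Also detect URLs that have no scheme such as "https://".
        finder.url_must_have_scheme(false);
        let mut links = finder.links(doc);
        let next = links.next().unwrap();
        // `start()` and `end()` are byte offsets into `doc`.
        let link = &doc[next.start()..next.end()];
        assert_eq!("app6699.tv", link);
    }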

IRIs are generally not widely used. Maybe IRI support could be optional in LinkFinder?

robinst commented 1 year ago

Hmm yeah, that is tricky. I'm not sure it would be reasonable to expect that to work. For example, with an option to turn off IRIs, would ünicöde.com just detect de.com? That just seems confusing. It's a bit less clear-cut with different writing systems, but I don't think I want to build that distinction in.

serega commented 1 year ago

I agree it is tricky, and yes, de.com is my desired behavior, although it would not be correct in this case. I work with international short messages, many of which contain links, but they are all URIs. In my case, the chance of an incorrectly parsed URI is higher than the chance of a missed IRI. I use a patched version (in our company git) that implements the expected behavior. I understand you don't want to support this use case, so I will continue using the patched version. In any case, great library!
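To make the trade-off concrete, here is a minimal standard-library sketch of the ASCII-only boundary rule being discussed. It is not linkify's implementation or the patch mentioned above; the function name `trailing_domain` and its `allow_non_ascii` flag are hypothetical, and it only looks at a candidate that ends at the end of the text.

```rust
/// Returns the longest suffix of `text` that consists only of domain-like
/// characters. Hypothetical helper for illustration, not part of linkify.
fn trailing_domain(text: &str, allow_non_ascii: bool) -> &str {
    let is_domain_char = |c: char| {
        if c.is_ascii() {
            // ASCII letters, digits, hyphen and dot may appear in a host name.
            c.is_ascii_alphanumeric() || c == '-' || c == '.'
        } else {
            // Non-ASCII letters and digits only count when IRIs are allowed.
            allow_non_ascii && c.is_alphanumeric()
        }
    };

    // Walk backwards over (byte offset, char) pairs and remember the offset
    // of the earliest character that still belongs to the candidate.
    let mut start = text.len();
    for (idx, c) in text.char_indices().rev() {
        if !is_domain_char(c) {
            break;
        }
        start = idx;
    }
    &text[start..]
}
```

Under these assumptions, `trailing_domain("ünicöde.com", false)` yields `de.com`, which is the confusing case raised above, while the same call on the test string from the first comment yields `app6699.tv`. With `allow_non_ascii` set to `true`, the preceding non-ASCII letters are absorbed into the candidate in both cases.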

robinst commented 1 year ago

Would your patch be simple to implement as an optional behavior that can be turned on/off? Maybe it would be good to include it.

serega commented 1 year ago

Sure. I will make the patch configurable and submit a pull request.

robinst commented 1 year ago

Thanks for doing the initial work. I've finished it and merged it now, and it will be included in the next release.

robinst commented 1 year ago

@serega Change released in version 0.10.0: https://github.com/robinst/linkify/blob/main/CHANGELOG.md#0100---2023-06-24