mvdan / xurls

Extract urls from text
BSD 3-Clause "New" or "Revised" License

Issue with Email Addresses #53

JimmyGalar closed 9 months ago

JimmyGalar commented 3 years ago

I am using the xurls code to pull out possible urls from a message body string. The urls can be in either strict or relaxed format so I need to use the relaxed method of xurls to find the possible urls in the string. The issue is that email addresses can also be in the string and the relaxed method of xurls is pulling those out too.

For example my string might be: "Hello from http://www.google.com, please check the www.test.com webpage for further information. If you have any questions please email John.Smith@test.com or Testing@test.com"

What I would like xurls to do is pull just http://www.google.com and www.test.com.

Instead it pulls the 2 urls, and John.Sm, test.com, test.com. Is there anything that can be done so that only urls are pulled?

mvdan commented 3 years ago

You can fairly easily filter out emails from the result. For example, if a result contains @ but does not contain ://, it's an email.

If we just returned URLs, the quality of the results would be worse. The input john.smith@test.com would give you test.com, for example.

JimmyGalar commented 3 years ago

I did that as a workaround, but I was hoping there was a pattern or systematic way to exclude emails, rather than removing them from the string before it is interpreted.

mvdan commented 3 years ago

I'm happy to discuss API ideas if you have any, but remember that it's unlikely we can "disable" matching emails in the relaxed regexp.

mvdan commented 3 years ago

What would you think of adding a new top-level API like:

func IsEmail(string) bool

Then, you could iterate over your xurls.Relaxed results and use xurls.IsEmail to filter emails as needed. In the future we could write other similar helper funcs, like HasScheme.
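To make the proposal concrete, here is a rough sketch of how such helpers might behave on the caller's side. The lowercase `isEmail` and `hasScheme` below are local, hypothetical stand-ins written for illustration; they are not xurls functions, and a real implementation could well use stricter rules than these substring checks:

```go
package main

import (
	"fmt"
	"strings"
)

// hasScheme is a hypothetical stand-in for the proposed HasScheme helper.
// It only checks for the "://" separator, which is a simplification.
func hasScheme(s string) bool {
	return strings.Contains(s, "://")
}

// isEmail is a hypothetical stand-in for the proposed IsEmail helper.
// It treats any match with "@" and no scheme as an email address.
func isEmail(s string) bool {
	return strings.Contains(s, "@") && !hasScheme(s)
}

func main() {
	// Stand-in for the results of a relaxed match over a message body.
	results := []string{
		"http://www.google.com",
		"www.test.com",
		"Testing@test.com",
	}
	for _, r := range results {
		switch {
		case isEmail(r):
			fmt.Println("email:      ", r)
		case hasScheme(r):
			fmt.Println("strict URL: ", r)
		default:
			fmt.Println("relaxed URL:", r)
		}
	}
}
```

Keeping the predicates separate from the matcher lets each caller decide whether to drop emails, keep them, or handle them differently.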

mvdan commented 2 years ago

Friendly ping @JimmyGalar :)

mvdan commented 10 months ago

I thought about this briefly and pushed https://github.com/mvdan/xurls/commit/09d66fb475fb3e22da5d04135c6c168f1038d40b to master. What do you all think?

mvdan commented 9 months ago

I will assume that the fix in master is enough. Feel free to leave a comment or file a new issue if you disagree.