lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License
245 stars, 61 forks

Fixed Issue #43 #47

Closed gleb-shnshn closed 3 years ago

gleb-shnshn commented 5 years ago

I have added a check for commas and a test for this case.

lipoja commented 5 years ago

Please take a look at the patch. It probably broke a few things, because almost all tests failed.

I am also not confident about this patch. Please have a look at RFC 3986. If I read it correctly, a comma is a valid character that can appear in a hostname. I am still thinking about cases which might be correct but which we would filter out.

What is your opinion about this?

gleb-shnshn commented 5 years ago

Sorry, I hurried a little bit. I see why it failed, so I'll try to fix it in time. I also realized that my solution is not quite correct, because commas are used in the part of the URL after the slash, e.g. http://www.sample.com/forum/read.php?13,35869 . I didn't see any other ways to use them in a URL.

Hence, I think we need to split the URL into 2 parts: before the slash and after it. The address is not valid if the first part contains commas.
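
As a rough illustration of the split described above (a sketch only, not the code from this patch; the function name `has_comma_before_path` is hypothetical):

```python
def has_comma_before_path(url: str) -> bool:
    """Return True if the part of `url` before the first path slash contains a comma."""
    # Drop the scheme separator so its slashes are not mistaken for the path.
    rest = url.split("://", 1)[-1]
    # Everything up to the first "/" is treated as the "first part".
    before_slash = rest.split("/", 1)[0]
    return "," in before_slash


# Commas after the slash do not invalidate the URL under this rule.
print(has_comma_before_path("http://www.sample.com/forum/read.php?13,35869"))
# A comma before the slash does.
print(has_comma_before_path("http://exa,mple.com/index.html"))
```

Note that this naive split also counts credentials and a query string without a preceding slash as part of the "first part", so it rejects more than just commas in the hostname.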

lipoja commented 5 years ago

I am not sure about commas in the host part of the URL either. From the RFC:

host         = IP-literal / IPv4address / reg-name
reg-name     = *( unreserved / pct-encoded / sub-delims )
sub-delims   = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

So this problem is not as easy as it looks at first sight.
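
To make the grammar above concrete, here is a minimal sketch of the `reg-name` rule as a Python regular expression. The helper name `is_rfc3986_reg_name` is made up for illustration and is not part of URLExtract. It checks RFC 3986 syntax only; DNS host names are in practice limited to letters, digits and hyphens, which is stricter.

```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = r"A-Za-z0-9\-._~"
# sub-delims, as quoted from the RFC above (note that "," is included)
SUB_DELIMS = r"!$&'()*+,;="
# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = re.compile(rf"(?:[{UNRESERVED}{SUB_DELIMS}]|{PCT_ENCODED})*")


def is_rfc3986_reg_name(host: str) -> bool:
    """Return True if `host` matches the RFC 3986 reg-name production."""
    return REG_NAME.fullmatch(host) is not None


print(is_rfc3986_reg_name("exa,mple.com"))  # comma is a sub-delim
print(is_rfc3986_reg_name("exa mple.com"))  # space is not allowed
```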

gleb-shnshn commented 5 years ago

I have never seen any URLs with commas in the host part, and browsers don't recognize them as part of a URL. So even if it is possible, they appear too rarely to be worth handling in the extractor.

lipoja commented 5 years ago

Should we also somehow consider inputs like this:

This is text with URL right after comma,subdomain.example.com

In this case it would be nice if we could return subdomain.example.com and not just skip it.

I went through the code to remind myself how it is extracted. It already does not support commas in the domain name. But subdomains are a little bit different. I agree with you that a comma should not be in the domain name, but what about subdomains? Should we treat them like the domain name and not allow commas?

Or what about ftp://login:pass,word@example.com ?

gleb-shnshn commented 5 years ago

Is there any example of a subdomain that contains commas? I think 'ftp://login:pass,word@example.com' is a valid input, and I'll try to patch it. But to fix 'This is text with URL right after comma,subdomain.example.com' the whole processing logic would have to be reconsidered, and I don't know how.
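
For the credentials case, one possible approach (a sketch under the assumption that the candidate already carries a scheme; `comma_in_hostname` is a hypothetical helper, not URLExtract API) is to let the standard library separate the userinfo from the host before looking for commas:

```python
from urllib.parse import urlsplit


def comma_in_hostname(url: str) -> bool:
    """Return True only if the hostname itself contains a comma.

    Commas in the userinfo (login:password), path or query are ignored.
    """
    # hostname is None when no network location could be parsed.
    hostname = urlsplit(url).hostname or ""
    return "," in hostname


parts = urlsplit("ftp://login:pass,word@example.com")
print(parts.username, parts.password, parts.hostname)

print(comma_in_hostname("ftp://login:pass,word@example.com"))
print(comma_in_hostname("http://exa,mple.com/"))
```

This does not help with scheme-less input such as `comma,subdomain.example.com`, because `urlsplit` only recognizes a network location after `//`.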

lipoja commented 4 years ago

@gleb270 Hello, I am sorry for being such an unreachable maintainer. Could I ask you a favor? Would you be so kind as to resolve the conflicts? Thank you!

lipoja commented 4 years ago

Hi @gleb270, please have a look at my comments. Could we discuss this topic a little bit more? What I mentioned there is that I do not think filtering everything out is a good choice.

My point of view is to give users the ability to tune and tweak this library through settings. Therefore I was always aiming to extract more rather than less. The user can then process the extracted URLs and/or tune this library by setting stop characters to fit their needs.

lipoja commented 4 years ago

Hi @jayvdb, since you are the heaviest contributor these days, I would like to know your opinion on this PR and the issue in general.

jayvdb commented 4 years ago

This would make URLExtract unusable for my use case. I expect URLExtract to give me more rather than less. I can post-process for validity based on the application's needs. URLExtract loses its value if I need to add my own extraction for potential hits that URLExtract omits.

lipoja commented 3 years ago

I am closing this PR since there is no progress, and right now it is not an ideal solution and may introduce issues.