lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License

URL containing space is truncated #95

Closed by begunrom 2 years ago

begunrom commented 3 years ago

I am extracting URLs from mail messages. Suppose you have a text like this: MESSAGE STATUS CODE: UNDELIVERY LISTEN NOW <http://EXAMPLE.com/.amVucy5iaXJ-some-base64-t3b29sLmNvbQ== #aHR0cHM6Ly9taWNyb3NvZ-something-else-pcmdlcnNzb25Acm9ja3dvb2wuY29t>

The URL is truncated at the '==' because there is a space between the '==' and the '#aH'.

I can remove the space before processing the text, but I do not know what side effects that could have.

lipoja commented 3 years ago

If you know that every email you are extracting from is formatted this way, you can update stop_left_chars and remove the space character from the list, so that extraction stops at the ">" sign instead. Use get_stop_chars_left() to get the current list, remove the space character from it, and pass the result to set_stop_chars_left().
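The steps in the reply above could be sketched roughly as follows. This is a minimal sketch, not the library's documented recipe: the helper name drop_space_from_left_stop_chars is hypothetical, and it assumes the extractor's get_stop_chars_left() returns a collection of single characters and that set_stop_chars_left() accepts a set.

```python
def drop_space_from_left_stop_chars(extractor):
    """Remove the space character from an extractor's left stop characters.

    `extractor` is assumed to expose get_stop_chars_left() and
    set_stop_chars_left(), the two methods named in the reply above.
    This helper is hypothetical and not part of URLExtract.
    """
    # Copy into a new set so the extractor's current collection is not mutated.
    stop_chars = set(extractor.get_stop_chars_left())
    # discard() does nothing if the space is already absent.
    stop_chars.discard(" ")
    extractor.set_stop_chars_left(stop_chars)
    return extractor
```

With the real library this would presumably be called as drop_space_from_left_stop_chars(URLExtract()) after `from urlextract import URLExtract`. Whether changing the left-side list alone is enough for a space that sits to the right of the domain has not been verified here; if the library exposes a matching right-side list, the same approach would apply to it.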

The side effect might be a request to a different URL. When you put this URL in a browser, the space is percent-encoded and sent as part of the request, so you might not get the correct response; in the worst case, you get an HTTP 404.

Note: I did not try your URL because it looks like phishing to me. I have therefore updated your comment so that nobody else tries it by accident.
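To illustrate the encoding mentioned above, here is a small sketch using Python's standard urllib.parse.quote. The path "/.abc== " is a made-up, shortened stand-in for the masked URL in the report, and encode_path is a hypothetical helper name.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode a URL path, leaving "/" and "=" untouched."""
    return quote(path, safe="/=")


# The trailing space becomes %20, so the server sees a different path
# than the one that ends at "==".
print(encode_path("/.abc== "))  # /.abc==%20
```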

begunrom commented 3 years ago

Thank you for masking the URL; it was indeed phishing.

Since I expect randomly formatted URLs, I cannot remove the spaces all the time. So I decided to run the URL extraction twice, once with the space as a stop character and once without, and compare the results. If the results differ, I know the text contains a URL that needs the space included. Maybe that could be implemented as a standard way to do extractions?
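The two-pass comparison described above might look something like this. It is only a sketch under stated assumptions: urls_needing_space is a hypothetical helper, and the two regex-based extractors are toy stand-ins for a URLExtract instance configured with and without the space as a stop character. They are not how URLExtract finds URLs.

```python
import re
from typing import Callable, List

Extractor = Callable[[str], List[str]]


def urls_needing_space(text: str, strict: Extractor, loose: Extractor) -> List[str]:
    """Return URLs found by the space-tolerant pass but not by the strict pass."""
    strict_urls = set(strict(text))
    return [url for url in loose(text) if url not in strict_urls]


# Toy stand-ins (assumptions for illustration only):
# strict stops at whitespace or angle brackets, loose only at brackets or line ends.
def toy_strict(text: str) -> List[str]:
    return re.findall(r"https?://[^\s<>]+", text)


def toy_loose(text: str) -> List[str]:
    return re.findall(r"https?://[^<>\r\n]+", text)


sample = "LISTEN NOW <http://EXAMPLE.com/.abc== #def>"
print(urls_needing_space(sample, toy_strict, toy_loose))
# ['http://EXAMPLE.com/.abc== #def']
```

An empty result means both passes agree, so the single strict pass would have been enough for that text.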

lipoja commented 2 years ago

I think this issue is specific to your case. I would rather not slow down the extraction by doing two passes. If necessary, users can do it themselves, as you did.

However, thank you for taking the time to report this issue.