lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from given text based on locating TLDs.
MIT License

SyntaxError: (unicode error) #72

Closed bmfirst closed 4 years ago

bmfirst commented 4 years ago

Hi Jan,

Please help: when I add the input file path, I get

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

from urlextract import URLExtract

extractor = URLExtract()
urls = extractor.find_urls('C:\path\test1.txt')
print(urls)  # prints: ['janlipovsky.cz']

I would like to extract URLs from a text file containing Unicode characters (including hostnames) and to create an output file for easy review. Please tell me, is that possible?

Sorry for the probably irrelevant question, but I cannot find anything in the online docs or previous issues; I'm just beginning to use Python.

Regards

lipoja commented 4 years ago

Hi @bmfirst, it does not work the way you are using it. The line urls = extractor.find_urls('C:\path\test1.txt') tries to find URLs in the text "'C:\path\test1.txt'", not in the file. If you want to extract URLs from a file, you could use the command-line version to extract them from the text (try running urlextract C:\path\test1.txt in a terminal). But to be honest, I have not tested it on Windows.

Or if you want to do it in python then you should do:

from urlextract import URLExtract

extractor = URLExtract()
urls = []
with open('C:\path\test1.txt') as f:
    for line in f:
        tmp_urls = extractor.find_urls(line)
        urls.extend(tmp_urls)  # find_urls() returns a list, so extend keeps urls flat

print(urls)
bmfirst commented 4 years ago

Hi Jan, thank you very much for fast reply :)

Please note I did run the command in Python but I got an error:

File "", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

Also, is it possible to print urls to output file?

Regards

lipoja commented 4 years ago

Hi @bmfirst, I googled the error and it might be because of the string notation for the path. See https://stackoverflow.com/questions/37400974/unicode-error-unicodeescape-codec-cant-decode-bytes-in-position-2-3-trunca

So I would suggest trying this:

from urlextract import URLExtract

extractor = URLExtract()
urls = []
with open(r'C:\path\test1.txt') as f:
    for line in f:
        tmp_urls = extractor.find_urls(line)
        urls.extend(tmp_urls)  # find_urls() returns a list, so extend keeps urls flat

print(urls)

Or other versions: open("C:/path/test1.txt") or open("C:\\path\\test1.txt"). As I said before, I don't have a place to try it out. I stopped using Windows back in 2006.
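To make the difference between these notations concrete, here is a small sketch (the path is just a placeholder): in this particular path, '\t' silently becomes a tab character, while a segment starting with '\U' (as in the reported error) would be a SyntaxError outright.

```python
plain = 'C:\path\test1.txt'      # '\t' becomes a tab character here
raw = r'C:\path\test1.txt'       # raw string: backslashes kept literally
doubled = 'C:\\path\\test1.txt'  # doubled backslashes escape themselves

print(raw == doubled)  # True: both spell the same path
print(raw == plain)    # False: the plain literal contains a tab
```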

And yes, you can save URLs to a file. However, that is basic Python knowledge. I would recommend watching/reading some Python tutorials; they will help you learn the basics (such as saving data to a file).

If you replace the print from above with this, it will save all found URLs to a file:

with open('found_urls.txt', 'w') as furls:
    for url in urls:
        furls.write(url + '\n')  # in text mode, '\n' is translated to the OS line ending
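Since the input file contains Unicode hostnames, it may also be worth passing an explicit encoding when opening both files. A minimal sketch, using a stand-in list instead of real extractor output:

```python
# Stand-in for the list that extractor.find_urls() would produce
urls = ['janlipovsky.cz', '例え.テスト']

# encoding='utf-8' keeps the output readable regardless of the
# platform's default encoding (important on Windows)
with open('found_urls.txt', 'w', encoding='utf-8') as furls:
    for url in urls:
        furls.write(url + '\n')
```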

Have a nice day, Jan

lipoja commented 4 years ago

If you have any other issues with URLExtract, feel free to reopen this issue or create a new one. Good luck with coding.