lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text by locating TLDs.
MIT License

Passing a custom cache_dir doesn't seem to actually save the tlds...txt file in that dir #118

Open · Yossi opened this issue 2 years ago

Yossi commented 2 years ago
(venv) yossi@ubuntu7:~/testing$ python --version
Python 3.10.2
(venv) yossi@ubuntu7:~/testing$ pip list
Package      Version
------------ -------
filelock     3.6.0
idna         3.3
pip          22.0.3
platformdirs 2.5.1
setuptools   58.1.0
uritools     4.0.0
urlextract   1.5.0
(venv) yossi@ubuntu7:~/testing$ more test.py 
from urlextract import URLExtract
import logging
logging.basicConfig(format='%(asctime)s - %(levelname)s\n%(message)s', level=logging.INFO)

extractor = URLExtract(cache_dir='.')
extractor.update()  # same results with or without this line

urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
print(urls)  # prints: ['janlipovsky.cz']
(venv) yossi@ubuntu7:~/testing$ python test.py 
2022-02-23 23:50:02,092 - INFO
Cache file not found in './tlds-alpha-by-domain.txt'. Use URLExtract.update() to download newest version.
2022-02-23 23:50:02,093 - INFO
Using default list of TLDs provided in urlextract package.
['janlipovsky.cz']
(venv) yossi@ubuntu7:~/testing$ ls -la
total 20
drwxrwxr-x   3 yossi yossi 4096 Feb 23 23:48 .
drwxr-xr-x 120 yossi yossi 4096 Feb 23 23:43 ..
-rw-rw-r--   1 yossi yossi  357 Feb 23 23:48 test.py
-rwxrwxr-x   1 yossi yossi    0 Feb 23 23:50 tlds-alpha-by-domain.txt.lock
drwxrwxr-x   5 yossi yossi 4096 Feb 23 23:47 venv
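
To make the state in the listing above easier to check from a script, here is a minimal sketch that inspects a directory for the two files involved. It uses only the standard library and does not call urlextract at all. The file names are copied from the log message and the `ls` output above; the helper names `describe_cache_dir` and `reproduces_issue` are hypothetical and not part of any library.

```python
import tempfile
from pathlib import Path

# File names taken from the log message and directory listing above.
CACHE_FILE_NAME = "tlds-alpha-by-domain.txt"
LOCK_FILE_NAME = CACHE_FILE_NAME + ".lock"


def describe_cache_dir(cache_dir):
    """Report which of the expected files exist in cache_dir (hypothetical helper)."""
    directory = Path(cache_dir)
    cache_file = directory / CACHE_FILE_NAME
    lock_file = directory / LOCK_FILE_NAME
    cache_exists = cache_file.is_file()
    return {
        "cache_file_exists": cache_exists,
        "cache_file_size": cache_file.stat().st_size if cache_exists else 0,
        "lock_file_exists": lock_file.is_file(),
    }


def reproduces_issue(cache_dir):
    """True when only the lock file is present, as in the listing above."""
    state = describe_cache_dir(cache_dir)
    return state["lock_file_exists"] and not state["cache_file_exists"]


if __name__ == "__main__":
    # Simulate the reported state in a temporary directory: lock file only.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / LOCK_FILE_NAME).touch()
        print(describe_cache_dir(tmp))
        print(reproduces_issue(tmp))
```

Running `reproduces_issue('.')` right after `test.py` in the same directory would tell whether the lock file was created without the TLD list next to it, which is the behaviour reported here.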