lipoja / URLExtract

URLExtract is python class for collecting (extracting) URLs from given text based on locating TLD.
MIT License
241 stars 61 forks source link

Allow setting of a base dir for url cache #24

Closed keyz182 closed 5 years ago

keyz182 commented 5 years ago

I'm using this library in an application deployed on AWS Elastic Beanstalk. In it's environment, os.path.dirname(__file__) would resolve to an unwriteable directory. I've added the optional cache_dir parameter to the constructor, defaulting to the value it was previously, to allow code to explicitly set a base cache directory.

lipoja commented 5 years ago

Hi Kieran, thank you for the pull request.

I would like to know your opinion about logging. When it has the ability to change the cache_dir should we warn user when the cache_dir is not accessible (around line 86: if not os.access(dir_path, os.W_OK):).

I think you might be interested to see if the path that you provided is not writable. How do you see from users point of view?

keyz182 commented 5 years ago

While logging may have helped us find the issue a little sooner, the eventual error thrown (/home/wsgi/.urlextract_tlds not writeable or similar) guided us directly to the issue pretty quickly. Though, from a good practice point-of-view, logging that the code has reached a condition where it needs to fall back may be the best approach.

For us, on Elastic Beanstalk, both the current directory, and the home directory were not writeable (~ resolved to /home/wsgi which I don't think even existed. Would falling back to the temp directory be an option if the current dir is unreadable, rather than the home directory? Or make a temp directory the 3rd fallback?

The way we're using my branch currently is with the temp dir shown below, and is a cross platform compatible way of finding a temp directory.

import tempfile
extractor = URLExtract(cache_dir=tempfile.gettempdir())
keyz182 commented 5 years ago

I suppose you loose the cache then across reboots potentially, so may not be the most desirable, but as a fallback, may be acceptable.

lipoja commented 5 years ago

Thank you for your time, opinions and ideas! I really appreciate it. It is really good idea with the temp dir so I've crated issue (#25) for it. If you have time you and are interested in helping with it just write comment to the issue otherwise so I know you (or somebody else) will do it or I will work on it when I have free time.

Thank you and have a nice day.