lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from given text based on locating TLDs.
MIT License

Best practices for using a URLExtract object for speed? #84

Closed dfrankow closed 3 years ago

dfrankow commented 3 years ago

I am using this object in a Django request to parse several URLs. I think it's slowing down my requests significantly.

The "Profile" tab of django debug toolbar for processing 9 strings:

[image: Profile tab screenshot]

I'm instantiating URLExtract once per string, which is likely wasteful.

I think I could instantiate one URLExtract object and keep it around to use it.

Does it have any state from string matching that would make that a bad idea? Should I make some Django middleware to make it per-request? Can it live longer than that (e.g., one per python process)?

What are best practices for keeping a URLExtract object around?

dfrankow commented 3 years ago

To be clear, if it's safe, the very easiest thing is one per process:

from urlextract import URLExtract

# One shared instance per Python process
_the_extractor = URLExtract()

def find_urls(string):
    return _the_extractor.find_urls(string)

I'm trying to understand if I can do that, or if that would have weird failure modes.

For example, I wonder whether that would get wacky if I ever use Django async or have multiple events going on at once.

dfrankow commented 3 years ago

Another way to put it is: does find_urls change any object state? If it doesn't, then I'm golden.

lipoja commented 3 years ago

Hello @dfrankow, thank you for your question and sorry for my late answer :)

The find_urls method does not change anything in the object, so I think it should be safe to instantiate it just once. At least that is how I use it: I have one instance that is configured (with specific stop characters), and then I just call find_urls with the input string.

However, I do not use it from multiple threads or processes.
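If sharing one instance across threads feels too risky, one conservative option is to give each thread its own extractor. The following is only a sketch, not something from the library: `get_extractor` and `_local` are hypothetical names, and `factory` is assumed to be any zero-argument callable such as the `URLExtract` class itself.

```python
import threading

# Thread-local storage: each thread sees its own `extractor` attribute.
_local = threading.local()


def get_extractor(factory):
    """Return the calling thread's extractor, building it on first use.

    `factory` is any zero-argument callable (hypothetical helper argument),
    for example the URLExtract class.
    """
    extractor = getattr(_local, "extractor", None)
    if extractor is None:
        extractor = factory()
        _local.extractor = extractor
    return extractor
```

With this in place a view could call `get_extractor(URLExtract).find_urls(text)`. The construction cost is then paid once per thread rather than once per string, at the price of holding one instance per thread.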

lipoja commented 3 years ago

Only the methods starting with set_, plus remove_enclosure, add_enclosure, update, and update_when_older, and the properties ignore_list, extract_localhost, and extract_email change the object.
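Given that list, one way to keep a shared instance from being reconfigured by accident is to hand out a thin wrapper that exposes only the non-mutating calls. This is a sketch under assumptions: `ReadOnlyExtractor` is a hypothetical name, not part of URLExtract, and it simply forwards to whatever `find_urls` and `has_urls` the wrapped object provides.

```python
class ReadOnlyExtractor:
    """Expose only the non-mutating methods of an already configured extractor.

    Hypothetical wrapper: the set_* methods, update, and the configuration
    properties of the wrapped object are deliberately not forwarded.
    """

    def __init__(self, extractor):
        self._extractor = extractor

    def find_urls(self, text, *args, **kwargs):
        # Forward any extra arguments unchanged to the wrapped extractor.
        return self._extractor.find_urls(text, *args, **kwargs)

    def has_urls(self, text, *args, **kwargs):
        return self._extractor.has_urls(text, *args, **kwargs)
```

The idea would be to configure a `URLExtract` instance once at import time, wrap it, and share only the wrapper with request-handling code.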

dfrankow commented 3 years ago

Thanks for the info!

I am pretty sure that if find_urls and has_urls do not modify the object, then I'm safe. I am not sure how to gain more certainty (other than maybe writing multi-threaded tests), so I'll close this.
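For anyone who does want the multi-threaded test, a rough sketch could look like the following. The helper name `concurrent_matches_sequential` and its parameters are made up for illustration; `find` is assumed to be any callable that takes one string, such as the bound `find_urls` method of a shared extractor. Passing this check does not prove thread safety, it only fails to find a problem.

```python
from concurrent.futures import ThreadPoolExecutor


def concurrent_matches_sequential(find, texts, workers=8, repeats=50):
    """Return True if calling `find` from many threads gives the same
    results as calling it on one text at a time."""
    # Baseline computed sequentially on the calling thread.
    expected = [find(text) for text in texts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(repeats):
            # pool.map preserves input order, so results line up with `expected`.
            if list(pool.map(find, texts)) != expected:
                return False
    return True
```

A call such as `concurrent_matches_sequential(_the_extractor.find_urls, sample_texts)` would exercise the shared instance, where `sample_texts` is a list of strings you supply.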