lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text based on locating TLDs.
MIT License

URLExtract() init really slow #129

Open gilbd opened 2 years ago

gilbd commented 2 years ago

Hi, when I use URLExtract() in a function to parse a DataFrame column, it runs really slowly.
Here is my code:

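from urlextract import URLExtract

def extract_urls(lst):
    # A new URLExtract instance is built on every call (once per row under apply).
    extractor = URLExtract()
    count = 0
    for text in lst:
        urls_found = extractor.find_urls(text)
        # MY_URL is defined elsewhere in my code.
        if MY_URL in urls_found:
            count += len(urls_found)
    return count

df['col2'] = df['col1'].apply(extract_urls)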

It takes a long time because of the time spent loading the TLDs and acquiring the FileLocks.
Maybe you could convert this object to a Singleton?
Another idea is to load the TLDs just once by making the TLDs object a Singleton.
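
To illustrate the "build it once" idea from the caller's side, here is a minimal sketch that caches a single extractor instance with functools.lru_cache. It is not the library's implementation: SlowExtractor is a hypothetical stand-in for an object that is expensive to construct, and get_extractor and count_urls are illustrative names that do not exist in URLExtract.

```python
from functools import lru_cache


class SlowExtractor:
    """Hypothetical stand-in for an object that is expensive to construct."""

    instances_created = 0

    def __init__(self):
        # Stands in for costly setup work such as loading a TLD list.
        SlowExtractor.instances_created += 1

    def find_urls(self, text):
        # Toy matcher for illustration only; not how URLExtract finds URLs.
        return [word for word in text.split() if word.startswith("http")]


@lru_cache(maxsize=None)
def get_extractor():
    """Build the extractor on first use and return the same instance afterwards."""
    return SlowExtractor()


def count_urls(texts, wanted_url):
    """Count URLs in every text that contains wanted_url."""
    extractor = get_extractor()  # reused across calls instead of rebuilt
    count = 0
    for text in texts:
        urls_found = extractor.find_urls(text)
        if wanted_url in urls_found:
            count += len(urls_found)
    return count
```

With the real library, the same pattern would presumably work by returning URLExtract() from get_extractor(), so that the function passed to apply no longer constructs a new object for each row. Whether a shared instance is safe across threads or processes is not addressed here.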