scrapinghub / dateparser

Python parser for human-readable dates
BSD 3-Clause "New" or "Revised" License

fix memory leak in cache - add settings.CACHE_SIZE_LIMIT #1140

Closed chebotarevmichael closed 1 year ago

chebotarevmichael commented 1 year ago

PROBLEM: memory leak.

import dateparser
from datetime import datetime

# ~1.5 GB of memory still held after the function returns
def hard_leak():
    for _ in range(3000):
        # every call leaks ~0.55 MB
        dateparser.parse('dasdasd', settings={'RELATIVE_BASE': datetime.utcnow()})

# ~27 MB of memory still held after the function returns
def light_leak():
    for _ in range(3000):
        # every call leaks ~0.01 MB
        dateparser.parse('12.01.2021', settings={'RELATIVE_BASE': datetime.utcnow()})

After each call to dateparser.parse, a new item is added to each of these cache dictionaries (an illustrative sketch follows the list):

    _split_regex_cache = {}
    _sorted_words_cache = {}
    _split_relative_regex_cache = {}
    _sorted_relative_strings_cache = {}
    _match_relative_regex_cache = {}
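
The mechanism, as far as I can tell, is that these caches are keyed by the parsing settings. A stand-alone illustration of that assumption (not dateparser's actual code) showing why an ever-changing RELATIVE_BASE grows the cache on every call:

from datetime import datetime

_cache = {}

def cached_compile(pattern, settings):
    # The key incorporates the settings; a fresh RELATIVE_BASE on each
    # call means a fresh key, so the dictionary grows without bound.
    key = (pattern, tuple(sorted(settings.items())))
    if key not in _cache:
        _cache[key] = object()  # stand-in for a compiled regex
    return _cache[key]

for _ in range(3):
    cached_compile('12.01.2021', {'RELATIVE_BASE': datetime.utcnow()})

print(len(_cache))  # usually 3 -- one entry per distinct timestamp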

After 3000 calls we find 3000 items in each of these dictionaries, and that memory is never released: the caches appear to be keyed by the settings, and RELATIVE_BASE=datetime.utcnow() produces a different key on every call (a workaround sketch follows). This forces us to stop using the module.
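
Assuming the growth is indeed driven by the changing RELATIVE_BASE, one workaround is to pin the base once per batch so every call reuses the same cache entry. A sketch, not a fix:

import dateparser
from datetime import datetime

def no_leak():
    # Pinning RELATIVE_BASE keeps the settings (and thus the cache key)
    # identical across calls, so the caches stop growing.
    base = datetime.utcnow()
    for _ in range(3000):
        dateparser.parse('12.01.2021', settings={'RELATIVE_BASE': base})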

SOLUTION: add a setting (CACHE_SIZE_LIMIT) capping the maximum number of items in each cache.
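
A minimal sketch of what such a limit could look like (a hypothetical helper; the actual patch may use a different eviction policy, e.g. LRU):

CACHE_SIZE_LIMIT = 1000  # hypothetical default

def _cache_set(cache, key, value):
    # Crude size cap: reset the whole cache once the limit is reached.
    # An LRU policy (collections.OrderedDict / functools.lru_cache)
    # would keep hot entries instead of dropping everything.
    if len(cache) >= CACHE_SIZE_LIMIT:
        cache.clear()
    cache[key] = value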