kurtmckee / feedparser

Parse feeds in Python
https://feedparser.readthedocs.io

Memory fragmentation prevents memory release on Linux #287

Open Rongronggg9 opened 3 years ago

Rongronggg9 commented 3 years ago

Code to reproduce

feeds.tar.gz

import gc
import os
import colorlog
import psutil
from concurrent import futures
from feedparser import parse

# from memory_profiler import profile

colorlog.basicConfig(format='%(log_color)s%(asctime)s:%(levelname)s - %(message)s',
                     datefmt='%Y-%m-%d-%H:%M:%S',
                     level=colorlog.DEBUG)
logger = colorlog.getLogger()

def get_memory_usage():
    return f'Memory usage: {(psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024):.2f} MiB'

# @profile
def monitor(rss_content):
    rss_d = parse(rss_content, sanitize_html=False)
    if rss_d is None:
        return

    logger.debug('Parsed! ' + get_memory_usage())

    del rss_d
    gc.collect()
    logger.debug('Garbage collected! ' + get_memory_usage())
    return

# @profile  # if memory_profiler enabled, would not leak but runs slowly
def would_leak_1(feed_list):
    logger.info('would_leak_1 started! ' + get_memory_usage())

    for feed_content in feed_list:
        monitor(feed_content)

    logger.info('would_leak_1 finished! ' + get_memory_usage())

    gc.collect()

    logger.info('would_leak_1 garbage collected! ' + get_memory_usage())

# @profile
def would_leak_2(feed_list):
    logger.info('would_leak_2 started! ' + get_memory_usage())

    with futures.ThreadPoolExecutor(max_workers=1) as pool:
        for feed_content in feed_list:
            pool.submit(monitor, feed_content).result()

    logger.info('would_leak_2 finished! ' + get_memory_usage())

    gc.collect()

    logger.info('would_leak_2 garbage collected! ' + get_memory_usage())

# @profile
def main():
    logger.info('Started! ' + get_memory_usage())

    feed_list = []
    feeds = os.listdir('feeds')  # tons of feed.xml
    for feed in feeds:
        with open('feeds/' + feed, 'rb') as f:
            feed_list.append(f.read())

    logger.info('Feeds loaded into memory! ' + get_memory_usage())

    would_leak_1(feed_list)
    would_leak_2(feed_list)

    gc.collect()
    logger.info('Done! ' + get_memory_usage())

    del feed_list
    del feeds
    gc.collect()
    logger.info('Feeds in memory cleared! ' + get_memory_usage())
    return

if __name__ == '__main__':
    main()

My tests

feedparser 6.0.8

Debian GNU/Linux 11 (bullseye) on WSL (CPython 3.9.2) - Leaked!

neofetch
```
***@***
OS: Debian GNU/Linux 11 (bullseye) on Windows 10 x86_64
Kernel: 5.10.43.3-microsoft-standard-WSL2
Uptime: 3 hours, 13 mins
Packages: 1939 (dpkg)
Shell: zsh 5.8
Theme: Breeze [GTK2/3]
Icons: breeze [GTK2/3]
Terminal: Windows Terminal
CPU: Intel i7-10510U (8) @ 2.304GHz
GPU: f549:00:00.0 Microsoft Corporation Device 008e
Memory: 487MiB / 1917MiB
```
2021-10-04-07:37:34:INFO - Started! Memory usage: 42.16 MiB
2021-10-04-07:37:34:INFO - Feeds loaded into memory! Memory usage: 68.00 MiB
2021-10-04-07:37:34:INFO - would_leak_1 started! Memory usage: 68.00 MiB
2021-10-04-07:37:53:INFO - would_leak_1 finished! Memory usage: 105.77 MiB
2021-10-04-07:37:53:INFO - would_leak_1 garbage collected! Memory usage: 105.77 MiB
2021-10-04-07:37:53:INFO - would_leak_2 started! Memory usage: 105.77 MiB
2021-10-04-07:38:12:INFO - would_leak_2 finished! Memory usage: 165.69 MiB
2021-10-04-07:38:12:INFO - would_leak_2 garbage collected! Memory usage: 108.25 MiB
2021-10-04-07:38:12:INFO - Done! Memory usage: 108.25 MiB
2021-10-04-07:38:12:INFO - Feeds in memory cleared! Memory usage: 93.86 MiB

Debian GNU/Linux 11 (bullseye) on Azure b1s (CPython 3.9.2) - Leaked!

neofetch
```
***@***
OS: Debian GNU/Linux 11 (bullseye) x86_64
Host: Virtual Machine Hyper-V UEFI Release v4.1
Kernel: 5.10.0-8-cloud-amd64
Uptime: 4 days, 6 hours, 9 mins
Packages: 681 (dpkg)
Shell: bash 5.1.4
Terminal: /dev/pts/2
CPU: Intel Xeon E5-2673 v4 (1) @ 2.294GHz
Memory: 563MiB / 913MiB
```
2021-10-03-23:35:10:INFO - Started! Memory usage: 20.17 MiB
2021-10-03-23:35:10:INFO - Feeds loaded into memory! Memory usage: 50.20 MiB
2021-10-03-23:35:10:INFO - would_leak_1 started! Memory usage: 50.46 MiB
2021-10-03-23:35:28:INFO - would_leak_1 finished! Memory usage: 94.25 MiB
2021-10-03-23:35:28:INFO - would_leak_1 garbage collected! Memory usage: 94.25 MiB
2021-10-03-23:35:28:INFO - would_leak_2 started! Memory usage: 94.25 MiB
2021-10-03-23:35:45:INFO - would_leak_2 finished! Memory usage: 152.66 MiB
2021-10-03-23:35:45:INFO - would_leak_2 garbage collected! Memory usage: 152.66 MiB
2021-10-03-23:35:45:INFO - Done! Memory usage: 152.66 MiB
2021-10-03-23:35:45:INFO - Feeds in memory cleared! Memory usage: 73.13 MiB

AOSC OS aarch64 (CPython 3.8.6) - Leaked!

neofetch
```
root@tmp-8d740a05
OS: AOSC OS aarch64
Host: Pine64 RockPro64 v2.0
Kernel: 5.12.13-aosc-rk64
Uptime: 61 days, 17 hours, 31 mins
Packages: 441 (dpkg)
Shell: bash 5.1.8
CPU: (6) @ 1.416GHz
Memory: 216MiB / 3868MiB
```
2021-10-04-09:00:49:INFO - Started! Memory usage: 17.85 MiB
2021-10-04-09:00:49:INFO - Feeds loaded into memory! Memory usage: 43.87 MiB
2021-10-04-09:00:49:INFO - would_leak_1 started! Memory usage: 44.13 MiB
2021-10-04-09:01:39:INFO - would_leak_1 finished! Memory usage: 90.15 MiB
2021-10-04-09:01:39:INFO - would_leak_1 garbage collected! Memory usage: 90.15 MiB
2021-10-04-09:01:39:INFO - would_leak_2 started! Memory usage: 90.15 MiB
2021-10-04-09:02:30:INFO - would_leak_2 finished! Memory usage: 131.78 MiB
2021-10-04-09:02:30:INFO - would_leak_2 garbage collected! Memory usage: 131.78 MiB
2021-10-04-09:02:30:INFO - Done! Memory usage: 131.78 MiB
2021-10-04-09:02:31:INFO - Feeds in memory cleared! Memory usage: 131.78 MiB

Armbian bullseye (21.08.2) aarch64 (CPython 3.9.2) - Leaked!

neofetch
```
***@***
OS: Armbian bullseye (21.08.2) aarch64
Host: Pine H64 model B
Kernel: 5.10.60-sunxi64
Uptime: 1 hour, 56 mins
Packages: 1098 (dpkg)
Shell: zsh 5.8
Terminal: /dev/pts/0
CPU: sun50iw1p1 (4) @ 1.800GHz
Memory: 817MiB / 1989MiB
```
2021-10-08-17:22:46:INFO - Started! Memory usage: 19.61 MiB
2021-10-08-17:22:47:INFO - Feeds loaded into memory! Memory usage: 46.16 MiB
2021-10-08-17:22:47:INFO - would_leak_1 started! Memory usage: 46.16 MiB
2021-10-08-17:24:03:INFO - would_leak_1 finished! Memory usage: 87.75 MiB
2021-10-08-17:24:03:INFO - would_leak_1 garbage collected! Memory usage: 87.75 MiB
2021-10-08-17:24:03:INFO - would_leak_2 started! Memory usage: 87.75 MiB
2021-10-08-17:25:20:INFO - would_leak_2 finished! Memory usage: 125.73 MiB
2021-10-08-17:25:20:INFO - would_leak_2 garbage collected! Memory usage: 126.00 MiB
2021-10-08-17:25:20:INFO - Done! Memory usage: 126.00 MiB
2021-10-08-17:25:20:INFO - Feeds in memory cleared! Memory usage: 106.37 MiB

Windows 11 22000.194 (CPython 3.9.2) - Leaked only slightly, which can be ignored.

neofetch
```
***@***
OS: Windows 11 x86_64
Host: ***
Kernel: 10.0.22000
Uptime: 9 hours, 26 mins
Packages: 3 (scoop)
Shell: bash 4.4.23
Resolution: 1920x1080
DE: Aero
WM: Explorer
WM Theme: Custom
Terminal: Windows Terminal
CPU: Intel i7-10510U (8) @ 2.310GHz
Memory: 14760MiB / 24329MiB
```
2021-10-04-07:50:52:INFO - Started! Memory usage: 23.91 MiB
2021-10-04-07:50:52:INFO - Feeds loaded into memory! Memory usage: 49.70 MiB
2021-10-04-07:50:52:INFO - would_leak_1 started! Memory usage: 49.74 MiB
2021-10-04-07:51:08:INFO - would_leak_1 finished! Memory usage: 57.93 MiB
2021-10-04-07:51:08:INFO - would_leak_1 garbage collected! Memory usage: 57.93 MiB
2021-10-04-07:51:08:INFO - would_leak_2 started! Memory usage: 57.93 MiB
2021-10-04-07:51:26:INFO - would_leak_2 finished! Memory usage: 57.11 MiB
2021-10-04-07:51:26:INFO - would_leak_2 garbage collected! Memory usage: 57.11 MiB
2021-10-04-07:51:26:INFO - Done! Memory usage: 57.11 MiB
2021-10-04-07:51:26:INFO - Feeds in memory cleared! Memory usage: 30.46 MiB

Windows 11 22000.194 (PyPy 7.3.5, Python 3.7.10) - Leaked!

2021-10-04-07:55:34:INFO - Started! Memory usage: 45.91 MiB
2021-10-04-07:55:34:INFO - Feeds loaded into memory! Memory usage: 81.40 MiB
2021-10-04-07:55:34:INFO - would_leak_1 started! Memory usage: 81.40 MiB
2021-10-04-07:55:56:INFO - would_leak_1 finished! Memory usage: 113.78 MiB
2021-10-04-07:55:56:INFO - would_leak_1 garbage collected! Memory usage: 113.78 MiB
2021-10-04-07:55:56:INFO - would_leak_2 started! Memory usage: 113.78 MiB
2021-10-04-07:56:22:INFO - would_leak_2 finished! Memory usage: 122.85 MiB
2021-10-04-07:56:22:INFO - would_leak_2 garbage collected! Memory usage: 122.85 MiB
2021-10-04-07:56:22:INFO - Done! Memory usage: 122.86 MiB
2021-10-04-07:56:22:INFO - Feeds in memory cleared! Memory usage: 84.39 MiB

Note

If I run would_leak_1 and would_leak_2 separately, their leaking behavior seems the same. However, running them one after the other in the same process does make the one that runs second leak less under some conditions, as you can see above.

Rongronggg9 commented 2 years ago

I got more data in production.

I have two instances of https://github.com/Rongronggg9/RSS-to-Telegram-Bot on the same VPS, one with ~4000 feeds and the other with ~3000 feeds. The bot checks the feeds for updates frequently. I noticed that the amount of memory leaked grows roughly logarithmically with the number of feeds. Parsing the same feed multiple times (whether it is unchanged or has been updated) leaks less than parsing different feeds once each, and once the same feed has been parsed many times, the leakage hardly increases any further. That is to say, the amount of memory leaked also grows roughly logarithmically with the number of parses.

I guess the leaked objects can somehow be reused? If that's true, it will be a helpful clue to figuring out the cause of memory leakage.

Related: https://github.com/kurtmckee/feedparser/pull/302#issuecomment-1133549551

lemon24 commented 2 years ago

Hi, coming here from your comment on #302.

I ran a few tests where I called feedparser.parse() in a loop and measured memory usage (details below). I tried two feeds, one 2M and one 50K, both loaded from disk; I did this both on macOS and on Ubuntu.

The results are as you describe: the max RSS increases in what looks like a logarithmic curve; that is, after enough iterations (10-100), the max RSS curve becomes almost horizontal (stable).

However, I am not convinced this is a memory leak in feedparser.

Rather, I think it's a side-effect of how Python memory allocation works. Specifically, Python never releases allocated memory back to the operating system (1, 2, 3), but keeps it around and reuses it. (Because of this, running gc.collect() will never decrease RSS.)

I assume the initial sharper memory increase is due to fragmentation (even if there's enough memory available, it's not in a contiguous chunk, so the allocator has to allocate additional memory); as more and more memory is allocated and then released (in the pool), it becomes easier to find a contiguous chunk.

It makes sense for #302 to make max RSS stabilize faster, since it reduces the number of allocations – and more importantly, the number of big (whole feed) allocations (which reduces the impact of fragmentation).

It might be possible to confirm this 100% by measuring the used memory as seen by the Python allocator, instead of max RSS.
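
One way to try this is the standard-library tracemalloc module, which reports memory allocated through Python's own allocators rather than the process RSS. The following is only a minimal sketch: `traced_after_each_run` and `stand_in_workload` are hypothetical names, and the stand-in workload would have to be replaced by a call to feedparser.parse() on a real feed to say anything about feedparser itself.

```python
import gc
import tracemalloc


def traced_after_each_run(workload, iterations=5):
    """Run workload() repeatedly and return the number of bytes that
    tracemalloc still considers allocated after each run."""
    tracemalloc.start()
    readings = []
    try:
        for _ in range(iterations):
            workload()
            gc.collect()  # drop reference cycles before measuring
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
    finally:
        tracemalloc.stop()
    return readings


def stand_in_workload():
    # Hypothetical stand-in for feedparser.parse(...): builds and then
    # drops many small objects, as a parser would.
    data = [{"id": i, "title": "x" * 50} for i in range(10_000)]
    del data


if __name__ == "__main__":
    for i, n in enumerate(traced_after_each_run(stand_in_workload)):
        print(f"{i:>8}  {n / 1024:>10.1f} KiB")
```

If the traced figure stays flat across iterations while max RSS keeps creeping up, the growth is happening below the Python allocator rather than in live Python objects.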


Script:

import sys, resource
import feedparser

print("    loop    maxrss")

for i in range(10 ** 3 + 1):
    with open(sys.argv[1], 'rb') as file:
        feedparser.parse(file)

    maxrss = (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    )

    if (i <= 10) or (i <= 100 and i % 10 == 0) or (i <= 1000 and i % 100 == 0):
        print(f"{i:>8}  {maxrss:>8.3f}")

Output:

```
macOS Catalina, Python 3.9.10, feedparser 6.0.8

2.2M feed
    loop    maxrss
       0    47.895
       1    50.555
       2    50.582
       3    50.613
       4    50.613
       5    50.613
       6    50.613
       7    50.625
       8    50.648
       9    50.656
      10    50.656
      20    50.727
      30    50.727
      40    50.727
      50    50.742
      60    50.758
      70    50.820
      80    50.820
      90    50.820
     100    50.820

52K feed
    loop    maxrss
       0    17.297
       1    17.484
       2    17.566
       3    17.645
       4    17.777
       5    17.836
       6    17.891
       7    17.949
       8    18.008
       9    18.094
      10    18.152
      20    18.172
      30    18.188
      40    18.242
      50    18.277
      60    18.285
      70    18.324
      80    18.336
      90    18.344
     100    18.352
     200    18.359
     300    18.387
     400    18.410
     500    18.438
     600    18.461
     700    18.461
     800    18.461
     900    18.465

macOS Catalina, Python 3.9.10, feedparser 6.0.8 + #302

2.2M feed
    loop    maxrss
       0    24.578
       1    24.578
       2    24.578
       3    24.578
       4    24.578
       5    24.578
       6    24.578
       7    24.578
       8    24.578
       9    24.578
      10    24.578
      20    24.578

52K feed
    loop    maxrss
       0    17.598
       1    17.723
       2    17.805
       3    17.918
       4    18.031
       5    18.117
       6    18.172
       7    18.230
       8    18.285
       9    18.340
      10    18.352
      20    18.383
      30    18.414
      40    18.426
      50    18.441
      60    18.453
      70    18.461
      80    18.492
      90    18.504
     100    18.508
     200    18.543
     300    18.543
     400    18.590
     500    18.590
     600    18.590
     700    18.598
     800    18.598
     900    18.598

Ubuntu 20.04, Python 3.8.10, feedparser 6.0.8

2.2M feed
    loop    maxrss
       0    42.988
       1    46.996
       2    46.996
       3    47.367
       4    47.367
       5    47.367
       6    47.367
       7    47.367
       8    47.367
       9    47.367
      10    47.367
      20    47.883
      30    47.883
      40    47.883
      50    47.883

52K feed
    loop    maxrss
       0    15.832
       1    16.090
       2    16.137
       3    16.188
       4    16.191
       5    16.191
       6    16.191
       7    16.191
       8    16.195
       9    16.195
      10    16.195
      20    16.227
      30    16.238
      40    16.246
      50    16.258
      60    16.320
      70    16.332
      80    16.395
      90    16.406
     100    16.406
     200    16.457
     300    16.457
     400    16.457
     500    16.457
     600    16.586
     700    16.586
     800    16.586
     900    16.586
    1000    16.586

Ubuntu 20.04, Python 3.8.10, feedparser 6.0.8 + #302

2.2M feed
    loop    maxrss
       0    20.566
       1    20.934
       2    20.934
       3    20.934
       4    20.934
       5    20.934
       6    20.934
       7    20.934
       8    20.934
       9    21.137
      10    21.137
      20    21.266
      30    21.266
      40    21.430
      50    21.430
      60    21.516
      70    21.516
      80    21.516
      90    21.516
     100    21.516

52K feed
    loop    maxrss
       0    16.355
       1    16.688
       2    16.715
       3    16.871
       4    16.898
       5    16.922
       6    16.922
       7    16.922
       8    16.926
       9    16.926
      10    16.926
      20    16.965
      30    16.977
      40    16.988
      50    16.996
      60    17.031
      70    17.043
      80    17.055
      90    17.062
     100    17.066
     200    17.066
     300    17.070
     400    17.070
     500    17.070
     600    17.070
     700    17.070
     800    17.070
     900    17.078
    1000    17.078
```

Rongronggg9 commented 2 years ago

Hi, @lemon24. Thanks for sharing.

I can confirm that your statement "I am not convinced this is a memory leak in feedparser" is true. BeautifulSoup(something, 'html.parser') (html.parser is written in pure Python) "leaks" in the same pattern as feedparser.parse(something), while BeautifulSoup(something, 'lxml') (lxml is written in C) "leaks" nothing. (Would feedparser adopting lxml as a parser backend help reduce the memory usage? Probably, lol.)

However, after confirming the previous statement, I did a deep dive. I believe that your statement "Python never releases allocated memory back to the operating system, but keeps it around and reuses it" is incorrect. Python does release unused memory, provided that it can. Fragmentation is what breaks this precondition, and it is a glibc malloc issue rather than a Python-specific one. By default, glibc malloc serves requests smaller than 128KB from the heap grown with sbrk instead of using mmap. A fragment still in use at a high address of the sbrk heap prevents the free memory at lower addresses from being returned to the OS, because that heap can only shrink from the top. Memory allocated by mmap, in contrast, is handed back to the OS independently as soon as it is freed, so it has no such disadvantage. What's worse, the threshold is dynamic nowadays and can be increased at runtime (up to 4*1024*1024*sizeof(long) on 64-bit systems!). The default malloc policy is actually a space-time tradeoff, since the mmap syscall is costly. That's the real reason for the "leakage", and it explains why CPython on Windows is not affected. It also explains why the feeds loaded into memory as strings can be released: most of them are larger than 128KB!
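
To put numbers on the thresholds mentioned above, here is a small worked calculation. It is only a sketch: the constant names are made up for illustration, and it assumes the 64-bit formula quoted above applies to the platform it runs on.

```python
import ctypes

# Default initial value of M_MMAP_THRESHOLD (128KB).
DEFAULT_MMAP_THRESHOLD = 128 * 1024

# Upper bound of the dynamic threshold on 64-bit systems, using the
# formula quoted above: 4*1024*1024*sizeof(long).
max_dynamic_threshold = 4 * 1024 * 1024 * ctypes.sizeof(ctypes.c_long)

print(f"initial threshold: {DEFAULT_MMAP_THRESHOLD // 1024} KiB")
print(f"dynamic maximum:   {max_dynamic_threshold // (1024 * 1024)} MiB")
```

On 64-bit Linux, where sizeof(long) is 8, the maximum works out to 32 MiB, so after the threshold has grown, even fairly large allocations can end up on the sbrk heap.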

In conclusion, your PR (#302) does help reduce the "leakage", but only to a limited extent. My final solution is shown below.


Prohibiting the usage of sbrk by setting M_MMAP_THRESHOLD to 0 eliminates the "leakage". This is just an experiment: do not set M_MMAP_THRESHOLD to such a low value in production, or you will face performance issues.

As a solution in production, 16384 (16KB) is a nice value for those concerned about the issue. Even the default initial value 131072 (128KB) helps a lot since setting the value of M_MMAP_THRESHOLD effectively disables its dynamic increment.

1. ctypes

+import ctypes

+libc = ctypes.cdll.LoadLibrary("libc.so.6")
+M_MMAP_THRESHOLD = -3
+libc.mallopt(M_MMAP_THRESHOLD, 0)  # effectively prohibit `sbrk`

import gc
import os
...
2022-05-27-01:35:17:INFO - Started! Memory usage: 54.39 MiB
2022-05-27-01:35:17:INFO - Feeds loaded into memory! Memory usage: 80.66 MiB
2022-05-27-01:35:17:INFO - would_leak_1 started! Memory usage: 80.66 MiB
2022-05-27-01:35:44:INFO - would_leak_1 finished! Memory usage: 84.94 MiB
2022-05-27-01:35:44:INFO - would_leak_1 garbage collected! Memory usage: 84.94 MiB
2022-05-27-01:35:44:INFO - would_leak_2 started! Memory usage: 84.94 MiB
2022-05-27-01:36:13:INFO - would_leak_2 finished! Memory usage: 85.52 MiB
2022-05-27-01:36:13:INFO - would_leak_2 garbage collected! Memory usage: 85.52 MiB
2022-05-27-01:36:13:INFO - Done! Memory usage: 85.52 MiB
2022-05-27-01:36:13:INFO - Feeds in memory cleared! Memory usage: 59.30 MiB
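
For production use, the same call can be wrapped defensively so that it does nothing on platforms without glibc. This is a hedged sketch, not part of feedparser: `set_mmap_threshold` is a hypothetical helper, and the 16384 default is simply the value suggested above. mallopt() returns 1 on success and 0 on error.

```python
import ctypes
import sys

M_MMAP_THRESHOLD = -3  # constant from glibc's <malloc.h>


def set_mmap_threshold(threshold_bytes=16384):
    """Try to set glibc's M_MMAP_THRESHOLD.

    Returns True if mallopt() reported success, False otherwise
    (non-Linux platform, libc.so.6 not found, or mallopt failed).
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL("libc.so.6")
        mallopt = libc.mallopt
    except (OSError, AttributeError):
        return False  # probably not glibc
    mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
    mallopt.restype = ctypes.c_int
    return mallopt(M_MMAP_THRESHOLD, threshold_bytes) == 1


if __name__ == "__main__":
    print("M_MMAP_THRESHOLD set:", set_mmap_threshold())
```

Calling it as early as possible at startup means fewer allocations have already been placed on the sbrk heap by the time the new threshold takes effect.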

2. Environment variables

Note: With this approach, even the initialization of Python is affected, so setting the value to 0 makes Python consume more memory during startup. Do not set MALLOC_MMAP_THRESHOLD_ to less than 8192 in production; staying at or above that value ensures that memory consumption will not be larger than in a vanilla execution and that performance is mostly unaffected.

$ MALLOC_MMAP_THRESHOLD_=0 python script.py
2022-05-27-01:52:03:INFO - Started! Memory usage: 72.52 MiB
2022-05-27-01:52:03:INFO - Feeds loaded into memory! Memory usage: 98.79 MiB
2022-05-27-01:52:03:INFO - would_leak_1 started! Memory usage: 98.79 MiB
2022-05-27-01:52:39:INFO - would_leak_1 finished! Memory usage: 102.91 MiB
2022-05-27-01:52:39:INFO - would_leak_1 garbage collected! Memory usage: 102.91 MiB
2022-05-27-01:52:39:INFO - would_leak_2 started! Memory usage: 102.91 MiB
2022-05-27-01:53:08:INFO - would_leak_2 finished! Memory usage: 103.58 MiB
2022-05-27-01:53:08:INFO - would_leak_2 garbage collected! Memory usage: 103.56 MiB
2022-05-27-01:53:08:INFO - Done! Memory usage: 103.56 MiB
2022-05-27-01:53:08:INFO - Feeds in memory cleared! Memory usage: 77.35 MiB
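
If the variable cannot be set in the shell (for example, when the program is started from another Python process), it can be passed to a child process instead. A minimal sketch, assuming a hypothetical launcher helper named `run_with_mmap_threshold`:

```python
import os
import subprocess
import sys


def run_with_mmap_threshold(argv, threshold_bytes=16384):
    """Run argv as a child process with MALLOC_MMAP_THRESHOLD_ set.

    glibc reads the variable when the child starts, so it has to be in
    the child's environment before launch. Returns the exit code.
    """
    env = dict(os.environ, MALLOC_MMAP_THRESHOLD_=str(threshold_bytes))
    return subprocess.run(argv, env=env, check=False).returncode


if __name__ == "__main__":
    # Example: the child just echoes the value it received.
    run_with_mmap_threshold(
        [sys.executable, "-c",
         "import os; print(os.environ['MALLOC_MMAP_THRESHOLD_'])"]
    )
```

Setting os.environ inside an already running interpreter is too late for that process, which is why the sketch launches a child.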

Ref:
https://stackoverflow.com/questions/68225871/python3-give-unused-interpreter-memory-back-to-the-os
https://stackoverflow.com/questions/15350477/memory-leak-when-using-strings-128kb-in-python
https://stackoverflow.com/questions/35660899/reduce-memory-fragmentation-with-malloc-mmap-threshold-and-malloc-mmap-max
https://man7.org/linux/man-pages/man3/mallopt.3.html

Rongronggg9 commented 2 years ago

A better workaround for multithreaded programs is to replace glibc's ptmalloc with jemalloc:
https://github.com/Rongronggg9/RSS-to-Telegram-Bot/commit/ae69f738cab53f21f4587272cfce5f22915182a6
https://github.com/Rongronggg9/RSS-to-Telegram-Bot/commit/eb07fa91f7ba9f49584f9effea1488bd0142d7b4

jemalloc shows impressive performance while maintaining a high memory recycling rate in multithreaded programs.
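
When swapping allocators, it helps to verify that the replacement is really in use. The sketch below is an assumption-laden check rather than an official API: it scans Linux's /proc/self/maps for well-known allocator library names, which would show up if the library was preloaded (for example via LD_PRELOAD, a common way to swap allocators). The function name and the list of names are illustrative.

```python
def loaded_allocator_libs(maps_path="/proc/self/maps"):
    """Return the alternative allocator libraries mapped into this process.

    Looks for well-known library names in the memory map. Returns an
    empty list if none are found or the file is not available.
    """
    candidates = ("jemalloc", "tcmalloc", "mimalloc")
    found = set()
    try:
        with open(maps_path) as maps:
            for line in maps:
                for name in candidates:
                    if name in line:
                        found.add(name)
    except OSError:
        pass  # not Linux, or /proc is not mounted
    return sorted(found)


if __name__ == "__main__":
    libs = loaded_allocator_libs()
    print("alternative allocators loaded:", libs or "none (default malloc)")
```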

I've changed the title of the issue and would like to keep it open to be a guide for those developers facing the same issue. It would be better if the issue could be documented in the docs.

My conclusion is that to "solve" the issue at the feedparser side, adopting lxml might be the best and easiest solution. For downstream developers, the two workarounds I've described are easy to adopt.