Closed lemon24 closed 3 years ago
Using this script:

```python
import sys, time, resource
from reader import *

workers = int(sys.argv[1])
reader = make_reader('db.sqlite')

for feed in reader.get_feeds():
    reader._storage.mark_as_stale(feed.url)

start = time.monotonic()
for _ in reader.update_feeds_iter(workers=workers):
    print('.', end='', file=sys.stderr)
end = time.monotonic()

print(file=sys.stderr)
print(
    workers,
    int(end - start),
    int(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    ),
)
```
... we see that maxrss grows more slowly than the number of workers; however, the increase is bigger on Linux than on macOS:
```
# workers  seconds  maxrss (MiB)

# macOS
1  195   81
2   93   85
4   67   87
8   44   94

# Linux
1  109   73
2   58   88
4   36  114
8   32  136
```
Update: There's a nice improvement in maxrss on Linux after b044add (the slight increase in time may or may not be because of the change):
```
# workers  seconds  maxrss (MiB)

# Linux
1  109  72
2   52  81
4   40  75
8   39  88
```
So, bac925f is a cleaned-up version of the prototype (b044add).
I initially wanted to make Parser.retrieve() just return a value instead of being a context manager, but response.close() does more than response.raw.close(). While just returning is more convenient, the context manager is more correct (assuming there are unknown resources that need to be cleaned up).
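To make the trade-off concrete, here's a minimal, self-contained sketch of the context-manager shape. FakeResponse is a stand-in for requests.Response, and this retrieve() is illustrative, not reader's actual code:

```python
import io
from contextlib import contextmanager

class FakeResponse:
    """Stand-in for a requests.Response; illustrative only."""
    def __init__(self, data):
        self.raw = io.BytesIO(data)
        self.closed = False

    def close(self):
        # the real response.close() does more than raw.close():
        # it also releases the connection back to the pool
        self.raw.close()
        self.closed = True

@contextmanager
def retrieve(response):
    try:
        # hand the open stream to the caller (parse() reads from it)
        yield response.raw
    finally:
        # cleanup happens even if parse() raises
        response.close()

resp = FakeResponse(b'<rss/>')
with retrieve(resp) as stream:
    data = stream.read()
```

The point of the context manager is the finally block: cleanup runs even when parsing fails, which a plain return value can't guarantee.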
More testing is needed; specifically, the tests in test_parser.py don't cover the (fairly complicated) interaction between parallel retrieve() and parse() in Reader._update_feeds().
Here are some measurements for bac925f (I re-ran the "before" case as well, because the internet changed slightly):
macOS:

```
$ cat before.txt
1 167 81
2  76 82
4  51 86
8  44 95
$ cat after.txt
1 168 81
2  68 83
4  53 86
8  51 89
```

Linux:

```
$ cat before.txt
1 107  77
2  53  87
4  36 109
8  32 150
$ cat after.txt
1  99 76
2  51 81
4  40 84
8  39 92
```
Update: If instead of just __enter__()-ing the retrieve context manager we read the feed into a temporary file, the run time gets back to where it was in before.txt, but with only a slight maxrss increase over after.txt:
```
# macOS
8 42 85

# Linux
8 32 99
```
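The temporary-file idea can be sketched like this (a guess at the approach, not the actual change; the helper name is mine): drain the response body to disk inside retrieve, so the socket can be closed before parse() ever runs and each worker holds a file descriptor instead of buffered socket data:

```python
import io
import shutil
import tempfile

def retrieve_to_tempfile(stream, chunk_size=2**18):
    """Drain an open stream into an unnamed temporary file and
    return it seeked to the start; the source stream (socket)
    can then be closed before parse() runs."""
    temp = tempfile.TemporaryFile()
    shutil.copyfileobj(stream, temp, chunk_size)
    temp.seek(0)
    return temp

# usage, with an in-memory stand-in for the response stream
src = io.BytesIO(b'feed data')
f = retrieve_to_tempfile(src)
```
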
FWIW, I looked a bit at why the run time improvements seem to taper off after 4 workers (profile at the end of the comment; note it covers only the main thread).
As one might expect, it's because most of the time is spent parsing feeds. Expat is implemented in C, so we might see an improvement from running parse() in parallel, but only up to the number of cores (obviously, this will likely bring the memory issues back).
Update: No, running parse() in parallel does not make things better, at least not on CPython. Note that 45 of the 65 seconds are spent in feedparser.parsers.strict, which has pure-Python callbacks that the (C) XML parser calls.
```
84841721 function calls (84657932 primitive calls) in 70.759 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
 371/1    0.012    0.000   71.025   71.025  {built-in method builtins.exec}
     1    0.003    0.003   71.025   71.025  261.py:1(<module>)
   158    0.001    0.000   70.259    0.445  core.py:790(update_feeds_iter)
   158    0.033    0.000   70.211    0.444  core.py:902(_update_feeds)
   157    0.001    0.000   65.678    0.418  core.py:997(parse)
   156    0.020    0.000   65.640    0.421  _parser.py:183(parse)
   156    0.011    0.000   65.607    0.421  _parser.py:462(__call__)
   156    0.012    0.000   65.216    0.418  api.py:152(parse)
   156    0.002    0.000   63.579    0.408  expatreader.py:103(parse)
   156    0.008    0.000   63.571    0.408  xmlreader.py:115(parse)
   815    0.003    0.000   63.548    0.078  expatreader.py:206(feed)
   815    1.438    0.002   63.543    0.078  {method 'Parse' of 'pyexpat.xmlparser' objects}
 81169    0.164    0.000   45.047    0.001  expatreader.py:372(end_element_ns)
 81169    0.231    0.000   44.833    0.001  strict.py:108(endElementNS)
 85318    0.427    0.000   44.616    0.001  mixin.py:302(unknown_endtag)
 85163    1.684    0.000   42.976    0.001  mixin.py:465(pop)
 21717    0.079    0.000   40.774    0.002  mixin.py:627(pop_content)
 20143    0.097    0.000   36.818    0.002  html.py:146(feed)
 40286    3.992    0.000   36.005    0.001  sgmllib.py:110(goahead)
...
```
Here are some final measurements I did on Linux:
```
# workers  seconds  maxrss (MiB)

# reader==2.5
1  100   89
2   51   98
4   33  123
8   29  150

# TemporaryFile()
1  100   88
2   64   80
4   32   99
8   28  110

# SpooledTemporaryFile(2**19)  (feed size: p50 = 126 KiB, p90 = 625 KiB)
1   99   87
2   46   97
4   32  103
8   28  114

# response.raw
1   99   91
2   47   94
4   36   97
8   36  102
```
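The SpooledTemporaryFile variant is the same idea with a memory threshold: feeds under max_size stay entirely in memory, bigger ones roll over to disk automatically. A sketch (the threshold follows the measured feed sizes above; the helper name is mine):

```python
import io
import shutil
import tempfile

def buffer_feed(stream, max_memory=2**19):
    """Buffer a feed in memory up to max_memory (512 KiB here,
    comfortably above the p50 = 126 KiB feed size), spilling
    larger feeds to an actual temporary file on disk."""
    temp = tempfile.SpooledTemporaryFile(max_size=max_memory)
    shutil.copyfileobj(stream, temp)
    temp.seek(0)
    return temp

small = buffer_feed(io.BytesIO(b'x' * 1024))  # stays in memory
```
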
On my tiny instance I've been getting OOM kills during update, even with just 4 workers.

When running with multiple workers, the thing that is parallelized for each feed is retrieve + parse. Of the two, retrieve likely takes most of the time, since it usually implies network calls. Also, based on this gist, we know parse takes a lot of memory.
If we want to keep memory usage low, it may make sense to only parallelize retrieve, and do parse serially.
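A sketch of that shape (illustrative, not reader's implementation): fan retrieve() out to a thread pool, but call parse() only on the calling thread, so at most one parse is consuming memory at a time:

```python
from concurrent.futures import ThreadPoolExecutor

def update_all(urls, retrieve, parse, workers=4):
    """Retrieve in parallel (I/O-bound), parse serially on the
    calling thread (CPU- and memory-heavy)."""
    with ThreadPoolExecutor(workers) as executor:
        # executor.map() yields results in input order; note that
        # retrieved-but-unparsed responses can still accumulate
        # if parsing is slower than retrieval
        for url, data in zip(urls, executor.map(retrieve, urls)):
            yield url, parse(data)

# toy retrieve/parse stand-ins, just to show the flow
results = list(update_all(['a', 'b'], str.upper, lambda d: d + '!'))
```
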
In theory, parse may do network calls as well, since we're passing it the response (an open socket). However, I expect that by the time the headers are read (which happens in retrieve), (most of) the rest of the feed is already in the OS buffers somewhere; if not, we could always wrap the stream in something like io.BufferedReader and peek an entire buffer before returning it from retrieve (update: BufferedReader specifically does not work like this).
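The BufferedReader caveat can be demonstrated: peek() performs at most one read on the raw stream, so it can return far less than a full buffer. A standalone demonstration with a fake raw stream (not reader code):

```python
import io

class DribbleRaw(io.RawIOBase):
    """Raw stream that returns at most 4 bytes per read,
    simulating a socket that dribbles data in."""
    def __init__(self, data):
        self.data = data
        self.pos = 0

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self.data[self.pos:self.pos + min(len(b), 4)]
        b[:len(chunk)] = chunk
        self.pos += len(chunk)
        return len(chunk)

buf = io.BufferedReader(DribbleRaw(b'0123456789abcdef'), buffer_size=16)
# peek() does at most one raw read, so we get only 4 bytes here,
# not a full 16-byte buffer -- it can't pre-read "an entire buffer"
peeked = buf.peek()
```
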