lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Don't parse feeds in parallel #261

Closed · lemon24 closed this 2 years ago

lemon24 commented 2 years ago

On my tiny instance I've been getting OOM kills during update, even with just 4 workers.

When running with multiple workers, the thing that is parallelized for each feed is retrieve + parse. Of these two, the retrieve part likely takes the most time, since it usually implies network calls. Also, based on this gist, we know parse takes a lot of memory.

If we want to keep memory usage low, it may make sense to only parallelize retrieve, and do parse serially.

In theory, parse may do network calls as well, since we're passing it the response (an open socket). However, I expect that by the time the headers are read (which happens in retrieve), (most of) the rest of the feed is already in the OS buffers somewhere; if not, we could always wrap the stream in something like io.BufferedReader and peek an entire buffer before returning it from retrieve (update: BufferedReader specifically does not work like this).
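
To make the idea concrete, here's a rough sketch (with made-up names, not reader's actual internals): run retrieve() for several feeds at a time in a thread pool, but call parse() on each result one at a time in the main thread, so at most one feed is being parsed (and held in memory) at any moment.

# Sketch only; retrieve() and parse() stand in for reader's Parser internals.
from concurrent.futures import ThreadPoolExecutor

def update_all(urls, retrieve, parse, workers=4):
    # retrieve() is I/O-bound (network), so it benefits from threads;
    # parse() is CPU- and memory-heavy, so it stays in the main thread.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map() runs at most `workers` retrieves concurrently and
        # yields their results in input order; parse() runs serially here.
        for url, result in zip(urls, executor.map(retrieve, urls)):
            yield url, parse(url, result)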

lemon24 commented 2 years ago

Using this script:

import sys, time, resource
from reader import *

# usage: python <script> WORKERS
workers = int(sys.argv[1])

reader = make_reader('db.sqlite')

# Mark every feed as stale, so the update below re-retrieves and re-parses all of them.
for feed in reader.get_feeds():
    reader._storage.mark_as_stale(feed.url)

start = time.monotonic()
for _ in reader.update_feeds_iter(workers=workers):
    print('.', end='', file=sys.stderr)
end = time.monotonic()
print(file=sys.stderr)

# Print workers, wall-clock seconds, and maxrss in MiB
# (ru_maxrss is in bytes on macOS, but in KiB on Linux).
print(
    workers,
    int(end - start),
    int(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    ),
)

... we see that maxrss increases sub-linearly with the number of workers; however, the increase is bigger on Linux than on macOS:

# workers seconds maxrss(mib)
# macos
1 195 81
2 93 85
4 67 87
8 44 94
# linux
1 109 73
2 58 88
4 36 114
8 32 136

Update: There's a nice improvement in maxrss on Linux after b044add (the slight increase in time may or may not be because of the change):

# workers seconds maxrss(mib)
# linux
1 109 72
2 52 81
4 40 75
8 39 88

lemon24 commented 2 years ago

So, bac925f is a cleaned-up version of the prototype (b044add).

I initially wanted to make Parser.retrieve() just return a value instead of being a context manager, but response.close() does more than response.raw.close(). While just returning is more convenient, the context manager is more correct (assuming there are unknown resources that need to be cleaned up).
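
For illustration, a context-manager-shaped retrieve() could look roughly like this (a hypothetical sketch assuming a requests session, not the actual bac925f code); the point is that response.close() is guaranteed to run even if the caller never consumes the body:

# Hypothetical sketch, not reader's actual Parser.retrieve().
from contextlib import contextmanager

@contextmanager
def retrieve(url, session):
    # stream=True returns as soon as the headers arrive; the body stays
    # on the open socket until someone reads response.raw.
    response = session.get(url, stream=True)
    try:
        yield response.raw
    finally:
        # response.close() does more cleanup than response.raw.close()
        # (e.g. it releases the connection back to the pool).
        response.close()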

More testing is needed; in particular, the tests in test_parser.py don't cover the (fairly complicated) interaction between parallel retrieve() and parse() in Reader._update_feeds().

lemon24 commented 2 years ago

Here are some measurements for bac925f (I re-ran the before measurements as well, because the feeds on the internet changed slightly in the meantime):

macOS:

$ cat before.txt 
1 167 81
2 76 82
4 51 86
8 44 95
$ cat after.txt 
1 168 81
2 68 83
4 53 86
8 51 89

Linux:

$ cat before.txt 
1 107 77
2 53 87
4 36 109
8 32 150
$ cat after.txt 
1 99 76
2 51 81
4 40 84
8 39 92

Update: If, instead of just __enter__()-ing the retrieve context manager, we read the feed into a temporary file, the run time gets back to where it was in before.txt, with only a slight maxrss increase over after.txt (a rough sketch of the idea follows the numbers below):

# macOS
8 42 85
# Linux
8 32 99
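
A rough sketch of the temporary file variant (assumed shape, not the committed code): the worker thread drains the response into a temp file before handing it over, so the serial parse step no longer waits on the network. Shown with SpooledTemporaryFile so small feeds stay in memory; a plain TemporaryFile() works the same way, just always on disk.

# Sketch: drain the HTTP response into a (spooled) temporary file.
import shutil, tempfile

def drain_to_tempfile(response, spool_size=2**19):
    # SpooledTemporaryFile keeps feeds up to spool_size in memory and
    # spills larger ones to disk.
    file = tempfile.SpooledTemporaryFile(spool_size)
    shutil.copyfileobj(response.raw, file)
    response.close()
    file.seek(0)
    return file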

lemon24 commented 2 years ago

FWIW, I looked a bit at why the run time improvements seem to taper off after 4 workers (profile at the end of the comment; note it only covers the main thread).

As one might expect, it's because most of the time is spent parsing feeds. Expat is implemented in C, so we might see an improvement from running parse() in parallel, but only up to the number of cores (obviously, this will likely bring the memory issues back).

Update: No, running parse() in parallel does not make things better, at least not on CPython: 45 of the 65 seconds are spent in feedparser.parsers.strict, whose pure-Python callbacks are invoked by the (C) XML parser, and therefore still hold the GIL.

         84841721 function calls (84657932 primitive calls) in 70.759 seconds

   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    371/1    0.012    0.000   71.025   71.025 {built-in method builtins.exec}
        1    0.003    0.003   71.025   71.025 261.py:1(<module>)
      158    0.001    0.000   70.259    0.445 core.py:790(update_feeds_iter)
      158    0.033    0.000   70.211    0.444 core.py:902(_update_feeds)
      157    0.001    0.000   65.678    0.418 core.py:997(parse)
      156    0.020    0.000   65.640    0.421 _parser.py:183(parse)
      156    0.011    0.000   65.607    0.421 _parser.py:462(__call__)
      156    0.012    0.000   65.216    0.418 api.py:152(parse)
      156    0.002    0.000   63.579    0.408 expatreader.py:103(parse)
      156    0.008    0.000   63.571    0.408 xmlreader.py:115(parse)
      815    0.003    0.000   63.548    0.078 expatreader.py:206(feed)
      815    1.438    0.002   63.543    0.078 {method 'Parse' of 'pyexpat.xmlparser' objects}
    81169    0.164    0.000   45.047    0.001 expatreader.py:372(end_element_ns)
    81169    0.231    0.000   44.833    0.001 strict.py:108(endElementNS)
    85318    0.427    0.000   44.616    0.001 mixin.py:302(unknown_endtag)
    85163    1.684    0.000   42.976    0.001 mixin.py:465(pop)
    21717    0.079    0.000   40.774    0.002 mixin.py:627(pop_content)
    20143    0.097    0.000   36.818    0.002 html.py:146(feed)
    40286    3.992    0.000   36.005    0.001 sgmllib.py:110(goahead)
...
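
For reference, a main-thread-only profile like the one above can be collected along these lines (a sketch, not necessarily the exact invocation used):

# cProfile only records the thread that enables it, so the worker threads'
# retrieve() time doesn't show up; only the main thread's work (parse) does.
import cProfile, pstats
from reader import make_reader

reader = make_reader('db.sqlite')

profiler = cProfile.Profile()
profiler.enable()
for _ in reader.update_feeds_iter(workers=8):
    pass
profiler.disable()

pstats.Stats(profiler).sort_stats('cumulative').print_stats(30)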

lemon24 commented 2 years ago

Here are some final measurements I did on Linux (same columns as above: workers, seconds, maxrss in MiB):

reader==2.5 
1 100 89
2 51 98
4 33 123
8 29 150

TemporaryFile()
1 100 88
2 64 80
4 32 99
8 28 110

SpooledTemporaryFile(2**19) (feed size: p50 = 126 KiB, p90 = 625 KiB)
1 99 87
2 46 97
4 32 103
8 28 114

response.raw
1 99 91
2 47 94
4 36 97
8 36 102