htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Reducing memory consumption in the core wordcount loop #42

Open bmschmidt opened 3 years ago

bmschmidt commented 3 years ago

Running this in an HPC environment, I keep getting jobs killed for exceeding their memory allocation. I'm allocating 1-2 GB per process, which seems like it should be enough to parse one book at a time. It's a little tricky to pin down exactly where it happens, but it seems to hit on longer books as a result of memory overuse in the SRP library and here. My guess is that the culprit is the _make_tokencount_df function, which is optimized for speed but also seems to allocate a lot of memory.

Taking as an example a 900-page book that mixes Latin and Greek on most pages:

a_long_book = Volume("njp.32101010942850")
a_long_book.tokenlist().memory_usage()

gives something like 17 MB in the pandas frame, and the arrow method writes it to disk uncompressed at 22 MB.
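As an aside, pandas' memory_usage() counts only the 8-byte pointers for object columns; passing deep=True also counts the string objects behind them, so a fuller baseline for comparison (assuming the same volume) would be something like:

tl = a_long_book.tokenlist()
print(tl.memory_usage(deep=True).sum() / 1e6, "MB")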

The big arr allocation here takes about 260 MB. I fear it might get doubled or tripled inside df = pd.DataFrame(arr[:i]) later on.
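A rough way to check whether that doubling actually happens (not part of the library, just a quick peak-memory probe; recent numpy versions report their allocations to tracemalloc):

import tracemalloc

tracemalloc.start()
tl = a_long_book.tokenlist()  # exercises the tokencount path
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.0f} MB, peak: {peak / 1e6:.0f} MB")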

%%timeit

import logging
import numpy as np
import pandas as pd

tname = 'tokenPosCount'

pages = a_long_book.parser._pages
m = sum([page['tokenCount'] for page in pages])
# Pre-allocate one structured row per token occurrence in the whole volume.
arr = np.zeros(m, dtype=[('page', 'u8'),
                         ('section', 'U6'),
                         ('token', 'U64'),
                         ('pos', 'U6'),
                         ('count', 'u4')])
i = 0

for page in pages:
    for sec in ['header', 'body', 'footer']:
        if page[sec] is None:
            continue
        for token, posvalues in page[sec][tname].items():
            for pos, value in posvalues.items():
                arr[i] = (page['seq'], sec, token, pos, value)
                i += 1
                if i > m + 1:
                    logging.error("This volume has more token info "
                                  "than the internal representation "
                                  "allows. Email organisciak@gmail.com "
                                  "to let the library author know!")
df = pd.DataFrame(arr[:i])

That huge 64-character unicode field (which, since numpy stores unicode as UTF-32, takes 256 bytes per row) is the culprit.

802 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
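A quick sanity check on the arithmetic (just the dtype, not library code):

import numpy as np

row = np.dtype([('page', 'u8'), ('section', 'U6'),
                ('token', 'U64'), ('pos', 'U6'), ('count', 'u4')])
print(row['token'].itemsize)  # 256 bytes: 64 characters at 4 bytes each (UTF-32)
print(row.itemsize)           # 316 bytes per row in total

At a bit over 300 bytes per row, the ~260 MB above corresponds to roughly 800,000 pre-allocated rows.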

Change

Using the same strategy but deferring type coercion until after the arrays are built, and handling it through pyarrow rather than numpy, speeds things up while also avoiding the massive memory allocation. (Or maybe it's faster because it avoids the over-allocation?)

%%timeit

import pyarrow as pa

tname = 'tokenPosCount'
pages = a_long_book.parser._pages

pagenums, sections, tokens, poses, counts = [], [], [], [], []

for page in pages:
    pname = page['seq']
    for sec in ['header', 'body', 'footer']:
        if page[sec] is None:
            continue
        for token, posvalues in page[sec][tname].items():
            for pos, value in posvalues.items():
                pagenums.append(pname)
                sections.append(sec)
                tokens.append(token)
                poses.append(pos)
                counts.append(value)

tb = pa.table(
    {
      'page': pa.array(pagenums, pa.uint64()),
      'section': pa.array(sections, pa.utf8()),
      'token': pa.array(tokens),
      'pos': pa.array(poses),
      'count': pa.array(counts, pa.uint32())
    })

df2 = tb.to_pandas()

661 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It's also about 33% faster on short texts.

I'll push a change that includes this improvement without the pyarrow dependency and one or two other little tweaks.

organisciak commented 3 years ago

Yeah, that was the fastest code back in 2013 or whenever, when I wrote and profiled umpteen different versions of that module. That wacky code was the secret sauce for why this library ran faster than custom solutions, but it's definitely time to update it!

to_pandas may be faster because it uses multithreading, and it's supposedly faster than a direct DataFrame init anyway. That's fine, but it may be obscuring optimizations that we're leaving on the table. That list-appending approach is one of the ones I profiled way back, and though Python's newest versions may behave differently, it didn't hold a candle to using numpy with pre-allocated memory. Have you tried running your code with numpy arrays -> pyarrow arrays -> DataFrame, rather than incrementally sized lists cast to pyarrow arrays?

This post seems to suggest it would work well: Ultrafast pandas DataFrame loading from Apache Arrow (Ursa Labs): https://ursalabs.org/blog/fast-pandas-loading/. For that matter, for comparison's sake, what is the performance if you start from the original code but simply run it through pyarrow (e.g. pa.table(arr[:i]).to_pandas()), as that post suggests? (Corollary: does that avoid your memory issue?)
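For what it's worth, a sketch of that route, reusing arr and i from the pre-allocating loop above; pa.table() wants a mapping of column names to arrays, so the structured array gets split into fields rather than passed in whole:

import pyarrow as pa

trimmed = arr[:i]
tb = pa.table({name: trimmed[name] for name in trimmed.dtype.names})
# the numpy unicode columns are converted (copied) into Arrow string arrays here
df = tb.to_pandas()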

By the way, I've never run into a memory error on any system that I've used - have you checked on other systems? It is odd for a function that pre-allocates memory to fail on memory only on some books. Do you have a specific volume that you can send for debugging? Or maybe there's an odd quirk somewhere in your install, like a compile error that fell back to something inefficient. Did you install numpy via conda? I haven't tried installing it via pip in half a decade, so things may have changed, but in the past an optimal SciPy-stack install was near impossible via pip, and I'm still too scarred to try.

Finally, it may be sensible to consider a 'low_memory' flag, which runs a slower version of the code that skips the 'pre-allocate a big array, then truncate it' approach (if, after switching to pyarrow, that is still the faster approach). Since I've never run into the issue on any computer, including Chromebooks, small VMs, and whatever duct-tape-and-cardboard machines workshop attendees have thrown at it, I'd prefer a speed-first approach that a user can turn off if necessary. The other possibility is pre-allocating a smaller array and taking the performance hit of resizing it only for the books that are actually long enough to overflow it; see the sketch below.
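A minimal sketch of that second option, with a made-up starting size and a hypothetical helper name (nothing here is existing library code):

import numpy as np

row_dtype = np.dtype([('page', 'u8'), ('section', 'U6'),
                      ('token', 'U64'), ('pos', 'U6'), ('count', 'u4')])

def fill_rows(rows, initial=50_000):
    """Start small and double the buffer only when a long book overflows it."""
    arr = np.zeros(initial, dtype=row_dtype)
    i = 0
    for row in rows:
        if i >= len(arr):
            bigger = np.zeros(len(arr) * 2, dtype=row_dtype)  # one copy per doubling
            bigger[:i] = arr
            arr = bigger
        arr[i] = row
        i += 1
    return arr[:i]

# toy usage with two dummy rows
demo = fill_rows([(1, 'body', 'the', 'DT', 3), (1, 'body', 'cat', 'NN', 1)])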

bmschmidt commented 3 years ago

In terms of memory: after pushing this fix, I started getting the same thing later in the stack trace, inside the SRP module. The debug output gets a bit further now, so I think it's possible this code was triggering the OOM errors, but I'm not completely sure.

As for why an error here: mostly because I'm trying to find the floor on memory requirements to please the job scheduler. I'm sure it would run fine with 3 GB of RAM or so. But there's no reason this library shouldn't work in 500-800 MB of RAM. On NYU's HPC cluster a job gets killed for exceeding its memory allocation for even a second, while on most PCs I think it would just dip into swap for one volume in a thousand and you wouldn't notice the performance hit.

Note on strategies in a bit.

Edit:

Oh, in terms of

It is odd for a function that pre-allocates memory to fail on memory only on some books

that's because the pre-allocated array is scaled to shape m = sum([page['tokenCount'] for page in pages]).

This is a book with almost 1,000,000 words, so the token field alone allocates 64 * 1e6 * UNICODE_POINT_SIZE bytes. numpy stores its unicode dtype as UTF-32 internally (4 bytes per code point), so that comes to about 256 MB of memory, which just about perfectly explains the memory use here.

bmschmidt commented 3 years ago

That McKinney blog post is interesting. It looks as though the internal numpy block your old code allocated was working to get ahead of exactly the internal pandas allocation problem he mentions. I had switched the backend code back to pandas from pyarrow to avoid a pyarrow dependency, since it's about as fast, but it turns out that here too it's faster to construct the pyarrow table and then convert to pandas. Are you OK with adding pyarrow as a core dependency of this library?

Since posting this I've sped it up a little more by switching to the version below: bulk-filling numpy arrays for the long, repetitive columns (page and section), but keeping the list appends for token, pos, and count. Gets it a little faster yet.

import numpy as np

tname = 'tokenPosCount'
pages = a_long_book.parser._pages

tokens, poses, counts = [], [], []
m = sum([page['tokenCount'] for page in pages])
pagenums = np.zeros(m, np.uint64)
sections = np.zeros(m, 'U6')
i = 0
for page in pages:
    page_start = i
    pname = page['seq']
    for sec in ['header', 'body', 'footer']:
        section_start = i
        if page[sec] is None:
            continue
        for token, posvalues in page[sec][tname].items():
            for pos, value in posvalues.items():
                i += 1
                tokens.append(token)
                poses.append(pos)
                counts.append(value)
        # fill the section's whole slice at once instead of appending per row
        sections[section_start:i].fill(sec)
    # same bulk fill for the page number column
    pagenums[page_start:i].fill(pname)

Preallocating tokens in the same way (into a numpy unicode array) and then assigning in the inner loop slows things down. (I suspect because you end up converting `str` to `np.str` to `pa.utf8`. Pre-allocating the list in Python vs. building it on the fly doesn't seem to make a difference.)

I suspect there's some more water to wring out of the stone on packing the tokens into an array. (Rewrite the JSON parser so that it doesn't decode the token names from UTF-8, and then we can ship the bytes straight into pyarrow without ever decoding unicode at all!) But I think it might be worth waiting a bit on the pyarrow ecosystem first. Will file a roadmap on that as a separate issue.
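To illustrate the idea (made-up bytes, not library code): raw UTF-8 bytes can go into an Arrow binary array and be cast to a string column, which validates the UTF-8 without ever building Python str objects:

import pyarrow as pa

# pretend these came straight off the JSON parser without .decode('utf-8')
raw_tokens = ['λόγος'.encode('utf-8'), b'word']
col = pa.array(raw_tokens, pa.binary()).cast(pa.utf8())  # UTF-8 is validated during the cast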

organisciak commented 3 years ago

Yup, please add pyarrow as a dependency.