htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Provide arrow_counts method on volume to bypass pandas #39

Open bmschmidt opened 3 years ago

bmschmidt commented 3 years ago

Loading from parquet or feather into pandas to create indices is time-consuming for tasks where you don't actually want the data in pandas (e.g., passing counts straight into tensorflow or numpy).

Using a basic benchmark of loading 52 random volumes and summing the count column, here's a comparison of two methods, with total times in seconds. Reading straight into Arrow format with pyarrow and summing word counts is almost 10x faster. Feather-based approaches can be another order of magnitude faster still, but that's probably because they avoid Unicode decoding altogether while parquet has to unpack it. In real life, you have to do the Unicode work.

I'd propose that this method be a bit less user-oriented than the pandas ones: no support for lowercasing, etc. Just a basic wrapper that pulls out some columns, with any further computation done elsewhere.
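For a sense of what bypassing pandas looks like, here's a minimal sketch using pyarrow's read_table (the file name is just a placeholder):

    import pyarrow.parquet as pq

    # Placeholder path; read just two columns into an Arrow table,
    # then hand the counts to numpy without a pandas step.
    table = pq.read_table('volume.tokens.parquet', columns=['token', 'count'])
    counts = table['count'].to_numpy()  # ready for numpy/tensorflow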

METHOD A

      from htrc_features import Volume

      v = Volume(id, id_resolver=resolver)  # id: HTRC volume ID from the benchmark loop
      _ = v.tokenlist()['count'].sum()      # builds the full pandas tokenlist just to sum one column
| format | compression | time (s) |
| --- | --- | --- |
| feather | zstd | 4.28 |
| feather | lz4 | 4.42 |
| feather | None | 3.94 |
| parquet | snappy | 4.22 |
| parquet | gzip | 4.16 |
| parquet | brotli | 4.15 |
| parquet | lz4 | 4.41 |
| parquet | zstd | 4.24 |

METHOD B

        import pyarrow.compute as pc  # Arrow compute kernels

        # Arrow path: read only the needed columns, skip pandas entirely.
        z = read_parquet_or_feather(path, columns=['token', 'count'])
        vol_sum = pc.sum(z['count']).as_py()  # None for an empty table
        if vol_sum:
            total += vol_sum
| format | compression | time (s) | tokens counted | vols |
| --- | --- | --- | --- | --- |
| feather | zstd | 0.103 | 9,222,235 | 52 |
| feather | lz4 | 0.068 | 9,222,235 | 52 |
| feather | gz | 0.007 | 0 | 0 |
| feather | None | 0.035 | 9,222,235 | 52 |
| parquet | snappy | 0.439 | 9,222,235 | 52 |
| parquet | gzip | 0.549 | 9,222,235 | 52 |
| parquet | brotli | 0.544 | 9,222,235 | 52 |
| parquet | lz4 | 0.410 | 9,222,235 | 52 |
| parquet | zstd | 0.402 | 9,222,235 | 52 |

That is: roughly a 10x speedup for parquet, and up to two orders of magnitude for feather.
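(read_parquet_or_feather above is a local helper; roughly, it could look like this, dispatching on file extension to pyarrow's read_table functions:)

    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    def read_parquet_or_feather(path, columns=None):
        # Rough sketch of the helper used in Method B: return a
        # pyarrow.Table with only the requested columns.
        if str(path).endswith('.parquet'):
            return pq.read_table(path, columns=columns)
        return feather.read_table(path, columns=columns)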

organisciak commented 3 years ago

This could be a function added to the utils module, perhaps?

bmschmidt commented 3 years ago

I think it makes more sense as a volume method? It's just a different kind of tabular format function, and as valid for json-backed files as anything else.

organisciak commented 3 years ago

In your example, you're deliberately not instantiating a volume, right?

It could work in Volume. I'd love it if the documentation were clear that it's for advanced users who only want that count out of the files. The reason: if they run Volume.arrow_counts() and then ask for any more advanced token info, they'll still end up instantiating and caching the tokenlist. Since the primary purpose of this library is scaffolding between EF and pandas, I expect that wouldn't be an uncommon situation.

Maybe call it fast_count()? If I saw that method, I'd understand what it does, then read the docs to figure out what the catch is 😄

bmschmidt commented 3 years ago

Yeah, deliberately avoiding instantiating a volume there, just because that's the prototype for the method.

Happy with fast_count; another naming option might be raw_counts?

To be clear, it wouldn't be just for counts; it can return Arrow tables with any desired columns. With a parquet backend, it would perform:

    read_parquet_or_feather(self.path, columns=['page', 'section', 'token', 'pos', 'count'])

and with a json backend, it would do

    self._make_tokencounts_df(arrow=True).select(['page', 'section', 'token', 'pos', 'count'])

Either of those would be faster than self.tokencounts_df().
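Sketching it out (the method name is still TBD, and self.format / self.path here are assumptions about the Volume internals rather than confirmed attributes):

    def arrow_counts(self, columns=None):
        # Proposed method sketch: return raw counts as a pyarrow.Table,
        # never building a pandas frame along the way.
        columns = columns or ['page', 'section', 'token', 'pos', 'count']
        if self.format in ('parquet', 'feather'):
            # file-backed volumes: read only the requested columns
            return read_parquet_or_feather(self.path, columns=columns)
        # json-backed volumes: parse counts straight into Arrow
        return self._make_tokencounts_df(arrow=True).select(columns)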