htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

pyarrow roadmap #43

Open bmschmidt opened 3 years ago

bmschmidt commented 3 years ago

Breaking out of https://github.com/htrc/htrc-feature-reader/issues/42.

With parquet backing, this library is fast--fast enough, I think, to sometimes feed straight into tensorflow without any intermediaries using the tf arrow Dataset strategy.

Currently a lot of the roadblocks I'm seeing are around pandas--the indexes, and pandas just being a little slow at this point. As @organisciak pointed out in that issue, _make_tokencount_df is faster going through arrow to pandas than building the frame in pandas directly. (18ms vs. 20ms on one book I'm testing with.)

But it's just 13ms if you simply stay as a pyarrow table, which is a better fit for many applications anyway.

So for the next version (3.0), I'd suggest moving internal data transfer in this module entirely to pyarrow and making indexed pandas frames just a user-level convenience. Most of the work behind the chunking and tokencount dataframes can already be done in pyarrow right now.

I would code this up right now except that there's one place we still need pandas, which is for the frequent grouped sums (e.g., df.groupby(['page', 'token']).sum()). There are a bunch of pyarrow-based groupby implementations out there--https://github.com/TomScheffers/pyarrow_ops/tree/main/pyarrow_ops, https://github.com/TomScheffers/wombat, the https://github.com/jorgecarleitao/datafusion-python frontend to the Rust-based https://docs.rs/datafusion/3.0.0/datafusion/, etc.--but none have wide adoption, and it seems possible that pyarrow will ship a C++-based partial SQL implementation that will be the natural fit. (The pyarrow.compute portion of the pyarrow suite already does some of the things this library uses, like lowercasing strings.) In six months or a year, I think this landscape will be a bit clearer, and it could speed up a lot of operations here by 50% or more.
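For concreteness, here's a rough sketch (with made-up data) of the boundary I mean: the lowercasing can already stay in arrow via pyarrow.compute, but the grouped sum is the step that still pulls us back into pandas.

    import pyarrow as pa
    import pyarrow.compute as pc

    # Made-up token counts in roughly the shape this library works with.
    tokens = pa.table({
        "page": [1, 1, 1, 2],
        "token": ["The", "the", "whale", "whale"],
        "count": [1, 2, 1, 3],
    })

    # Lowercasing stays in arrow.
    tokens = tokens.set_column(
        tokens.schema.get_field_index("token"), "token",
        pc.utf8_lower(tokens.column("token")))

    # The grouped sum is the step that (for now) still needs pandas.
    summed = tokens.to_pandas().groupby(["page", "token"]).sum()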

organisciak commented 3 years ago

Storing the tokenlists internally as a pyarrow table might eventually make sense. I still want the library to connect to people's pandas workflows on the front end, like you suggest, but I'm down for tweaking the internals. When we return to this, it might also be worth considering how pyarrow's development is driving pandas.

It might also be worth considering whether any parts of the library should use dask. Fundamentally, I don't like the idea of embedding multiprocessing or multithreading into the internals; I think it's better that people can run their higher-level code on multiple CPUs without each process exploding to try to use the whole system (like that issue we saw with np.dot in pySRP). However, even for single-process work, dask development seems to be ahead of pandas in things like parquet loading using pyarrow-dataset, which includes really fast row-level filtering.
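For what it's worth, a rough sketch of that dask path (the path and column names are made up, and I'm assuming the pyarrow-dataset engine that dask currently ships):

    import dask.dataframe as dd

    # Row-level filtering pushed down into the parquet read itself.
    ddf = dd.read_parquet(
        "features/*.parquet",                    # hypothetical path
        filters=[("token", "==", "evolution")],  # pyarrow predicate filters
        engine="pyarrow-dataset",
    )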

bmschmidt commented 3 years ago

pyarrow.parquet.read_table() also includes fast row-level filters on read--I've been using it on the parquet google ngrams data, and it's nice.
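For example (path and column names are hypothetical):

    import pyarrow.parquet as pq

    # Row-level predicate pushdown at read time.
    ngrams = pq.read_table(
        "ngrams/part-0.parquet",                 # hypothetical path
        columns=["token", "year", "count"],
        filters=[("token", "==", "evolution")],
    )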

Been flirting with the idea of a parquet-dataset backend... Just add 'id', 'library', and 'stubbycode' as dummy columns to the existing parquet files (which, because of run-length encoding, would take up trivial additional space on disk, though you'd want to be careful not to read them in), and then the existing stubbytree structure matches one of the pyarrow parquet-dataset layouts (not sure if it's Hive-compliant), and you could address any corpus as a whole, out of memory. Not sure what the advantage would be, but it would certainly make easy some things I like to do, such as "find every page number that has 'evolution' on it."
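Roughly what I have in mind--the directory name is a guess, and the 'id' and 'page' columns here are the proposed dummies, not anything in the current files:

    import pyarrow.dataset as ds

    # Treat the whole stubbytree as one out-of-memory dataset.
    corpus = ds.dataset("stubbytree/", format="parquet")

    # "Find every page number that has 'evolution' on it", corpus-wide.
    hits = corpus.to_table(
        columns=["id", "page"],
        filter=ds.field("token") == "evolution",
    )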

bmschmidt commented 3 years ago

Oh, the dask row-level filters are simply the pyarrow predicate filters behind a wrapper.

organisciak commented 3 years ago

Yup, that's the intent. The standard filters have been partition-wide for both fastparquet and pyarrow, but the row-wise filters are awesome, and I wonder why pandas doesn't have them yet (maybe because they're experimental?).

bmschmidt commented 2 years ago

Pyarrow 7.0, released earlier this month, includes group_by, aggregate, and sum operations written in C++ and running--I think?--multithreaded naturally.

I haven't used the specific functions here, but in general with pyarrow.compute these things tend to work faster, with less memory overhead, and more cleanly than pandas equivalents. polars or duckdb might be faster and/or more feature complete, but IMO that wouldn't justify the overhead of even an optional dependency from this library.
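For example, the grouped sum from earlier in this thread can now stay entirely in arrow, with the pandas index added only at the end if the caller wants it (made-up data again):

    import pyarrow as pa

    tokens = pa.table({
        "page": [1, 1, 2],
        "token": ["whale", "sea", "whale"],
        "count": [2, 1, 3],
    })

    # C++ grouped aggregation, new in pyarrow 7.0; result column is count_sum.
    counts = tokens.group_by(["page", "token"]).aggregate([("count", "sum")])

    # Cast to pandas and add the index afterwards, only if that's what's wanted.
    df = counts.to_pandas().set_index(["page", "token"])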

Arrow is currently used for the parquet reading, and as described above it is faster for JSON parsing than writing to pandas directly. So if @organisciak et al are amenable, when I get a chance I'm going to take a look at rewiring the current internals of this package to use only arrow tables. Public methods would still return pandas frames by default--it's very fast to cast from arrow to pandas--but where useful/possible I would want escape hatches to the underlying arrow tables, and the indexes would be slapped on after aggregation rather than being integral to creating the aggregations.

Not yet clear to me if the arrow hatch would be:

  1. Parallel methods with different names (maybe the same names, prefixed w/ arrow_);
  2. An additional argument to the pandas-generating functions (a strawman sketch of this follows the list);
  3. An additional argument to Volume, Parser, or some other class that causes the pandas-generating functions to change what they emit.
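
A strawman for option 2--the signature and helper name here are invented, not the current API:

    # Hypothetical: the existing pandas-returning method grows an output argument.
    def tokenlist(self, case=True, pos=True, output="pandas"):
        table = self._make_tokencount_table()  # hypothetical arrow-native helper
        if output == "arrow":
            return table
        return table.to_pandas().set_index(["page", "section", "token", "pos"])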
organisciak commented 2 years ago

I welcome this.

Solution #1 sounds most user-friendly to me; changing the output by argument is already done in various places, but the library has grown complex enough that I'd avoid it for future features.

The massivetexts fork has been a de facto dev branch during the SaDDL project, so it may be worth checking whether anything from there needs to be merged.


bmschmidt commented 2 years ago

An in-progress FYI here.

I poked at this a little more and noticed an interesting possibility in pyarrow/parquet that I've been using in my work on https://github.com/bmschmidt/nonconsumptive. The format I've settled into there does some fancy formatting to stash pyarrow StructArrays--which are effectively data frames--inside individual cells. There I'm storing the results for multiple books in a single frame. But in feature-reader, it's possible to serialize to a one-row parquet file with multiple columns--some of them are just metadata, but the full feature count data can also live in a single cell. This would make it possible, if done carefully, to create a parquet file that contains all of the information in the json.bz2 files, not just the page count information.

That's sketched out in the pull request I just started. Supplement all of the _make_section_feature_df methods with arrow-returning versions named like _make_section_feature_table.

Create a single-row table holding all of those, including tokencounts.

Somewhat to my surprise, I still can't beat bz2 compression--there's perhaps a 10-20% size penalty even with zstd at compression_level = 15. But the unpacking should be zero-copy and therefore extremely fast.

It's also a format that allows putting multiple EF files into a single parquet file with fast random access, which would solve my inode problems without mucking about with zips. For my own purposes I'm pretty sure this is the only way I'll store EF files going forward.

First stab at a format, not including metadata, is this schema:

nc:line_chars: list<item: struct<seq: int64, section: string, line_position: string, character: string, count: int64>>
nc:tokencounts: list<item: struct<page: uint64, section: string, token: string, pos: string, count: uint32>>
nc:pages: list<item: struct<seq: int32, languages: string, calculatedLanguage: string, version: string>>
nc:sectioninfo: list<item: struct<seq: int64, section: string, tokenCount: int64, lineCount: int64, emptyLineCount: int64, capAlphaSeq: int64, sentenceCount: int64>>
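
A minimal, self-contained sketch of building one such row and writing it with the zstd settings mentioned above (the volume id and token values are invented):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # nc:tokencounts as list<struct<page, section, token, pos, count>>.
    tokencount_type = pa.list_(pa.struct([
        ("page", pa.uint64()),
        ("section", pa.string()),
        ("token", pa.string()),
        ("pos", pa.string()),
        ("count", pa.uint32()),
    ]))

    # One row per volume; the full token count frame lives in a single cell.
    volume = pa.table({
        "id": ["mdp.39015012345678"],  # hypothetical volume id
        "nc:tokencounts": pa.array([[
            {"page": 1, "section": "body", "token": "whale", "pos": "NN", "count": 3},
            {"page": 2, "section": "body", "token": "sea", "pos": "NN", "count": 1},
        ]], type=tokencount_type),
    })

    pq.write_table(volume, "mdp.39015012345678.parquet",
                   compression="zstd", compression_level=15)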

Downsides here: