bmschmidt opened this issue 3 years ago
This could be a function added to the utils module, perhaps?
I think it makes more sense as a Volume method? It's just a different kind of tabular-format function, and as valid for json-backed files as anything else.
In your example, you're deliberately not instantiating a volume, right?
It could work in Volume. I'd love it if the documentation were clear that it's for advanced users who only want that count out of the files. The reason: if they run Volume.arrow_counts(), then ask for any more advanced token info, they'll still end up instantiating and caching the tokenlist, as in the sketch below. Since the primary purpose of this library is scaffolding between EF and Pandas, I expect that wouldn't be too uncommon a situation.
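For example (hypothetical usage; arrow_counts is just the name floated in this thread, not a real API):

```python
vol = Volume(path)
fast = vol.arrow_counts()  # cheap: no tokenlist is built
more = vol.tokenlist()     # but this still parses and caches the full tokenlist,
                           # so the earlier savings are largely lost
```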
Maybe call it fast_count()? If I saw that method, I'd understand what it does, then read the docs to figure out what the catch is 😄
Yeah, deliberately avoiding instantiating a volume there just b/c that's the prototype for the method.
Happy with fast_count; another naming option might be raw_counts?
To be clear, it wouldn't be just for counts, though; it can return arrow tables with any desired columns. With a parquet backend, it would perform:
read_parquet_or_feather(self.path, columns=['page', 'section', 'token', 'pos', 'count'])
and with a json parser, it would do
self._make_tokencounts_df(arrow=True).select(['page', 'section', 'token', 'pos', 'count'])
Either of those would be faster than self.tokencounts_df().
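Putting the two branches together, a minimal sketch of what the method body might look like (the two helper calls are the ones quoted above; self.file_format and the default column list are assumptions):

```python
FAST_COLS = ['page', 'section', 'token', 'pos', 'count']

def fast_count(self, columns=FAST_COLS):
    """Return raw token counts as an arrow Table, never touching pandas."""
    if self.file_format in ('parquet', 'feather'):
        # Column pruning happens at read time, so unwanted columns
        # are never decoded from disk.
        return read_parquet_or_feather(self.path, columns=columns)
    # json-backed files still pay the full parsing cost; we only skip
    # building the pandas index by asking for an arrow table directly.
    return self._make_tokencounts_df(arrow=True).select(columns)
```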
Loading from parquet or feather into pandas to create indices is time-consuming for tasks where you don't actually want the data in pandas (e.g., passing counts straight into tensorflow or numpy).
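For instance (the file name here is illustrative), counts can go from parquet to numpy with no DataFrame in between:

```python
import pyarrow.parquet as pq

# Read only the 'count' column and hand it straight to numpy;
# no pandas index is ever constructed.
counts = pq.read_table('volume.parquet', columns=['count'])['count'].to_numpy()
total = counts.sum()
```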
Using a basic benchmark of loading 52 random volumes and summing the wordcounts column, here's a comparison of the two methods, timed in seconds. Using arrow.parquet.read_parquet straight into arrow format and summing word counts is almost 10x faster. Feather-based approaches can be another order of magnitude faster still, but that's probably because they avoid unicode work altogether while parquet has to unpack it. In real life, you have to do the unicode.
I'd propose that this method be a bit less user-oriented than the pandas ones: no support for lowercasing, etc. Just a basic wrapper to pull out some columns, with computation happening elsewhere.
(Benchmark snippets labeled METHOD A and METHOD B did not survive extraction.)
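Given the description above, the two snippets plausibly looked something like this (paths and the 52-volume sample are assumptions; only the ~10x figure comes from the thread):

```python
import glob
import pandas as pd
import pyarrow.parquet as pq
import pyarrow.compute as pc

paths = glob.glob('sample_volumes/*.parquet')  # hypothetical 52-volume sample

# METHOD A: through pandas, paying for DataFrame and index construction.
total_a = sum(pd.read_parquet(p, columns=['count'])['count'].sum()
              for p in paths)

# METHOD B: straight into arrow and summed there; per the numbers quoted
# above, roughly 10x faster than METHOD A.
total_b = sum(pc.sum(pq.read_table(p, columns=['count'])['count']).as_py()
              for p in paths)
```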