osPlanning / omx

Open Matrix (OMX)
https://github.com/osPlanning/omx/wiki

+Arrow/Feather #37

e-lo opened this issue 3 years ago

e-lo commented 3 years ago

I'd like to propose that we evaluate the feasibility of supporting the faster Arrow-based data format (https://arrow.apache.org/).

billyc commented 3 years ago

I second this proposal! Performance is the main problem with OMX as written.

Coincidentally, I was thinking of building an Arrow-based Python proof of concept for this just last week. Did your proposal come out of some other conversations recently?


e-lo commented 3 years ago

Proposal based on:

  1. Feedback that OMX was too slow to use in production (noted by some in the data standards learning session)
  2. Use of these data formats in other contexts because of their speed benefits, and the sense that HDF5 is not a standard most people are adopting these days (as far as I can tell)

pedrocamargo commented 3 years ago

I would love for the next iteration of OMX to be based on Arrow, but is the objective of OMX to be used in production now?

e-lo commented 3 years ago

is the objective of OMX to be used in production now?

That's a good question for the organizing group (which is who, these days?). In practice, it is being used in production.

pedrocamargo commented 3 years ago

I also use it in production and made AequilibraE capable of using it as well. However, if the OMX mission changes, then I would say it would be worth it to explore other data formats to make sure we get it right. Also, would we ask software providers to switch to the new format? Or will we support both?

billyc commented 3 years ago

I don't see improved performance of OMX as being a change of mission! Our tech should be useful and frictionless, to help spur adoption.

Existing OMX files have a "VERSION 1" key embedded in them, precisely because we wanted the format to be changeable if the need arose. We always knew that performance of HDF5 is not great because of its slow compression library. There just weren't better alternatives at the time.
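
For anyone who wants to check what their existing files carry, the version marker is just an HDF5 attribute and can be read with any HDF5 library. A minimal sketch with h5py (the path is a placeholder, and the attribute name in current files per my reading of the 0.2 spec is OMX_VERSION):

```python
# Minimal sketch, not part of the OMX API: inspect the root attributes of an
# OMX file with h5py to see its embedded version marker. "skims.omx" is a
# placeholder path; the attribute name may vary by spec version.
import h5py

with h5py.File("skims.omx", "r") as f:
    print(dict(f.attrs))  # e.g. {'OMX_VERSION': '0.2', ...}
```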


gregorbj commented 3 years ago

I think that supporting an Arrow-based format and other formats in the future is probably necessary if OMX is to endure as anything more than an exchange format. The spec would have to become more abstract. One issue will be how specific the spec should be about data structure. For example, it is my (limited) understanding that Arrow supports storage of tabular data in columnar format, where each column can store a different data type. This is the approach that VisionEval takes. OMX stores matrix data in a matrix format. So what should the spec say in that regard? There might need to be a part of the specification to deal with each type of backend that is supported: if HDF5, how it is structured; if Arrow, how it is structured; etc. Or maybe it is entirely functional, identifying functions that must be supported.
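
To make the contrast concrete, here is an illustrative sketch (names and values are invented) of the two layouts: an Arrow table whose columns are independently typed, versus an OMX-style homogeneous zone-by-zone matrix:

```python
# Illustrative only: the columnar/tabular layout vs. the matrix layout.
import numpy as np
import pyarrow as pa

# Columnar (VisionEval-style): each column can have its own type
table = pa.table({
    "origin": pa.array([1, 1, 2], type=pa.int32()),
    "destination": pa.array([1, 2, 1], type=pa.int32()),
    "auto_time": pa.array([0.0, 7.5, 7.9], type=pa.float32()),
})

# Matrix (OMX-style): one dtype for the whole zone-by-zone array
auto_time = np.zeros((2, 2), dtype=np.float32)
auto_time[0, 1] = 7.5
auto_time[1, 0] = 7.9
```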

pedrocamargo commented 3 years ago

@billyc, I was referring to using OMX in production and not as a common format for transfer between platforms (the latter was my understanding of the mission, but I am probably mistaken and remember only part of it).

bstabler commented 3 years ago

I like this idea and I like the idea of discussing this. Anyone interested in discussing please comment on this thread and then we can brainstorm next steps - maybe a meeting to discuss, maybe a prototype, etc. Thanks!

bstabler commented 3 years ago

If we're thinking about a next version, let's include other potential ideas as well - more flexibility, more data types, better API conformity, CI for testing APIs, better viewers, etc.

e-lo commented 3 years ago

@bstabler - perhaps:

  1. create a feature-request issue template
  2. make a call for feature-requests (more broadly than here on github)
  3. ask people to comment about their support
  4. develop a backlog for next version

jpn-- commented 3 years ago

Apparently I was not "watching" and didn't see this conversation initially. Count me in 👍

jeabraham commented 3 years ago

How interesting! HDF5 is primarily a disk storage format, with an option to force in-memory use. Arrow is exclusively an in-memory format, right? So the two are complementary.

I've never been a big fan of HDF5, but don't see Arrow as a way to get away from HDF5.

Arrow sure would be nice for letting us use higher-performance libraries without having to go through disk storage just to work in another platform or language for a bit.

e-lo commented 3 years ago

Arrow is exclusively an in-memory format, right?

Feather is its on-disk complement.
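
For anyone who wants to experiment, a minimal sketch of round-tripping a dense matrix through Feather with pyarrow (the flattened single-column layout and the "SOV_TIME" name are just illustrative, not a proposed OMX convention):

```python
# Sketch only: write a dense skim matrix to Feather and read it back.
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather

skim = np.random.rand(100, 100).astype(np.float32)

table = pa.table({"SOV_TIME": skim.ravel()})   # flatten to one column
feather.write_feather(table, "skims.feather", compression="zstd")

loaded = feather.read_table("skims.feather")
restored = loaded.column("SOV_TIME").to_numpy().reshape(100, 100)
assert np.array_equal(skim, restored)
```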

pedrocamargo commented 3 years ago

And Arrow+Feather is ridiculously fast...

jpn-- commented 3 years ago

And Arrow+Feather is ridiculously fast...

Did some noodling on this over the weekend. +1 to ridiculously fast ... not just "I don't want to wait while the data saves to disk" fast, but bordering on "I don't need to load skims into RAM to use them" fast.
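
The "don't need to load into RAM" part comes from memory mapping the file. Roughly (illustrative only, using the placeholder file and column from the sketch above; the zero-copy benefit only fully applies when the file is written uncompressed):

```python
# Illustrative: open a Feather file memory-mapped, so column data is paged
# in on demand rather than loaded up front.
import pyarrow.feather as feather

table = feather.read_table("skims.feather", memory_map=True)
col = table.column("SOV_TIME")   # backed by the memory-mapped file
value = col[0].as_py()           # touches only the pages it needs
```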

jpn-- commented 3 years ago

Talk is cheap. Here instead is a straw man proposal for you all to beat around a bit. https://github.com/jpn--/arrowmatrix

pedrocamargo commented 3 years ago

Quite impressive results and effort, @jpn-- !

pedrocamargo commented 3 years ago

The development of the PyTables project (on which OMX relies) seems to be quite slow these days, and there doesn't seem to be any hurry in supporting the newly released Python 3.9

https://github.com/PyTables/PyTables/issues/823

jpn-- commented 3 years ago

The development of the PyTables project (on which OMX relies) seems to be quite slow these days, and there doesn't seem to be any hurry in supporting the newly released Python 3.9

I wouldn't worry too hard about not having wheels out on PyPI supporting 3.9 yet. The same applies to plenty of other relevant and very active projects <cough>pyarrow</cough>. Both have 3.9 support on conda-forge.

pedrocamargo commented 3 years ago

My concern is a little more with the frequency of updates to the library, @jpn-- , but you are right that the 3.9 release in itself is nothing to worry about for now.

amotl commented 3 years ago

Dear Pedro and Jeffrey,

thanks to @avalentino, PyTables-cp39 wheels for Linux are available on PyPI now. See also https://github.com/PyTables/PyTables/issues/823#issuecomment-729116365.

With kind regards, Andreas.

pedrocamargo commented 3 years ago

Has anybody looked further into this change? PyTables still does not have wheels for Python 3.9 for either Windows or macOS, so I would say that the case for migrating to Arrow is getting even better...

bstabler commented 3 years ago

@toliwaga did some further comparisons of HDF5 versus Arrow/Feather for ActivitySim and the performance gains were not great. If I recall correctly, the results for the use case of reading several full matrices into RAM, which is what we typically do for activity-based models because we need random access to hundreds of millions of cells as fast as possible, were underwhelming. Maybe @toliwaga can add some more details?

Nevertheless, I'm supportive of developing and releasing an updated version, say v0.3, of OMX that supports either HDF5 or Arrow/Feather because it's popular, supported, and faster under some additional use cases.

billyc commented 3 years ago

It would be great to see the results of those comparisons here, if @toliwaga is willing to share them. Otherwise someone will probably ask for it again :-)

pedrocamargo commented 3 years ago

My concern, besides the fact that HDF5 has lost a lot of momentum in favor of more modern formats such as Arrow and Feather, is that the use case of just loading all arrays from disk once is a rather narrow one, @billyc

e-lo commented 3 years ago

the use case of just loading all arrays from disk once is a rather narrow one.

Fully agree. Even within the scope of a travel model, there are lots of uses for the matrices used/created in travel models beyond "running the actual model". I'm surprised that there wasn't a significant amount of time saved. Based on some of what I've read, there should be time saved on read/write I/O, as well as significant RAM improvements. The RAM improvements alone are worth considering, since they could reduce the need for specialized "modeling machines".

Another thing to consider is whether Arrow/Feather is the right "storage" mechanism beyond intra-run use, or whether Parquet (which is considered "archival") is. Ideally OMX would handle either.

billyc commented 3 years ago

Beyond all the above reasons, HDF5 doesn't have any bindings for JavaScript (and likely never will) -- so it's literally impossible to access OMX skims from front-end browser code without relying on a Node server to broker any requests.

It sounds like we have more than enough justifications to at least keep exploring this.

jpn-- commented 3 years ago

I'm surprised that there wasn't a significant amount of time saved.

The work by @toliwaga on this was in the context of ActivitySim. Overall the time spent loading HDF5 OMX data in an ActivitySim model is tiny compared to the runtime of the whole model -- cutting the plain load time from say 50 seconds to 10 seconds (not @toliwaga's results, just some approximate numbers from what I've played with) doesn't matter much when running everything else takes hours, and that makes it not worth a ton of development effort on the part of the ActivitySim consortium. But as we all agree, that's just one use case.

So I'd like to invite all of you who are interested to look at the straw man proposal I put forth a few months ago, and particularly the implementation details. Post here some thoughts about what's good and what's bad in there. From some more concrete thoughts perhaps we can move past "yes we should talk more about this" to actually outlining a new set of principles we want to pursue in the next version of the standard.

toliwaga commented 3 years ago

Sorry to be so slow in responding - I took a very long (and wonderful) summer vacation and am only just sorting through all stuff that happened while I was away.

I agree with @jpn-- that the ActivitySim use case is not representative, and so my observations may have little bearing on this question.

Activitysim is a long running program with many models that do repeated lookups of various skims.

The ordinary use case is that ActivitySim loads all of the skims into memory once at the start of the run and stores them in a large 3-dimensional numpy array (which is placed in shared memory when multiprocessing). The various models access individual skims or skim sets (e.g. drive time for different time periods) via wrappers designed for convenience and legibility in expression files. The initial load time is not very important - what is important is that subsequent skim references are fast and that the data is stored in a way that can be shared across processes.
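
Schematically (a simplified sketch, not ActivitySim's actual SkimDict code), the in-memory layout is something like this:

```python
# Simplified sketch of the in-memory layout: all skims stacked in one 3-D
# array, with a name-to-index mapping used by the lookup wrappers.
import numpy as np

skim_names = ["SOV_TIME_AM", "SOV_TIME_MD", "SOV_DIST"]   # hypothetical names
num_zones = 1500
skim_data = np.zeros((len(skim_names), num_zones, num_zones), dtype=np.float32)
skim_index = {name: i for i, name in enumerate(skim_names)}

def lookup(name, origins, destinations):
    # origins/destinations are equal-length arrays of zone indices
    return skim_data[skim_index[name], origins, destinations]
```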

@jpn-- presented a straw man proposal that, in addition to other possible advantages, suggested that it might be possible to avoid the runtime and memory overhead of preloading the skims and instead read them just-in-time for skim lookup. The example showed both good performance and a promising near-zero memory footprint.

I played around with that approach to see whether it might be possible to use Feather files as an alternative to in-memory skims.

The first problem I ran into was that accessing all skims would eventually bring all the skim data into memory. As the documentation says, "Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory." This wasn't obvious in the example @jpn-- provided because it accessed the same skim repeatedly, so the gradual increase in memory usage didn't show up. I couldn't find any way to free the memory short of opening and closing the file at every access - which slowed the process down.

However, the rapidity of feather file opening suggested a different, analogous approach which I then explored.

I implemented a numpy memmapped skim_dict class as an alternative to the existing ActivitySim in-memory array version. By opening and closing the memmap file just-in-time to perform skim or skim_stack lookups, the memmap implementation avoided the 'leakage' associated with Jeff's approach - at the expense of redundant (albeit rapid) loads of skim data.
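
Roughly, the just-in-time pattern looks like this (a hypothetical sketch; the class, file layout, and names are illustrative, not the actual ActivitySim implementation):

```python
# Hypothetical sketch of just-in-time memmap skim lookups: open the memmap,
# read only the cells needed, then drop the handle so no skim data stays
# resident between lookups.
import numpy as np

class MemMapSkims:
    def __init__(self, path, num_skims, num_zones, dtype=np.float32):
        self.path = path
        self.shape = (num_skims, num_zones, num_zones)
        self.dtype = dtype

    def lookup(self, skim_idx, origins, destinations):
        data = np.memmap(self.path, mode="r", dtype=self.dtype, shape=self.shape)
        values = np.asarray(data[skim_idx][origins, destinations])  # copies just these cells
        del data  # drop the reference so the mapping can be released
        return values
```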

This resulted in a zero-overhead skim implementation with runtime performance 'only' 60% slower than in-memory skims - a runtime handicap that could possibly be compensated for by the reduced memory requirements in certain implementations. This is worth exploring; I should think it might be of interest to MPOs with truly gigantic skims, especially if they are more constrained on the memory side than on the processor side.

# stats below are for a Full run of 3-zone Marin on wrjsofppw01

households_sample_size: # all households
initialize_tvpb num_processes: 20
tour_mode_choice_simulate num_processes: 32?

# skim_dict_factory: NumpyArraySkimFactory
Time to execute run_sub_simulations step mp_tvpb : 670.767 seconds (11.2 minutes)
Time to execute run_sub_simulations step mp_mode_choice : 327.657 seconds (5.5 minutes)?
high water mark rss: 485.85

# skim_dict_factory: MemMapSkimFactory
Time to execute run_sub_simulations step mp_tvpb : 763.762 seconds (12.7 minutes)
Time to execute run_sub_simulations step mp_mode_choice : 525.076 seconds (8.8 minutes)
high water mark rss: 333.78

Disabling tap-tap utility calculation (rebuild_tvpb_cache: False) shows that the memory requirements for the 32-process tour_mode_choice model run are strikingly low:

Total memory requirements for the 32-process tour_mode_choice model step with MemMapSkimFactory are 145GB - or under 5GB per process.

This is all - last I checked - easily turned on and off by simply changing the skim_dict_factory setting in network_los.yaml from NumpyArraySkimFactory (the default) to MemMapSkimFactory.

skim_dict_factory: NumpyArraySkimFactory

skim_dict_factory: MemMapSkimFactory

This will cause ActivitySim to create a numpy memmap cache file (if it does not already exist), which it then opens and closes just-in-time for each skim access. This should work in either single- or multi-process mode.

This was never really exhaustively tested because it was just a little side project I did on my own time - not something that was part of the funded development effort.

bstabler commented 2 years ago

Anyone eager to get something going on this topic? I've been too busy to move this along. Thanks.