itinero / reminiscence

A library for cross-platform memory mapping.

[Enhancement] Ideas for 2.0? #11

Open airbreather opened 7 years ago

airbreather commented 7 years ago

Spun off from #7.

From @xivk:

@airbreather Any ideas on improving this library are very welcome; to be honest, it can use a lot of work. There is also this:

http://adamsitnik.com/Span/

I'm preparing work on v2 so it's the right time to make big changes.

airbreather commented 7 years ago

Very quick thoughts: the System.IO.MemoryMappedFiles APIs are now in .NET Standard, as of version 2.0.

Reasoning later... I need to leave for work soon.

xivk commented 7 years ago

Ok, I didn't know that was coming to netstandard 2.0. I'm planning netstandard2.0-only support for Itinero 2.0 anyway.

Perhaps some form of dependency injection, like you did in Itinero now, would work best. It's all about Itinero; if we can work without Reminiscence, fine by me, but it's core to Itinero's functionality at the moment.

xivk commented 7 years ago

Wow, there is even a netstandard1.3 implementation:

https://www.nuget.org/packages/System.IO.MemoryMappedFiles/

airbreather commented 7 years ago

Here's the reasoning for the bullet points I had above.

  • I think it would be very worthwhile to add a .NET Standard 2.0 target and an implementation of reminiscence's APIs that uses it.

When building a contracted graph (which is the primary story I've been focusing on, because it's very expensive), I have found it to be outright unreasonable to use Reminiscence's file mapping because of all the seeks and tiny reads... even the bucket selection and discontiguous memory access patterns from Reminiscence's MemoryArray<T> have proven to dominate the time, given how much relative improvement that magical "Unsafe" thing from my wiki showed.

So whatever the solution winds up being, I really hope it can leverage the built-in memory-mapped file implementation on whatever platform we're running on, which works behind the scenes exactly like that "Unsafe" thing does: dereference a pointer to a specific spot in virtual memory... the only difference is that if there's a page fault, the OS handles it by fetching a section of the file.

This should be much faster than explicit seeks, throwing away the managed buffered data, etc.
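
For illustration, here's a minimal sketch of that pattern using the built-in System.IO.MemoryMappedFiles types (the class and method names here are made up for the example):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static unsafe class MappedReadSketch
{
    public static void Demo(string path)
    {
        using (var file = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var accessor = file.CreateViewAccessor())
        {
            // Acquire the base pointer once, up front...
            byte* basePtr = null;
            accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref basePtr);
            try
            {
                // ...then every access is just a pointer dereference. A page
                // fault here is serviced by the OS, which pages in the
                // relevant section of the file: no explicit seeks, no managed
                // buffering.
                float value = *(float*)(basePtr + 1234 * sizeof(float));
                Console.WriteLine(value);
            }
            finally
            {
                accessor.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
    }
}
```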

  • I also think that it would probably be worthwhile to redesign parts of reminiscence so that the System.IO.MemoryMappedFiles implementation is as "thin" / "light" as possible, at the cost of other implementations being "heavier".

The reason I think it should be as light as possible is because the System.IO.MemoryMappedFiles APIs already err on the, for lack of a better term, "dangerously safe" side. From my understanding, the gains I'd claimed over there from using UMA have a lot to do with cutting out the crap and just treating this performance-sensitive code the way that C does: move bytes from where they come from to where they need to go, and then dereference a pointer that points to the right ranges of data.

System.IO.MemoryMappedFiles APIs start to go a bit in the other direction on that, but the framework designers know this stuff really well, and they made what they felt to be the right trade-offs for performance vs. safety here.

Perhaps some form of dependency injection, like you did in Itinero now, would work best. It's all about Itinero; if we can work without Reminiscence, fine by me, but it's core to Itinero's functionality at the moment.

I'm thinking that if you're already considering redesigns here, it might be worthwhile to distinguish between the two major ways in which Reminiscence is used in today's Itinero:

  1. Possibly low-powered platforms want to be able to use a pre-computed network with a contracted graph to get from point A to point B.
  2. High-powered platforms want to build an Itinero-formatted road network and contracted graph as fast as possible (IDP).

Those "low-powered" platforms in 1 want to be able to quickly and efficiently read the data from the file, but they don't read a big percentage of the data in a single operation, and the execution time is probably (I haven't run the numbers) still going to be dominated by the routing algorithms rather than the memory accesses. I don't think it's probably all that important for the main component of Itinero to be able to pre-load huge chunks of the file into memory all at once... you could probably get away with "hardcoding" the fact that these use-cases use memory-mapped files.

Short version for 1: for the "everything's already built" use-cases, you probably want to invest in getting the random-read performance to a place where it's adequate, and that's it. Reminiscence 1.x might already be there for this use case, but I doubt it... I still have yet to finish building a proper planet.osm edge-contracted graph to examine the scaling.

The "high-powered" platforms in 2 hit the whole (potentially huge) data set multiple times. The math fundamentally dictates that at least the contraction part is going to take a long time, and the physics of the problem requires it to take up a lot of temporary storage. However, these guys, I assume, are going to be more than happy to invest in as much RAM as they can get their hands on if it makes this process go faster. What that means for the software, IMO, is that we can (and should) optimize the process so that when we access data from RAM, we do so with as little overhead as possible. I expect there to be on the order of billions of random accesses between building and contracting graph for planet.osm, and it was looking like the memory accesses were still very much a part of the contraction step even after optimizing the hell out of it the best I could reasonably do at the time.

So for the "let's build everything" use-cases, you probably want to invest in making sure that the JIT has as much information as it can at every oft-used call site, so it can emit cheap assembly that lets the CPU do as much out-of-order (OOO) black magic as it can: not only paying as little as possible for the memory access itself, but also doing as much work as possible that's independent of the result while waiting for it to become available.

airbreather commented 7 years ago

Having said all of the above, I've been toying with an interesting thought: using memory-mapped files fundamentally boils down to getting a magical pointer and then letting the application use it like it uses any other pointer.

If Itinero wants to continue down the route where it uses the same API for the two different use cases, that might be reasonable without meaningfully harming performance in the IDP use case: abstract out the logic used to get the pointer, resize the underlying pointed-to storage, release the pointer, and maybe some other infrequently-used calls, while still keeping the code that dereferences the pointer (plus some offset) the same as in my vision, namely non-virtual (edit: and emitting potentially unaligned memory reads/writes on platforms that will clean up the mess, exactly the way things like System.BitConverter do today, hoping that our accesses will always be aligned when it matters).
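
A rough sketch of the shape I mean (every name here is hypothetical, not a proposed API; the unmanaged constraint and Unsafe.ReadUnaligned are just one way to get the unaligned-tolerant reads mentioned above):

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical: only the infrequently-used operations are virtual, so
// swapping implementations (memory-mapped file, unmanaged heap block, ...)
// costs nothing on the hot path.
public unsafe interface IPointerSource : IDisposable
{
    byte* Acquire(long byteLength);
    void Resize(long newByteLength);
    void Release();
}

// Sealed, with a non-virtual indexer the JIT can inline down to a bare
// unaligned read, much like System.BitConverter handles things today.
public sealed unsafe class PointerBackedArray<T> where T : unmanaged
{
    private readonly byte* _basePtr;

    public PointerBackedArray(IPointerSource source, long length)
        => _basePtr = source.Acquire(length * sizeof(T));

    public T this[long index]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => Unsafe.ReadUnaligned<T>(_basePtr + (index * sizeof(T)));
    }
}
```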

xivk commented 7 years ago

Wow, ok, I went through this once now; I'll have to read it a couple more times. It looks like you really thought this through already.

My priorities are:

At the time you did your testing, I found a bug in the contraction process. I would be surprised if it had ever finished in that state of Itinero. I'm pretty sure things are better now.

Would you be willing to contribute to some kind of design writeup for v2.0 (Itinero)? Basically focused on replacing the current ArrayBase<> implementation?

I'm also working on scaling the preprocessing by using a tiled approach:

https://twitter.com/XIVK/status/893508208458944512

The idea is to process tile by tile; it would also allow Itinero to only update part of the RouterDb while keeping the rest intact. I have an experimental version of this working. I haven't found a solution for the contracted graph yet, but I have some ideas.

Thanks again for your comments 👍 I'll try and get up to speed on some of what you are saying, perhaps also do some experiments myself.

xivk commented 7 years ago

Can you help me understand this part: "If Itinero wants to continue down the route where it uses the same API for the two different use cases"?

Do you mean preprocessing & routing, or...?

airbreather commented 7 years ago

Can you help me understand this part: "If Itinero wants to continue down the route where it uses the same API for the two different use cases"?

Do you mean preprocessing & routing, or...?

Yes, preprocessing and routing are the two use cases.

I'm at work, so right now I can just do short little messages like this.

airbreather commented 7 years ago

Would you be willing to contribute to some kind of design writeup for v2.0 (Itinero)? Basically focused on replacing the current ArrayBase<> implementation?

Yes, I can commit to doing this within about a week or so from this message.

airbreather commented 7 years ago

I went through this once now; I'll have to read it a couple more times. It looks like you really thought this through already.

The important takeaway is to try to make sure that the design doesn't paint you into a corner. Everything anyone says about performance is contingent on measurements proving that the stuff actually matters. Developer time and expertise are too valuable, and modern infrastructure is too complex, for it to make sense to seriously invest in stuff that strays too far away from tangible measurements.

My measurements using an earlier version of Itinero did reveal that slightly inefficient accesses to the data stored slightly inefficiently in those huge arrays scaled up to a cost on the order of minutes, but I'd never been able to finish an edge-based contraction on planet.osm to prove how much it matters at scale past a certain step, probably because:

At the time you did your testing, I found a bug in the contraction process. I would be surprised if it had ever finished in that state of Itinero. I'm pretty sure things are better now.

I'll redo some tests once I get some more free time and see where things stand after that. I definitely don't want to steer you into a big solution to a small problem if that's what my ideas turn out to be.

airbreather commented 7 years ago

Hmm. So, redoing the equivalent of "Huge, Unsafe 4.0.3" as described over there, using .NET Core 2.0, with some parallelism in HierarchyBuilder.CalculateQueue (which, in addition to a few other small changes throughout HierarchyBuilder, required demoting DykstraWitnessCalculator's pathTree and pointerHeap fields to local variables wherever they were used), I'm getting this console output. The first column is how many seconds have passed since the start of the operation, and it's still running. Apparently even 64 GB of physical RAM isn't enough to stop Windows 10 from doing its memory compression thing, so this would be faster on a machine with more physical memory to dedicate to the process.

Also, I'm using the numbers from this block, not the numbers from this other block. Not sure what the difference is, but since I'm just using profile.DefaultWeightHandlerCached(db) instead of some unusual weight handler or some other weight type that isn't float, I figured that was the way to go.

I haven't done much profiling yet to figure out what's the specific bottleneck now, but I will say that RouterDb.Deserialize was still a problem when using the default array factory. I'm going to proceed with a proposed design that makes sure to cover the places that use the array factory and/or make a new Array<T>.

airbreather commented 7 years ago

I've jotted down some notes in Gitter, and I'm still working through it, but I think the major impediment to System.IO.MemoryMappedFiles is the fact that 32-bit runtimes likely won't have enough addressable virtual memory space to map the needed chunks of the RouterDb files into memory, so we'd need something like this as a fallback anyway for many mobile devices.

We might just have to live with something resembling the current Reminiscence.Arrays.ArrayBase<T>.

That's not to say that there aren't breaking changes that I think might still make sense.

airbreather commented 7 years ago

I think the only overall design change that can make sense in all cases would be to take something like IArrayFactory + DefaultArrayFactory from Itinero and make that the only built-in way for consumers of Reminiscence to create ArrayBase<T> instances.

MemoryArray<T> and Array<T> become internal to Reminiscence, and we add a few creation methods to the factory so it can create the "mapped" arrays too, the way they're currently used.

This would keep the door open for future changes in Reminiscence to be non-breaking, while also giving external callers the ability to override the implementations to use something faster at their own pace.
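
Something like this is the surface I'm picturing (the names are borrowed from Itinero's IArrayFactory / DefaultArrayFactory, but the signatures are only my guess at a shape, not a settled design):

```csharp
using Reminiscence.Arrays; // ArrayBase<T>

// The single built-in entry point for creating arrays. With MemoryArray<T>
// and Array<T> internal, Reminiscence can change them freely without
// breaking consumers, and callers can plug in faster implementations by
// supplying their own factory.
public interface IArrayFactory
{
    // In-memory array (today's MemoryArray<T> behind the scenes).
    ArrayBase<T> CreateMemoryBackedArray<T>(long size);

    // Memory-mapped array (today's "mapped" Array<T> behind the scenes).
    ArrayBase<T> CreateMemoryMappedArray<T>(string filePath, long size);
}
```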

The only thing I'm not sure about (and I'm finding it really hard to solve) is how to do the memory-mapped file version so that the factory method(s) could be implemented using System.IO.MemoryMappedFiles types... but you said that's not a real problem in the case I was somewhat worried about, so I'm going to stop trying to overengineer that one.

Another problem I'm not sure about solving right now is making a built-in MemoryArray<T> replacement that uses something like that unmanaged implementation I was playing around with. It definitely won't work for variable-length types, and it also has a 32-bit-related limitation: it requires a contiguous block of virtual memory to do its thing, so even if there is enough available memory in total, the address space may be so fragmented that there's no contiguous region big enough.
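
For reference, the core of that unmanaged idea boils down to a single contiguous native allocation, something like this sketch (assumed names, not the actual code I was playing with):

```csharp
using System;
using System.Runtime.InteropServices;

public sealed unsafe class ContiguousNativeBlock : IDisposable
{
    private IntPtr _block;

    public byte* BasePointer => (byte*)_block;

    public ContiguousNativeBlock(long byteLength)
    {
        // One contiguous run of virtual address space. On a 32-bit process,
        // this can throw OutOfMemoryException even when plenty of memory is
        // free in total, if no single free region is large enough.
        _block = Marshal.AllocHGlobal(new IntPtr(byteLength));
    }

    public void Dispose()
    {
        if (_block != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_block);
            _block = IntPtr.Zero;
        }
    }
}
```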

That's all I think I can offer right now.

airbreather commented 6 years ago

The native-memory-array branch has an implementation of some of the interesting ideas from above... it uses C# 7.3, which is coming with VS2017 15.7 (at the time of writing, that version is still in preview).

It won't work for the net40 or portable40-net403+sl5+win8+wp8 targets right now, and (by design) it will only work for unmanaged types.

Assuming my earlier Gitter notes are (still) correct, this latter constraint should still let something like this work in every spot in Itinero that uses Reminiscence types except Itinero.Attributes.AttributesIndex (and the things that use it).

In return for these restrictions, we get what should be extremely fast saving / loading and reasonably fast accesses, though I can't get to performance testing for it quite yet. In theory, there are ways to get it even faster on netcoreapp2.1 (by skipping some intermediate copies and making bigger stream reads and writes) and net46 (by using Buffer.MemoryCopy).
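
To illustrate the netcoreapp2.1 part of that, here's a hypothetical helper (assuming the Stream.Read(Span<byte>) overload that ships there; all names are made up):

```csharp
using System;
using System.IO;

static unsafe class FastLoadSketch
{
    // Fills an unmanaged block straight from a stream. Stream.Read(Span<byte>)
    // lets us target native memory directly, with no intermediate managed
    // byte[] copy; on net46, the fallback would read into a pooled byte[]
    // and Buffer.MemoryCopy it out.
    public static void LoadInto(Stream source, byte* destination, long byteLength)
    {
        long remaining = byteLength;
        byte* cursor = destination;
        while (remaining > 0)
        {
            int chunk = (int)Math.Min(remaining, 1024 * 1024);
            int read = source.Read(new Span<byte>(cursor, chunk));
            if (read <= 0)
            {
                throw new EndOfStreamException();
            }

            cursor += read;
            remaining -= read;
        }
    }
}
```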

The next thing I'll do on this one is jot down some measurements from some performance tests.

xivk commented 5 years ago

Another thing we could be using here:

https://channel9.msdn.com/Shows/On-NET/High-performance-IO-with-SystemIOPipelines

airbreather commented 5 years ago

Sorry about the really long delay. I never quite got the nice uninterrupted spans of time this needed to finish it off with performance tests that made me really happy, so I had to settle for what I could get. #18 is there now, regardless.

re: System.IO.Pipelines, having experimented with it before, I don't think we have a particularly solid use case for it. For the most part (especially with #18), the performance-critical parts of what I've seen here all deal with binary formats that require little to no parsing and get read in their entirety from start to finish. This is pretty much the ideal situation for the built-in I/O APIs.

airbreather commented 5 years ago

re: "real" memory-mapped files, I've got a very rough start on my real-memory-mapped-files branch.

Needs a LOT of testing still, and I want to try to implement resizing with this later.

The difference between this and what I was originally going for in the other branch is that the ArrayBase<T> instance owns all the dirty stuff that's responsible for making memory-mapped files happen, instead of forcing the caller to manage part of it.
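
Conceptually, it's shaped something like this simplified sketch (assumed names, not the branch's actual code, and no resizing yet):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public sealed unsafe class MemoryMappedNativeArray<T> : IDisposable
    where T : unmanaged
{
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _accessor;
    private readonly byte* _basePtr;

    public MemoryMappedNativeArray(string path)
    {
        // The array owns the mapping, the pointer acquisition, and the
        // cleanup; callers just index into it.
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.Open);
        _accessor = _file.CreateViewAccessor();
        byte* ptr = null;
        _accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        _basePtr = ptr;
        Length = (long)_accessor.SafeMemoryMappedViewHandle.ByteLength / sizeof(T);
    }

    public long Length { get; }

    public T this[long index] => *(T*)(_basePtr + (index * sizeof(T)));

    public void Dispose()
    {
        _accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        _accessor.Dispose();
        _file.Dispose();
    }
}
```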