lehins / massiv

Efficient Haskell Arrays featuring Parallel computation
BSD 3-Clause "New" or "Revised" License

Provide external file (mmaped) representation #108

Open permeakra opened 3 years ago

permeakra commented 3 years ago

I have two use cases in mind. The first is (limited) persistence, allowing access to raw data. The second is working with datasets that exceed memory size by several orders of magnitude. In both cases it might be desirable to allow several massiv arrays in a single file. Interaction with madvise could probably be of use as well.

lehins commented 3 years ago

You can already do it with a bit of wrapper code. There is a mmap package that allows you to get ahold of a ForeignPtr to an mmaped file with something like mmapFileForeignPtr in: https://hackage.haskell.org/package/mmap-0.5.9/docs/System-IO-MMap.html I haven't tried this package myself, but if it worked 7 years ago, I don't see why it shouldn't work now ;)

There is no need for a special representation: once you have a ForeignPtr in hand you can wrap it into an S massiv array with something like unsafeMArrayFromForeignPtr0.
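A minimal sketch of that wrapper, not tested against any particular release. It assumes a recent massiv in which `unsafeArrayFromForeignPtr0` (the immutable sibling of `unsafeMArrayFromForeignPtr0`) is exported from `Data.Massiv.Array.Unsafe`, and it assumes the file holds nothing but `Double` values in the machine's native byte order. The name `mmapDoubles` is made up for illustration.

```haskell
module Main where

import qualified Data.Massiv.Array as A
import Data.Massiv.Array.Unsafe (unsafeArrayFromForeignPtr0)
import Foreign.ForeignPtr (plusForeignPtr)
import Foreign.Storable (sizeOf)
import System.Environment (getArgs)
import System.IO.MMap (Mode (ReadOnly), mmapFileForeignPtr)

-- | Hypothetical helper: map a whole file read-only and view it as a flat
-- array of Doubles (assumes native byte order and no file header).
mmapDoubles :: FilePath -> IO (A.Array A.S A.Ix1 Double)
mmapDoubles path = do
  -- 'Nothing' maps the entire file. The result is the pointer to the start
  -- of the mapped region, the offset of the requested data within it, and
  -- the size of the requested data in bytes.
  (fptr, offset, sizeBytes) <- mmapFileForeignPtr path ReadOnly Nothing
  let n = sizeBytes `div` sizeOf (0 :: Double)
  -- O(1) wrap: nothing is copied, the OS pages data in on access.
  pure (unsafeArrayFromForeignPtr0 A.Seq (fptr `plusForeignPtr` offset) (A.Sz n))

main :: IO ()
main = do
  [path] <- getArgs
  arr <- mmapDoubles path
  -- Folding over the array touches every page of the file once.
  print (A.sum arr)
```

The "unsafe" in the name matters here: the resulting array is treated as immutable, so if another process modifies the file while it is mapped, referential transparency is broken.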

There is no point in providing this sort of functionality in massiv directly because mmap is very much OS-specific and I wanna stay OS-agnostic as much as possible. However, I might consider a helper package that does just this, massiv-mmap or something.

Let me know how it goes if you do figure it out, or hit me up on gitter if you get stuck: https://gitter.im/haskell-massiv/Lobby

I'll keep this ticket open in case I find time to experiment with it and create such a package in the future.

permeakra commented 3 years ago

Thanks for the reply.

> There is no point in providing this sort of functionality in massiv directly because mmap is very much OS-specific and I wanna stay OS-agnostic as much as possible. However, I might consider a helper package that does just this, massiv-mmap or something.

If you expect massiv to be used in numerics code (personally, I consider it my best bet for the project I'm currently planning), you should keep in mind that such code often has to deal with data sets exceeding available RAM by orders of magnitude. If we construct a massiv array representing such data the way you described, its cost model for various access patterns would differ from that of purely in-memory arrays. In addition, specialized prefetch calls are available for mmaped files. Given that, it makes sense to have specialized algorithms for an mmaped representation.

lehins commented 3 years ago

> If you expect massiv to be used in numerics code, you should keep in mind that such code often has to deal with data sets exceeding available RAM by orders of magnitude.

@permeakra I didn't say this functionality isn't useful. I said that it should not be implemented in the massiv package. It should instead be done in a separate package that integrates with the massiv interface. The difference is subtle, but very important. I am all for making massiv able to handle huge data. It is, however, not yet on my priority list.

> If we construct a massiv array representing such data the way you described, its cost model for various access patterns would differ from that of purely in-memory arrays. In addition, specialized prefetch calls are available for mmaped files. Given that, it makes sense to have specialized algorithms for an mmaped representation.

It makes sense to have a new representation to account for different usage patterns, I certainly agree with that. But one way or another it will have to be a representation that is a wrapper around ForeignPtr, so the approach I described is a required step in implementing this.

This is how I would implement this representation:

    newtype instance Array MM ix e = MMapArray (Array S ix e)
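As a rough, compilable starting point for that idea, the declaration could sit in a module like the one below. Only the `newtype instance` line comes from the comment above. The tag type `MM`, the module name, and the helper `toStorable` are assumptions made for illustration, and a usable representation would still need instances of massiv's representation classes, which are omitted because they vary between massiv versions.

```haskell
{-# LANGUAGE TypeFamilies #-}
module MMapRepr where

import qualified Data.Massiv.Array as A

-- | Hypothetical tag type naming the memory-mapped representation.
data MM = MM

-- | An mmaped array is just a storable (S) array whose ForeignPtr happens
-- to point into a mapped file. The newtype exists so that algorithms can
-- later be specialised for paging-aware access patterns.
newtype instance A.Array MM ix e = MMapArray (A.Array A.S ix e)

-- | Hypothetical helper: drop the MM tag in O(1) to reuse every existing
-- algorithm that works on S arrays.
toStorable :: A.Array MM ix e -> A.Array A.S ix e
toStorable (MMapArray arr) = arr
```

Keeping the wrapper a newtype means the conversion to and from `S` costs nothing at run time, so mmap-specific algorithms can be added incrementally while everything else falls back to the `S` implementations.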