rust-ndarray / ndarray

ndarray: an N-dimensional array with array views, multidimensional slicing, and efficient operations
https://docs.rs/ndarray/
Apache License 2.0

Memmap supported? For accessing large data #756

Closed daCapricorn closed 4 years ago

daCapricorn commented 4 years ago

Numpy supports memory-map

bluss commented 4 years ago

ArrayViewMut can be created from an arbitrary slice, or from a raw pointer plus shape information, so it's possible that way. The same size limits still apply (you can't access memory where the distance from start to end is greater than isize::MAX).

daCapricorn commented 4 years ago

@bluss Thank you. I'll try it.

bluss commented 4 years ago

@daCapricorn we'd be curious to hear about how it works, what doesn't work and performance and such things!

goertzenator commented 2 years ago

I am also interested in memmap, but would love to have the ndarray own the associated file and memmap object. Could we have an owned array representation that takes any type implementing AsRef<[A]>?

bluss commented 2 years ago

This topic is relevant: https://users.rust-lang.org/t/is-there-no-safe-way-to-use-mmap-in-rust/70338 and we should probably say that using mmap correctly is hard; we don't really know how it can be done completely safely in Rust. More to come, not from ndarray, but from ecosystem-wide discussion of how it can be used.

@goertzenator Since it's AsRef it's by reference and not appropriate for an owned array.

goertzenator commented 2 years ago

Great link, thanks!

I mentioned AsRef because Cursor will take ownership of an AsRef-implementing object. For example, Cursor will handle a Vec<u8> and Mmap since both implement AsRef<[u8]>. Can this kind of flexibility be brought to ndarray? (or maybe there are optimization/performance requirements that prevent this kind of thing?)
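For readers unfamiliar with the pattern being referenced, here is a minimal std-only sketch of how `std::io::Cursor` owns any storage that implements `AsRef<[u8]>`. The helper name `first_bytes` is hypothetical and exists only for illustration; a `Vec<u8>` stands in for a memory map so that the example needs no external crates.

```rust
use std::io::{Cursor, Read};

/// Reads the first `n` bytes from any byte-slice-like storage.
/// `first_bytes` is a hypothetical helper, not part of any library.
fn first_bytes<S: AsRef<[u8]>>(storage: S, n: usize) -> Vec<u8> {
    // The cursor takes `storage` by value, so the storage lives
    // exactly as long as the cursor does.
    let mut cursor = Cursor::new(storage);
    let mut out = vec![0u8; n];
    cursor
        .read_exact(&mut out)
        .expect("storage is shorter than n bytes");
    out
}
```

Calling `first_bytes(vec![1, 2, 3, 4], 2)` moves the vector into the cursor, while `first_bytes(&[9u8, 8, 7][..], 3)` passes a borrowed slice through the same generic parameter. Note that `Cursor` calls `as_ref()` again on every read and never caches a pointer, which is part of why it can accept arbitrary `AsRef` types.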

bluss commented 2 years ago

The question is general enough that I think ArrayView::from already does what you want. Otherwise, more specific questions needed. 🙂

jturner314 commented 2 years ago

I think the question is basically whether we could have an array type which uses e.g. memmap2::Mmap as the storage type instead of OwnedRepr. Without this capability, using memory-mapped files can be inconvenient. You can convert &'a Mmap -> &'a [u8] -> &'a [A] -> ArrayView<'a, A, D>, but you have to hold onto the Mmap somewhere because the ArrayView is borrowing it. Dropping the Mmap closes the memory map, similar to how dropping an OwnedRepr frees the allocation.

We can't accept arbitrary types which implement AsMut<[T]> as owned storage types, since ArrayBase stores its own pointer to the first element, and AsMut doesn't guarantee that .as_mut() returns the same reference every time. We could however create our own unsafe trait:

/// Similar to `AsMut<[T]>`, but with safety requirements.
///
/// # Safety
///
/// The implementing type must ensure that calling `.consistent_as_mut_slice()`
/// on a given value of the type always returns a slice with the same pointer and length,
/// even if the data in the slice is modified.
pub unsafe trait ConsistentAsMutSlice<T> {
    fn consistent_as_mut_slice(&mut self) -> &mut [T];
}

and then allow types which implement this trait to be storage types for ArrayBase. Or, we could implement the necessary *Data* traits for the Mmap type.
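To illustrate why a plain `AsMut<[T]>` bound is not enough, here is a small std-only sketch. The type `Flipping` and the helper `two_pointers` are hypothetical; they are not part of ndarray. `Flipping` is a perfectly legal `AsMut<[u8]>` implementation, yet a pointer cached from its first `as_mut()` call would not match the slice returned by the second call.

```rust
/// A deliberately awkward (hypothetical) type whose `as_mut` alternates
/// between two separate buffers on successive calls.
struct Flipping {
    a: Vec<u8>,
    b: Vec<u8>,
    use_b: bool,
}

impl AsMut<[u8]> for Flipping {
    fn as_mut(&mut self) -> &mut [u8] {
        // Nothing in the `AsMut` contract forbids this.
        self.use_b = !self.use_b;
        if self.use_b {
            self.b.as_mut_slice()
        } else {
            self.a.as_mut_slice()
        }
    }
}

/// Returns the data pointers produced by two consecutive `as_mut` calls.
fn two_pointers<S: AsMut<[u8]>>(storage: &mut S) -> (*const u8, *const u8) {
    let first = storage.as_mut().as_ptr();
    let second = storage.as_mut().as_ptr();
    (first, second)
}
```

For a `Vec<u8>` the two pointers are equal, but for `Flipping` they differ. An array type that stored the first pointer would then be pointing at the wrong buffer, which is the situation the `unsafe` trait's contract is meant to rule out.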

@goertzenator You may be interested in ndarray_npy::{ViewNpyExt, ViewMutNpyExt}, which simplify working with memory-mapped files in .npy format. You can create an .npy file larger than the available memory using ndarray_npy::write_zeroed_npy. When working with a memory-mapped file, you still have to hold onto the Mmap/MmapMut handle separately, though.

adamreichold commented 2 years ago

> We can't accept arbitrary types which implement AsMut<[T]> as owned storage types, since ArrayBase stores its own pointer to the first element, and AsMut doesn't guarantee that .as_mut() returns the same reference every time. We could however create our own unsafe trait:

Just wanted to add that this already exists in the stable_deref_trait crate and that memmap2::MmapMut implements it.

goertzenator commented 2 years ago

I'd like to leave a side story about how memmap found a new way to bite me:

I am reading a custom format consisting of header information followed by up to tens of GB of u16s. It turns out this format makes no guarantees about alignment, and the offset to the main data block was in fact odd. numpy.memmap had no problem with this and, unbeknownst to me, was returning misaligned arrays. Misaligned access is slow on x86 and fatal on at least armv7. Fast forward to my switchover from numpy to Rust, and align_to() revealed my alignment problem by providing one fewer u16 than expected. I will be setting memmap aside for the time being to avoid these troubles...
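The alignment pitfall described above can be sketched with std only. Both function names are hypothetical, and the little-endian byte order in the copying fallback is an assumption, since the thread does not say which byte order the custom format uses. `view_as_u16` refuses to hand out a zero-copy view unless `align_to` reports no leftover bytes on either side, so a data block at an odd address is reported instead of silently losing an element.

```rust
/// Tries to view `bytes` as `u16`s without copying.
/// Returns `None` if the bytes are misaligned or have a trailing odd byte.
fn view_as_u16(bytes: &[u8]) -> Option<&[u16]> {
    // SAFETY: every bit pattern is a valid `u16`, so reinterpreting
    // initialized bytes as `u16` is sound.
    let (prefix, middle, suffix) = unsafe { bytes.align_to::<u16>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(middle)
    } else {
        None
    }
}

/// Copying fallback that works at any byte offset.
/// Assumes little-endian data; a trailing odd byte is ignored.
fn copy_as_u16_le(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}
```

A caller could try `view_as_u16` first and fall back to `copy_as_u16_le` when it returns `None`. That gives up the zero-copy benefit for misaligned files, which may not be acceptable at tens of GB, so this is a way to detect the problem rather than a full solution.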