Investigate possible issues with partial views

vbarrielle commented 4 years ago

As evidenced by #242 and #243, middle_outer_views is not tested enough, and edge cases may happen in numerous algorithms. It's necessary to write tests for many functions involving these partial views, to reveal the presence of possible bugs.

mulimoen commented 4 years ago

What are the requirements of data and indices? Do they have to be the same length, or is it enough for them to be as long as indptr requires? Does indices outside of the range given by indptr have to be convertible to usize?

mulimoen commented 4 years ago

Minor optimisation: to_owned of a view could adjust indptr to start at 0, and contain the minimum amount of entries in indices and data

vbarrielle commented 4 years ago

What are the requirements of data and indices? Do they have to be the same length, or is it enough for them to be as long as indptr requires? Does indices outside of the range given by indptr have to be convertible to usize?

In most algorithms, the only requirement is that for all i < indptr.len() - 1, indices[indptr[i]..indptr[i + 1] is not out of bounds, same for data. This imposes that data and indices have the same length, but does not impose anything outside that range. However, in the current state, the structure checks of CsMat are much more restrictive, and it's very probable some implementations depend on these more precise details.

I see two possible lines of action:

commit to the possibility to have zero-copy partial views of a matrix, relax structure checks and increase test coverage
take a shortcut and allocate when trying to slice a matrix. This seems to be scipy's approach.

I'm lean towards the first approach, as I don't like the performance penalty of the second one.

Minor optimisation: to_owned of a view could adjust indptr to start at 0, and contain the minimum amount of entries in indices and data

Yes that looks like a good idea.

vbarrielle commented 4 years ago

A drawback of the zero-copy approach is that equality between two sparse matrices is no longer trivial, because the non-indexed elements are arbitrary. Pretty sure the current implementation exposes this bug.

vbarrielle commented 4 years ago

There is a way to have a zero copy approach without having unused/invalid elements. The idea is to use another type for the indptr storage, that would store a slice and an offset to apply to the values. To make this fit into the API there would need to be a trait defining everything that can be done with an indptr, and all algorithms would be implemented using this trait. This way the type system would help to prove correct usage.

That would technically be a breaking change, but not one that would affect much code in practice I guess.

mulimoen commented 4 years ago

Equality as defined is given as structural equality, so don't think this is a big problem. The approx trait could be used for semantic equality

mulimoen commented 4 years ago

Scipy ensures indptr always starts at 0, what does suitesparse do?

vbarrielle commented 4 years ago

Looks like SuiteSparse enforces the indptr to start at 0, for instance in ldl_valid_matrix.

vbarrielle commented 4 years ago

There is a way to have a zero copy approach without having unused/invalid elements. The idea is to use another type for the indptr storage, that would store a slice and an offset to apply to the values. To make this fit into the API there would need to be a trait defining everything that can be done with an indptr, and all algorithms would be implemented using this trait. This way the type system would help to prove correct usage.

That would technically be a breaking change, but not one that would affect much code in practice I guess.

I've started prototyping a trait around that idea, any feedback is appreciated!

mulimoen commented 4 years ago

How about this approach: https://gist.github.com/mulimoen/c476ee6a2daf5ba9e341d87dfbc46e14? data and indices should always be a part of the matrix, indptr[0] is just a remapping down to zero.

vbarrielle commented 4 years ago

That's a nice insight, no need for a trait with your approach!

mulimoen commented 4 years ago

Another alternative: indptr could return a Cow

vbarrielle commented 4 years ago

I really like the concrete type approach with indptr[0] used as an offset, that makes the type a guarantee of valid data, and it can take its own share of the structure checks. I've started implementing it, it looks like this for now: https://github.com/vbarrielle/sprs/blob/99663dcad21e34e3260b64e37b6e5fa33d14ff5a/src/sparse/indptr.rs

mulimoen commented 4 years ago

Looking great! I have a couple of ideas below

I'm missing a form of canonical indptr, which can be used for FFI, maybe expressed as a Cow for an improper indptr?
ndarray calls the slice operation as_slice, should we follow the same naming?
new consumes storage on failure, could we return this on failure?
Internal API idea: a method ranges which returns an iterator of std::ops::Range<usize> which can be directly consumed
A lot of the maps can be removed, and we could use unwrap, we should always have a first and last element

vbarrielle commented 4 years ago

* I'm missing a form of canonical `indptr`, which can be used for FFI, maybe expressed as a `Cow` for an improper indptr?

Yes this looks like a good idea.

* `ndarray` calls the `slice` operation `as_slice`, should we follow the same naming?

I'll change that for coherence. But it's fallible here, so maybe try_as_slice?

* `new` consumes `storage` on failure, could we return this on failure?

I'll see if it can be done.

* Internal API idea: a method `ranges` which returns an iterator of `std::ops::Range<usize>` which can be directly consumed

That's interesting indeed, maybe it should be what's returned by iter_outer? There's one downside to that, as Rangeis not Copy AFAIR, but other than that I like this idea even for a public API.

* A lot of the `map`s can be removed, and we could use `unwrap`, we should always have a `first` and `last` element

I don't think we will always have a first and last element, for instance I don't think we should disallow creating a view around 0 rows. I know in dense linear algebra this kind of edge case can be very useful.

mulimoen commented 4 years ago

ndarrays as_slice returns an Option, try_as_slice seems a bit verbose.

A size one indptr has two elements, it would be "natural" to have a zero sized indptr with one element. Although a length zero would mean we could implement Default

vbarrielle commented 4 years ago

@mulimoen was that the kind of Cow-returning you had in mind?

mulimoen commented 4 years ago

That was exactly what I had in mind, we can now use this in to_owned on CsMatbase

vbarrielle commented 4 years ago

Hello, just an update to inform that my work on this is going to be a bit slow, my second son is born this week and I'm having much less free time. I still hope to be able to keep working on this though.

vbarrielle commented 3 years ago

My current state is shown is #245, it's not without issues but I think most of the work is there now.

sparsemat / sprs

Investigate possible issues with partial views #244