Closed vbarrielle closed 3 years ago
What are the requirements of `data` and `indices`? Do they have to be the same length, or is it enough for them to be as long as `indptr` requires? Does `indices` outside of the range given by `indptr` have to be convertible to `usize`?
Minor optimisation: `to_owned` of a view could adjust `indptr` to start at 0, and contain the minimum amount of entries in `indices` and `data`.
> What are the requirements of `data` and `indices`? Do they have to be the same length, or is it enough for them to be as long as `indptr` requires? Does `indices` outside of the range given by `indptr` have to be convertible to `usize`?
In most algorithms, the only requirement is that for all `i < indptr.len() - 1`, `indices[indptr[i]..indptr[i + 1]]` is not out of bounds, and the same for `data`. This imposes that `data` and `indices` have the same length, but does not impose anything outside that range. However, in the current state, the structure checks of `CsMat` are much more restrictive, and it's very probable some implementations depend on these more precise details.
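A minimal sketch of that check in Rust (names hypothetical, this is not the sprs API): `indptr` must be non-decreasing and its last value must fit within both `indices` and `data`, which is exactly what guarantees every window is in bounds.

```rust
/// Hypothetical minimal structure check, as discussed above: every window
/// `indices[indptr[i]..indptr[i + 1]]` (and the same window of `data`)
/// must be in bounds. Nothing is required of elements outside the windows.
fn minimal_check(indptr: &[usize], indices_len: usize, data_len: usize) -> bool {
    if indptr.is_empty() {
        return false;
    }
    // Non-decreasing indptr means every window `indptr[i]..indptr[i + 1]`
    // is a valid (possibly empty) range.
    let sorted = indptr.windows(2).all(|w| w[0] <= w[1]);
    // The largest value bounds all windows, so only it needs checking.
    let last = *indptr.last().unwrap();
    sorted && last <= indices_len && last <= data_len
}

fn main() {
    // Two rows, three non-zeros: all windows in bounds.
    assert!(minimal_check(&[0, 2, 3], 3, 3));
    // indptr claims more entries than `indices`/`data` hold.
    assert!(!minimal_check(&[0, 2, 5], 3, 3));
    // Decreasing indptr is invalid.
    assert!(!minimal_check(&[2, 1], 3, 3));
}
```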
I see two possible lines of action:
I lean towards the first approach, as I don't like the performance penalty of the second one.
> Minor optimisation: `to_owned` of a view could adjust `indptr` to start at 0, and contain the minimum amount of entries in `indices` and `data`.
Yes that looks like a good idea.
A drawback of the zero-copy approach is that equality between two sparse matrices is no longer trivial, because the non-indexed elements are arbitrary. Pretty sure the current implementation exposes this bug.
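To illustrate the problem, here is a sketch (not the sprs implementation) of why equality must compare only the indexed windows: with zero-copy views, the backing `indices`/`data` slices may contain arbitrary extra elements outside the ranges referenced by `indptr`. This assumes both operands store their indices in the same canonical order.

```rust
/// Compare two CSR-like views element-wise, ignoring any backing storage
/// outside the windows referenced by `indptr`. Assumes indices are stored
/// in the same canonical order in both operands.
fn semantic_eq(
    (ipa, ia, da): (&[usize], &[usize], &[f64]),
    (ipb, ib, db): (&[usize], &[usize], &[f64]),
) -> bool {
    if ipa.len() != ipb.len() {
        return false;
    }
    (0..ipa.len().saturating_sub(1)).all(|i| {
        let (ra, rb) = (ipa[i]..ipa[i + 1], ipb[i]..ipb[i + 1]);
        // Compare only the indexed windows, never the whole slices.
        ia[ra.clone()] == ib[rb.clone()] && da[ra] == db[rb]
    })
}

fn main() {
    // The same 1x2 row; `b`'s backing slices hold extra, arbitrary values
    // and its indptr does not start at 0.
    let a = (&[0usize, 2][..], &[0usize, 1][..], &[1.0, 2.0][..]);
    let b = (
        &[2usize, 4][..],
        &[9usize, 9, 0, 1, 9][..],
        &[9.0, 9.0, 1.0, 2.0, 9.0][..],
    );
    assert!(semantic_eq(a, b)); // equal, despite different backing slices
}
```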
There is a way to have a zero-copy approach without having unused/invalid elements. The idea is to use another type for the `indptr` storage, that would store a slice and an offset to apply to the values. To make this fit into the API there would need to be a trait defining everything that can be done with an indptr, and all algorithms would be implemented using this trait. This way the type system would help to prove correct usage.
That would technically be a breaking change, but not one that would affect much code in practice I guess.
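A rough sketch of what such a trait could look like (all names are hypothetical, this is not the sprs API): algorithms only go through the trait, so the offset-carrying view type can never leak out-of-range positions.

```rust
use std::ops::Range;

/// Illustrative subset of everything an algorithm needs from an indptr.
trait IndPtr {
    fn outer_dims(&self) -> usize;
    /// Window of `indices`/`data` for the i-th outer dimension.
    fn outer_range(&self, i: usize) -> Range<usize>;
}

/// Owned storage: values index directly into `indices`/`data`.
struct OwnedIndPtr(Vec<usize>);

impl IndPtr for OwnedIndPtr {
    fn outer_dims(&self) -> usize {
        self.0.len().saturating_sub(1)
    }
    fn outer_range(&self, i: usize) -> Range<usize> {
        self.0[i]..self.0[i + 1]
    }
}

/// Zero-copy view: a slice of the parent's indptr plus an offset, so the
/// view's `indices`/`data` slices can start at 0 without any copy.
struct IndPtrView<'a> {
    slice: &'a [usize],
    offset: usize,
}

impl IndPtr for IndPtrView<'_> {
    fn outer_dims(&self) -> usize {
        self.slice.len().saturating_sub(1)
    }
    fn outer_range(&self, i: usize) -> Range<usize> {
        self.slice[i] - self.offset..self.slice[i + 1] - self.offset
    }
}

fn main() {
    // A view over rows 1..3 of a parent with indptr [0, 2, 5, 6, 9].
    let view = IndPtrView { slice: &[2, 5, 6], offset: 2 };
    assert_eq!(view.outer_dims(), 2);
    assert_eq!(view.outer_range(0), 0..3);
    assert_eq!(view.outer_range(1), 3..4);
}
```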
Equality as defined is given as structural equality, so I don't think this is a big problem. The `approx` trait could be used for semantic equality.
Scipy ensures `indptr` always starts at 0, what does SuiteSparse do?
Looks like SuiteSparse enforces the `indptr` to start at 0, for instance in `ldl_valid_matrix`.
> There is a way to have a zero-copy approach without having unused/invalid elements. The idea is to use another type for the `indptr` storage, that would store a slice and an offset to apply to the values. To make this fit into the API there would need to be a trait defining everything that can be done with an indptr, and all algorithms would be implemented using this trait. This way the type system would help to prove correct usage. That would technically be a breaking change, but not one that would affect much code in practice I guess.
I've started prototyping a trait around that idea, any feedback is appreciated!
How about this approach: https://gist.github.com/mulimoen/c476ee6a2daf5ba9e341d87dfbc46e14? `data` and `indices` should always be a part of the matrix, `indptr[0]` is just a remapping down to zero.
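In code, the gist's approach could look like this sketch (hypothetical names, not the actual sprs types): the view borrows the parent's full `indices`/`data`, its `indptr` values keep indexing them directly, and `indptr[0]` serves as the offset that remaps positions down to zero, e.g. to compute the view's nnz.

```rust
/// CSR-like view that keeps the parent's full `indices`/`data` slices;
/// `indptr[0]` may be non-zero and acts as the remapping offset.
struct CsView<'a> {
    indptr: &'a [usize],
    indices: &'a [usize],
    data: &'a [f64],
}

impl<'a> CsView<'a> {
    /// Number of stored entries, remapped down to zero via `indptr[0]`.
    fn nnz(&self) -> usize {
        self.indptr.last().unwrap() - self.indptr.first().unwrap()
    }

    /// Indices and values of the i-th outer dimension, no copy needed.
    fn outer_view(&self, i: usize) -> (&'a [usize], &'a [f64]) {
        let r = self.indptr[i]..self.indptr[i + 1];
        (&self.indices[r.clone()], &self.data[r])
    }
}

fn main() {
    // Middle rows 1..3 of a parent matrix with indptr [0, 2, 5, 6].
    let indices = [0, 2, 0, 1, 2, 1];
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let view = CsView { indptr: &[2, 5, 6], indices: &indices, data: &data };
    assert_eq!(view.nnz(), 4);
    assert_eq!(view.outer_view(0), (&indices[2..5], &data[2..5]));
}
```

The nice property is that no element outside the view's windows is ever reachable through the API, without introducing a trait.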
That's a nice insight, no need for a trait with your approach!
Another alternative: `indptr` could return a `Cow`.
I really like the concrete type approach with `indptr[0]` used as an offset, that makes the type a guarantee of valid data, and it can take its own share of the structure checks. I've started implementing it, it looks like this for now: https://github.com/vbarrielle/sprs/blob/99663dcad21e34e3260b64e37b6e5fa33d14ff5a/src/sparse/indptr.rs
Looking great! I have a couple of ideas below:

* I'm missing a form of canonical `indptr`, which can be used for FFI, maybe expressed as a `Cow` for an improper indptr?
* `ndarray` calls the `slice` operation `as_slice`, should we follow the same naming?
* `new` consumes `storage` on failure, could we return this on failure?
* Internal API idea: a method `ranges` which returns an iterator of `std::ops::Range<usize>` which can be directly consumed
* A lot of the `map`s can be removed, and we could use `unwrap`, we should always have a `first` and `last` element

> * I'm missing a form of canonical `indptr`, which can be used for FFI, maybe expressed as a `Cow` for an improper indptr?
Yes this looks like a good idea.
> * `ndarray` calls the `slice` operation `as_slice`, should we follow the same naming?
I'll change that for coherence. But it's fallible here, so maybe `try_as_slice`?
> * `new` consumes `storage` on failure, could we return this on failure?
I'll see if it can be done.
> * Internal API idea: a method `ranges` which returns an iterator of `std::ops::Range<usize>` which can be directly consumed
That's interesting indeed, maybe it should be what's returned by `iter_outer`? There's one downside to that, as `Range` is not `Copy` AFAIR, but other than that I like this idea even for a public API.
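For illustration, the proposed method could be as simple as this free-function sketch (hypothetical name and signature): yield one `Range<usize>` per outer dimension, ready to index `indices`/`data` directly.

```rust
use std::ops::Range;

/// Hypothetical `ranges` helper: one window per outer dimension, taken
/// from consecutive pairs of indptr values. Note that `Range` is not
/// `Copy`, so each item is moved out of the iterator.
fn ranges(indptr: &[usize]) -> impl Iterator<Item = Range<usize>> + '_ {
    indptr.windows(2).map(|w| w[0]..w[1])
}

fn main() {
    // Three outer dimensions, the middle one empty.
    let r: Vec<Range<usize>> = ranges(&[0, 2, 2, 5]).collect();
    assert_eq!(r, vec![0..2, 2..2, 2..5]);
}
```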
> * A lot of the `map`s can be removed, and we could use `unwrap`, we should always have a `first` and `last` element
I don't think we will always have a first and last element, for instance I don't think we should disallow creating a view around 0 rows. I know in dense linear algebra this kind of edge case can be very useful.
`ndarray`'s `as_slice` returns an `Option`, so `try_as_slice` seems a bit verbose.
A size one `indptr` has two elements, so it would be "natural" to have a zero-sized `indptr` with one element. Although a length of zero would mean we could implement `Default`.
@mulimoen was that the kind of `Cow`-returning you had in mind?
That was exactly what I had in mind, we can now use this in `to_owned` on `CsMatBase`.
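A sketch of how that `Cow`-returning canonical form might look (hypothetical function, not the sprs API): borrow when the indptr already starts at 0, and only allocate the remapped copy when it doesn't.

```rust
use std::borrow::Cow;

/// Canonical indptr for FFI or `to_owned`: always starts at 0, borrowing
/// when possible and subtracting the offset otherwise.
fn canonical_indptr(indptr: &[usize]) -> Cow<'_, [usize]> {
    match indptr.first().copied() {
        // Already canonical (or empty): no copy needed.
        Some(0) | None => Cow::Borrowed(indptr),
        // A view with an offset: remap down to zero.
        Some(off) => Cow::Owned(indptr.iter().map(|x| x - off).collect()),
    }
}

fn main() {
    assert!(matches!(canonical_indptr(&[0, 2, 3]), Cow::Borrowed(_)));
    assert_eq!(canonical_indptr(&[2, 4, 7]).to_vec(), vec![0, 2, 5]);
}
```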
Hello, just an update to inform that my work on this is going to be a bit slow, my second son is born this week and I'm having much less free time. I still hope to be able to keep working on this though.
My current state is shown in #245, it's not without issues but I think most of the work is there now.
As evidenced by #242 and #243, `middle_outer_views` is not tested enough, and edge cases may happen in numerous algorithms. It's necessary to write tests for many functions involving these partial views, to reveal the presence of possible bugs.