Closed wreise closed 4 years ago
Yep, I am aware of this issue which is the result of the flagser method not taking into account the flag_matrix format and converting it to a sparse matrix. This conversion removes all non-diagonal zeros.
I have a fix and I will make a PR tomorrow.
Btw, I believe you are not setting to 0 the right indices. This is dist_0:
[[0. 0. 1.90211303 1.90211303 1.1755705 0. ] [0. 0. 0. 1.90211303 1.90211303 1.1755705 ] [1.90211303 0. 0. 0. 1.90211303 1.90211303] [1.90211303 1.90211303 0. 0. 0. 1.90211303] [1.1755705 1.90211303 1.90211303 0. 0. 1.1755705 ] [0. 1.1755705 1.90211303 1.90211303 1.1755705 0. ]]
Note the 0 on the corners instead of (6,5) and (5,6).
@wreise What I have done fixes the problem but not in all cases. In an adjacency matrix representation, a 0
have a double meaning. It both means no edge or edge right from the start. Most scipy.sparse
data structure are able to deal with that. If a 0 was purposefully inserted, it will be remembered. But then, if the adjacency matrix is too dense, they are very expensive memory-wise. However, numpy
dense arrays will always treat all 0 as explicit. This is what you want in your case, but it removes the possibility of having a double meaning for 0
s.
@gtauzin , thank you for your detailed answer and noticing my mistake. Given the performance issue that you pointed out, maybe it is better to keep the current, sparse interpretation. An alternative idea would be documenting this behavior - with the stability guarantees, I wouldn't mind.
@ulupo @wreise
I have thought of nicer solutions to simplify the problem of having a dual meaning for 0
s
1)
np.inf
weights that means no edge, leaving 0
-weighted edges with the meaning of an edge appearing at filtration value 0
.0
-weighted edges are all edges appearing at filtration value 0
. Unstored edges then consistently mean that they are np.inf
-weightedThe only problem is that this only makes sense if flag_matrix.dtype if float, which is however usually the case in real-life problems. Even well-documented, this behavior will be confusing for the user.
2)
Another solution would be to allow the user to pass a mask impicit_zeros to the flagser
method when the flag_matrix is dense.
@gtauzin I find np.inf
to be particularly attractive as it is mathematically very meaningful and in line with the behaviour of ripser
. I also like the suggested behaviour in the sparse case: how does the C++ backend currently handle unstored edges when passed sparse matrices?
@ulupo Indeed, it does make sense. But the following problem remains. If you have a dense unweighted graph, you will store it as a matrix of dtype bool. As it is not sparse, storing it in scipy.sparse
format, will be almost 3 times as costly in memory as storing it as np.ndarray
.But passing it as a dense matrix will be understood by the flagser
method as a two step filtration of a fully connected graph. While passing it as a sparse matrix yields the expected behavior. You can't have np.inf
in a np.ndarray
of bools so you're stuck. I am afraid these edge cases will make the documentation very confusing.
What I would prefer is to do both solution together. What do you think of the following:
flag_matrix : ndarray or scipy.sparse matrix, required
Matrix representation of a directed/undirected weighted/unweighted
graph. Diagonal elements are vertex weights. The way zero values are
handled depending on the format of the matrix. If the matrix is a dense
``np.ndarray``, zero values denote zero-weighted edges and ``np.inf``
values absent edges. If the matrix is a sparse ``scipy.sparse`` matrix,
explicitely stored off-diagonal zeros and all diagonal zeros denote
zero-weighted edges. Off-diagonal values that have not been explicitely
stored are treated by ``scipy.sparse`` as zeros but will be understood
as ``np.inf``-valued edges, i.e., edges absent from the filtration.
inifinte_weights : ndarray or scipy.sparse matrix of bool or ``None``,
optional, default: ``None``
Mask representing the indices of the edges that have infinite weights.
If ``None``, contains only ``False``. It is particularly useful if the
`flag_matrix` is a dense ``np.ndarray`` matrix and its dtype does not
allow ``np.inf`` values.
Also, to answer your question, what is passed to C++ flagser is a list of row indices r, column indices c, and associated weight w denoting the fact that the edge (r_i c_i) has weight w_i.
@gtauzin not directly answering the question yet, but I'm thinking that these issues seem to all arise from the fundamental fact that flagser
tries to be smart about deciding whether the input represents a filtered (directed) simplicial complex or a static complex. This is connected to another issue I've had since the beginning, which is that flagser
tries to return Betti numbers even when it returns barcodes. In my mind, this is awkward as returning barcodes suggests the input has been interpreted as a filtered complex, while returning Betti numbers suggests that it was also interpreted as a static one.
Personally, I would like to discuss whether we can separate the behaviour of flagser
more clearly between the two options. One idea would be to add another parameter which indicates how the input should be interpreted (filtration or not) – e.g. one could have persistence=True
or persistence=False
. The return values would also be different, e.g. why return Betti numbers when persistence=True
? One could only return barcodes then.
I am hopeful that this would kill two birds with one stone: clarify the purpose of the flagser
function, and remove ambiguities about the treatment of input. When persistence=True
, one could decide on a standard for sparse matrices where absent entries mean infinite weights and zero weights need to be explicitly set (btw, on this point, I would also argue that diagonal weights should be explicitly set to zero even in the sparse case, but we can discuss this separately).
In my opinion, your example would not be relevant in the case persistence=True
as a two-step filtration where everything is present in the second step is better thought about as not a filtration at all. So one would not want to compute the barcode in such a case, and rather only "static" information. When persistence=False
it would be natural for False
to behave as np.inf
(absent edge) so there is no tension.
The problem arguably remains that for int
dtypes, there is no np.inf
either. Thus, something else should denote the absence of an edge when persistence=False
and the input is an ndarray. The obvious candidate is 0
because this is consistent with False
in the boolean case. This would be OK for sparse matrices too as most of the time people assume absent entries are 0
, but is at odds with the presence of explicitly set zeros in the case when persistence=False
. So I'm not claiming the proposed solution is perfect yet.
@ulupo Thanks for the input!
The betti numbers outputed by flagser
corresponds to the number of infinite bars. I think it can be of interest in both cases (filtration or not). If you imagine that the input graph is a distance matrix thresholded by a max_edge_length
that corresponds to a good guess for the homology of the VR complex at that value to be the same than the homology of the underlying manifold the point clouds are being sampled from. Then it outputs meaningful betti numbers.
Before going any further, I would argue that the current solution already on master needs to be made into a release ASAP. The current version, v0.2.1 is almost unusable. If we do not agree on a solution now, I believe we should go forward with #32 and then #31. And then implement a nicer solution.
That being said, let me summarize here the situation because it is getting quite complex for me to follow. We have different cases (dtype
, format
, persistence
, and diagonal
).
If
dtype = float
:persistence = True
: It is reasonable to think the user wants persistence. If not, the user can easily change its dtype to bool. We can leave the responsability to the user to pass the right boolean matrix as an input to flagser.
If
format = np.ndarray
: Both0
andnp.inf
are possible and denote respectively 0-weighted edges and no edges.If
format = np.ndarray
: Explicit0
and implicit0
denote respectively 0-weighted edges and no edges.The behavior on and off
diagonal
is the same.If
dtype = int
:persistence = True
: It is also reasonable to think the user wants persistence. If not, the user can easily change its dtype to bool. We can leave the responsability to the user to pass the right boolean matrix as an input to flagser.
If
format = np.ndarray
:np.inf
is not possible and0
should most likely denote 0-weighted edges.If
format = np.ndarray
: Explicit0
and implicit0
are both possible and could denote respectively 0-weighted edges and no edges.If
dtype = bool
:persistence = False
: It is reasonable to think the user does not want persistence.
If
format = np.ndarray
:np.inf
is not possible. But aspersistence = True
, it does not matter.False
means no edge offdiagonal
and False-weight on-diagonal
.If
format = np.ndarray
: ExplicitFalse
and implicitFalse
are both possible but both mean the same.False
means no edge offdiagonal
and False-weight on-diagonal
.
Laying down all possibilities, it seems to me that the value of persistence
can always reasonably be inferred from the matrix dtype
. The actual problem we are facing is only with dtype = int
!
I believe we should forget about the infinite_weight
mask and allow np.inf
when dtype = float
and then forbid the user to pass a matrix with dtype = int
.
@gtauzin so if I understand correctly you are arguing that while a persistence
parameter would help remove ambiguities, another way consists of:
1) disallowing int
array input;
2) establishing the rule that float
is the only dtype associated to persistence (and cannot be used for "static" computations), while bool
is always associated to "static" computations (and cannot be used for persistence).
Concerning 2), I am a bit confused by the statements "False
means [...] 0-weight on-diagonal" in the dtype=bool
case. Surely, if bool
types are always associated to no persistence, we should not be talking about weights?
So far, I thought it would be good to support int data types for both filtered and unfiltered situations. In the filtered case, ints are special instances of real numbers so the same logic can apply (modulo issues with np.inf
), and int arrays are more light-weight than their float counterparts (performance). In the unfiltered case, one can see True
as corresponding to 1
and False
to 0
. I agree this second point is weaker. And the first point may be moot also if e.g. C++ flagser always converts weights to floats (is that the case?).
Description
When we set multiple values in the flagser matrix to 0 in the undirected case, pyflagser behaves differently than ripser.
Interestingly, when we replace setting the value used to set the upper and lower diagonal in the flagser matrix by a small, positive one (ex. 1e-8), the result is as expected.
Steps/Code to Reproduce
Expected Results
Actual Results
Versions
macOS-10.15.3-x86_64-i386-64bit Python 3.8.2 (default, Mar 26 2020, 10:43:30) [Clang 4.0.1 (tags/RELEASE_401/final)] pyflagser 0.2.1