giotto-ai / pyflagser

Python bindings and API for the flagser C++ library (https://github.com/luetge/flagser).

Zero values outside of the diagonal #27

Closed wreise closed 4 years ago

wreise commented 4 years ago

Description

When we set multiple off-diagonal values of the flagser input matrix to 0 in the undirected case, pyflagser behaves differently from ripser.

Interestingly, when we use a small positive value (e.g. 1e-8) instead of 0 for those super- and sub-diagonal entries, the result is as expected.

Steps/Code to Reproduce

import numpy as np
from pyflagser import flagser
from gtda.homology import VietorisRipsPersistence
from sklearn.metrics import euclidean_distances

# Points sampled on the unit circle (the first and last point coincide)
nb_points = 5
t = np.linspace(0, 1, nb_points + 1)
x, y = np.cos(2*np.pi*t - 0.1), np.sin(2*np.pi*t - 0.1)
X = np.stack([x, y], axis=1)
dist_0 = euclidean_distances(X)

# Set several off-diagonal entries of the distance matrix to 0
dist_0[range(0, nb_points-1), range(1, nb_points)] = 0.
dist_0[range(1, nb_points), range(0, nb_points-1)] = 0.

flagser(dist_0, min_dimension=0, max_dimension=1, directed=False, coeff=2)['dgms'][1]
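
For comparison (assuming the standalone ripser package is installed; the VietorisRipsPersistence transformer imported above is based on ripser), the same matrix should reproduce the diagram listed under Expected Results:

from ripser import ripser

# H1 diagram computed by ripser on the same modified distance matrix
ripser(dist_0, distance_matrix=True, maxdim=1, coeff=2)['dgms'][1]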

Expected Results

array([[1.17557049, 1.90211308]])

Actual Results

array([[1.90211308,        inf]])

Versions

macOS-10.15.3-x86_64-i386-64bit Python 3.8.2 (default, Mar 26 2020, 10:43:30) [Clang 4.0.1 (tags/RELEASE_401/final)] pyflagser 0.2.1

gtauzin commented 4 years ago

Yep, I am aware of this issue. It is the result of the flagser method not taking the flag_matrix format into account and converting it to a sparse matrix; this conversion removes all off-diagonal zeros.

I have a fix and I will make a PR tomorrow.

gtauzin commented 4 years ago

Btw, I believe you are not setting the right indices to 0. This is dist_0:

[[0.         0.         1.90211303 1.90211303 1.1755705  0.        ]
 [0.         0.         0.         1.90211303 1.90211303 1.1755705 ]
 [1.90211303 0.         0.         0.         1.90211303 1.90211303]
 [1.90211303 1.90211303 0.         0.         0.         1.90211303]
 [1.1755705  1.90211303 1.90211303 0.         0.         1.1755705 ]
 [0.         1.1755705  1.90211303 1.90211303 1.1755705  0.        ]]

Note the 0s at the corners instead of at (6,5) and (5,6).

gtauzin commented 4 years ago

@wreise What I have done fixes the problem, but not in all cases. In an adjacency matrix representation, a 0 has a double meaning: it means either no edge, or an edge present right from the start. Most scipy.sparse data structures are able to deal with that: if a 0 was purposefully inserted, it will be remembered. But if the adjacency matrix is too dense, they are very expensive memory-wise. Numpy dense arrays, however, always treat all 0s as explicit. This is what you want in your case, but it removes the possibility of having a double meaning for 0s.
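
For instance, here is a minimal sketch of that behavior (the example matrix is illustrative only):

import numpy as np
from scipy.sparse import csr_matrix

# A 3-vertex graph in which the (0, 1) edge is meant to have weight 0
dense = np.array([[0., 0., 2.],
                  [0., 0., 1.],
                  [2., 1., 0.]])

# Converting the dense array to sparse silently drops the off-diagonal zeros
print(csr_matrix(dense).nnz)  # 4 stored entries: the 0-weighted edge is gone

# Building the matrix from (data, (row, col)) triplets keeps purposefully
# inserted zeros as explicitly stored entries
rows = np.array([0, 1, 0, 2, 1, 2])
cols = np.array([1, 0, 2, 0, 2, 1])
data = np.array([0., 0., 2., 2., 1., 1.])
explicit = csr_matrix((data, (rows, cols)), shape=(3, 3))
print(explicit.nnz)  # 6 stored entries: the two explicit zeros are remembered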

wreise commented 4 years ago

@gtauzin, thank you for your detailed answer and for noticing my mistake. Given the performance issue that you pointed out, maybe it is better to keep the current, sparse interpretation. An alternative would be to document this behavior - given the stability guarantees, I wouldn't mind.

gtauzin commented 4 years ago

@ulupo @wreise I have thought of nicer solutions to simplify the problem of 0s having a dual meaning:

1) One option would be to let ``np.inf`` values in a dense `flag_matrix` denote absent edges (and, in the sparse case, to treat implicitly stored zeros as absent edges and explicitly stored zeros as zero-weighted edges). The only problem is that this only makes sense if `flag_matrix.dtype` is float, which is however usually the case in real-life problems. Even well documented, this behavior will be confusing for the user.

2) Another solution would be to allow the user to pass a mask implicit_zeros to the flagser method when the flag_matrix is dense.

ulupo commented 4 years ago

@gtauzin I find np.inf to be particularly attractive as it is mathematically very meaningful and in line with the behaviour of ripser. I also like the suggested behaviour in the sparse case: how does the C++ backend currently handle unstored edges when passed sparse matrices?

gtauzin commented 4 years ago

@ulupo Indeed, it does make sense. But the following problem remains. If you have a dense unweighted graph, you will store it as a matrix of dtype bool. As it is not sparse, storing it in scipy.sparse format will be almost 3 times as costly in memory as storing it as an np.ndarray. But passing it as a dense matrix will be understood by the flagser method as a two-step filtration of a fully connected graph, while passing it as a sparse matrix yields the expected behavior. You can't have np.inf in an np.ndarray of bools, so you're stuck. I am afraid these edge cases will make the documentation very confusing.
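
A quick way to compare the memory footprints of the two representations (the exact ratio depends on the index dtype scipy chooses):

import numpy as np
from scipy.sparse import csr_matrix

n = 1000
dense = np.ones((n, n), dtype=bool)  # fully connected, unweighted graph
sparse = csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)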

What I would prefer is to do both solutions together. What do you think of the following:

flag_matrix : ndarray or scipy.sparse matrix, required
    Matrix representation of a directed/undirected weighted/unweighted
    graph. Diagonal elements are vertex weights. The way zero values are
    handled depends on the format of the matrix. If the matrix is a dense
    ``np.ndarray``, zero values denote zero-weighted edges and ``np.inf``
    values absent edges. If the matrix is a sparse ``scipy.sparse`` matrix,
    explicitly stored off-diagonal zeros and all diagonal zeros denote
    zero-weighted edges. Off-diagonal values that have not been explicitly
    stored are treated by ``scipy.sparse`` as zeros but will be understood
    as ``np.inf``-valued edges, i.e., edges absent from the filtration.

infinite_weights : ndarray or scipy.sparse matrix of bool or ``None``,
    optional, default: ``None``
    Mask representing the indices of the edges that have infinite weights.
    If ``None``, contains only ``False``. It is particularly useful if the
    `flag_matrix` is a dense ``np.ndarray`` matrix and its dtype does not
    allow ``np.inf`` values.

Also, to answer your question, what is passed to C++ flagser is a list of row indices r, column indices c, and associated weights w, denoting the fact that the edge (r_i, c_i) has weight w_i.
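
Roughly, this is the kind of triplet extraction involved (a minimal sketch with an illustrative matrix; the variable names are not the actual internals):

import numpy as np
from scipy.sparse import csr_matrix

# A 3-vertex graph with an explicitly stored zero-weighted edge (0, 1)
flag_matrix = csr_matrix(
    (np.array([0., 1.5]), (np.array([0, 1]), np.array([1, 2]))), shape=(3, 3)
)

coo = flag_matrix.tocoo()
rows, cols, weights = coo.row, coo.col, coo.data
# edge (rows[i], cols[i]) has weight weights[i]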

ulupo commented 4 years ago

@gtauzin not directly answering the question yet, but I'm thinking that these issues seem to all arise from the fundamental fact that flagser tries to be smart about deciding whether the input represents a filtered (directed) simplicial complex or a static complex. This is connected to another issue I've had since the beginning, which is that flagser tries to return Betti numbers even when it returns barcodes. In my mind, this is awkward as returning barcodes suggests the input has been interpreted as a filtered complex, while returning Betti numbers suggests that it was also interpreted as a static one.

Personally, I would like to discuss whether we can separate the behaviour of flagser more clearly between the two options. One idea would be to add another parameter which indicates how the input should be interpreted (filtration or not) – e.g. one could have persistence=True or persistence=False. The return values would also be different, e.g. why return Betti numbers when persistence=True? One could only return barcodes then.

I am hopeful that this would kill two birds with one stone: clarify the purpose of the flagser function, and remove ambiguities about the treatment of input. When persistence=True, one could decide on a standard for sparse matrices where absent entries mean infinite weights and zero weights need to be explicitly set (btw, on this point, I would also argue that diagonal weights should be explicitly set to zero even in the sparse case, but we can discuss this separately).

In my opinion, your example would not be relevant in the case persistence=True, as a two-step filtration where everything is present in the second step is better thought of as not a filtration at all. So one would not want to compute the barcode in such a case, but rather only "static" information. When persistence=False it would be natural for False to behave as np.inf (absent edge), so there is no tension.

The problem arguably remains that for int dtypes, there is no np.inf either. Thus, something else should denote the absence of an edge when persistence=False and the input is an ndarray. The obvious candidate is 0 because this is consistent with False in the boolean case. This would be OK for sparse matrices too as most of the time people assume absent entries are 0, but is at odds with the presence of explicitly set zeros in the case when persistence=False. So I'm not claiming the proposed solution is perfect yet.
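
To make the idea concrete, here is a hypothetical sketch (the name flagser_split and the return layout are illustrative only, and it only shows the split in return values, not the different interpretation of the input):

import numpy as np
from pyflagser import flagser

def flagser_split(flag_matrix, persistence=True, **kwargs):
    out = flagser(flag_matrix, **kwargs)
    if persistence:
        # Filtered interpretation: return the barcodes only
        return {'dgms': out['dgms']}
    # Static interpretation: return only "static" information, here the
    # number of infinite bars per dimension (i.e. Betti numbers)
    return {'betti': [int(np.isinf(dgm[:, 1]).sum()) for dgm in out['dgms']]}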

gtauzin commented 4 years ago

@ulupo Thanks for the input!

The Betti numbers output by flagser correspond to the number of infinite bars. I think they can be of interest in both cases (filtration or not). If you imagine that the input graph is a distance matrix thresholded at a max_edge_length chosen so that the homology of the Vietoris-Rips complex at that value is a good guess for the homology of the underlying manifold the point cloud is sampled from, then flagser outputs meaningful Betti numbers.
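
As a small check of that statement, reusing dist_0 from the reproduction snippet above, the number of infinite bars per dimension can be counted directly from the diagrams:

import numpy as np
from pyflagser import flagser

dgms = flagser(dist_0, min_dimension=0, max_dimension=1, directed=False, coeff=2)['dgms']
print([int(np.isinf(dgm[:, 1]).sum()) for dgm in dgms])  # infinite bars per dimension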

Before going any further, I would argue that the current solution already on master needs to be made into a release ASAP; the current version, v0.2.1, is almost unusable. If we do not agree on a solution now, I believe we should go forward with #32 and then #31, and implement a nicer solution afterwards.

That being said, let me summarize the situation here because it is getting quite complex for me to follow. We have different cases (dtype, format, persistence, and diagonal).

If dtype = float: persistence = True. It is reasonable to think the user wants persistence. If not, the user can easily change the dtype to bool; we can leave the responsibility to the user to pass the right boolean matrix as input to flagser.

  • If format = np.ndarray: both 0 and np.inf are possible and denote, respectively, zero-weighted edges and absent edges.

  • If format = scipy.sparse: explicit 0s and implicit 0s denote, respectively, zero-weighted edges and absent edges.

  • The behavior on and off the diagonal is the same.

If dtype = int: persistence = True. It is also reasonable to think the user wants persistence. If not, the user can easily change the dtype to bool; we can leave the responsibility to the user to pass the right boolean matrix as input to flagser.

  • If format = np.ndarray: np.inf is not representable and 0 should most likely denote zero-weighted edges.

  • If format = scipy.sparse: explicit 0s and implicit 0s are both possible and could denote, respectively, zero-weighted edges and absent edges.

If dtype = bool: persistence = False. It is reasonable to think the user does not want persistence.

  • If format = np.ndarray: np.inf is not representable, but as persistence = False it does not matter. False means no edge off the diagonal and a False weight on the diagonal.

  • If format = scipy.sparse: explicit False and implicit False are both possible but mean the same thing: no edge off the diagonal and a False weight on the diagonal.

Laying out all the possibilities, it seems to me that the value of persistence can always reasonably be inferred from the matrix dtype. The actual problem we are facing is only with dtype = int!

I believe we should forget about the infinite_weights mask, allow np.inf when dtype = float, and forbid the user from passing a matrix with dtype = int.
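
A hypothetical helper (not part of pyflagser) encoding that dtype rule could look like this:

import numpy as np

def infer_persistence(flag_matrix):
    dtype = flag_matrix.dtype
    if np.issubdtype(dtype, np.floating):
        return True   # filtration: 0 is a zero-weighted edge, np.inf an absent edge
    if dtype == np.bool_:
        return False  # static complex: False means no edge
    raise TypeError(
        "dtype {} is ambiguous; cast to float (persistence) or bool (static)".format(dtype)
    )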

ulupo commented 4 years ago

@gtauzin so if I understand correctly, you are arguing that while a persistence parameter would help remove ambiguities, another way consists of: 1) disallowing int array input; 2) establishing the rule that float is the only dtype associated with persistence (and cannot be used for "static" computations), while bool is always associated with "static" computations (and cannot be used for persistence).

Concerning 2), I am a bit confused by the statements "False means [...] 0-weight on-diagonal" in the dtype=bool case. Surely, if bool types are always associated to no persistence, we should not be talking about weights?

So far, I thought it would be good to support int data types for both filtered and unfiltered situations. In the filtered case, ints are special instances of real numbers so the same logic can apply (modulo issues with np.inf), and int arrays are more lightweight than their float counterparts (performance). In the unfiltered case, one can see True as corresponding to 1 and False to 0. I agree this second point is weaker, and the first point may also be moot if e.g. C++ flagser always converts weights to floats (is that the case?).