pysal / libpysal

Core components of Python Spatial Analysis Library
http://pysal.org/libpysal
Other
265 stars 78 forks source link

Weights sprint planning #534

Closed ljwolf closed 4 months ago

ljwolf commented 1 year ago

I'm starting this thread to begin discussions about the upcoming weights sprint. The current experimental implementation is in the geographs branch, the weights/experimental module. The ROADMAP.md will need updating...

martinfleis commented 1 year ago

Things we have right now:

What needs to be done:

Where we're blocked by scipy.sparse development

My suggestion would be to focus on testing right now. The code is becoming too large and it shall be thoroughly covered before we start moving forward.

martinfleis commented 1 year ago

I also think that we should start making PRs into the geography branch rather than direct commit to ensure proper review.

ljwolf commented 1 year ago

I will try to finish scipy/scipy#18544, which should unblock higher order. We will need to also identify why connected components fails in the csgraph side, but that should be doable tomorrow.

If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?

martinfleis commented 1 year ago

csgraph does seem to generally require matrix for now.

If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?

Maybe as on-the-fly conversion when necessary?

ljwolf commented 1 year ago

We are currently not sure if there is an "adaptive" bandwidth parameter allowed in the kernel() function.

ljwolf commented 1 year ago

the kernel() function may also need to implement a "threshold" option that lets you keep only points closer than threshold as pairs.

ljwolf commented 1 year ago

Not going to be able to complete scipy/scipy#18544 today, but will aim to next sprint.

knaaptime commented 1 year ago

TLDR:

I'm trying to wrap my head around the idea of input data structures versus desired output as first-class citizens for constructing $W$s

so, after our call, i think the idea is to always treat the input data structure as primary. Right now, we have a from_contiguity constructor, but i'm sure I heard Levi say we won't have such a thing in the new structure (and analogously we won't have from_kernel. So, while we need to "hook kernel constructors to the base class", that doesnt mean creating a from_kernel class method thats analogous to from_contiguity

Instead, the idea is that from_contiguity will become private, and we will create a larger class method that's from_dataframe? So like, currently, the kernel functions take an argument called coordinates which gets passed to a geometry validator. That means the input must be a geodataframe, not, say a distance matrix. So even though we want to allow distance=precomputed, that wont work because L109 is forcing an array of geoms

so if we want to create a kernel-based W, and we are treating input data structure as first class, then i need to do something like

W.from_dataframe(**kernel_args?), because we want to avoid W.from_kernel(**input_data_args?)? if that's the case, then the primary purpose of W.from_dataframe() is to dispatch to helper functions that generate the correct data structure for different types of weights? Then it will end up subsuming from_contiguity and to create a queen weight, i'd do something like W.from_dataframe(gdf, 'contiguity', queen=True)?, conversely if i needed a kernel from precomputed distances id do something like W.from_csc(distance_matrix, 'kernel', kernel_function='gaussian')?, whereas if i needed it to build on the fly with euclidean distances i'd dl W.from_dataframe(gdf, 'kernel', kernel_function='gaussian'). This all applies to triangulation as well.

The alternative would be to leave desired output as first class, which would mean leaving from_contiguity intact, then adding analogous from_kernel, from_triangulation, etc

ljwolf commented 1 year ago

I did not recall we had implemented some of them already. That's my misunderstanding, so I was mistaken. Sorry!

If from_contiguity is already there, run with that and add them for now.

From a design angle, I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type? I hoped to avoid typing like this in favor of the construction functions outputting plain W objects, since we get into tricky inheritance consistency issues (like, is a kernel weighed voronoi an instance of rook, kernel, Delaunay, or all?). But, we may need to chat about this next sprint?

knaaptime commented 1 year ago

From a design angle, I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type?

yeah, completely agree. You dont do from_kernel, because your input data isnt "a kernel".

so the issue is we want a generic W class, and it should be instantiated explicitly from different data structures. But there are also genuinely different subtypes of W, as they mean different things and require different computations to create them. Before, we had subclasses for each type of W, and they inhereited the same constructors for different input formats. I agree we want to avoid weird inheritance, and subclassing W more generally, but then the only way i can think to handle the combination of w_type * data_structure is to slightly overload the arguments for data-structure-based class methods, like

W.from_dataframe(gdf, 'contiguity', queen=True)

that gets messy because it needs to accept all the kwargs for constructing all subtypes of W from that data structure, but its conceptually strauightforward (and the converse of our previous design, where w-subtype was first-class, and it inhereted constructors for the shared inuput data types)

knaaptime commented 1 year ago

or do we take a generic w from input W.from_dataframe(gdf)

and define a conversion to_kernel, to_voronoi, etc?

ljwolf commented 1 year ago

Maybe “from_dataframe” should be the constructor for an arbitrary adjacency table?

Get Outlook for iOShttps://aka.ms/o0ukef


From: eli knaap @.> Sent: Thursday, July 20, 2023 6:12:03 PM To: pysal/libpysal @.> Cc: Levi John Wolf @.>; Mention @.> Subject: Re: [pysal/libpysal] Weights sprint planning (Issue #534)

This message could be from someone attempting to impersonate a member of UoB. Please do not share information with the sender without verifying their identity. If in doubt, please contact the IT Service Desk for advice. --

or do we take a generic w from input W.from_dataframe(gdf)

and define a conversion to_kernel, to_voronoi, etc?

— Reply to this email directly, view it on GitHubhttps://github.com/pysal/libpysal/issues/534#issuecomment-1644290716, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AARFR4YZZN42OQ5YHBEZKW3XRFRGHANCNFSM6AAAAAA2OCBVBU. You are receiving this because you were mentioned.Message ID: @.***>

sjsrey commented 1 year ago

@ljwolf question on the signature of delaunay

Should the first arg be an array or a geoseries/gdf? https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L25-L36

as later on geoms is passed for validation:

https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L77-L79

Not sure if the docs were to be updated or the check requires some preprocessing as geoms is undefined as currently implemented. This made me think the first argument is intended to be something other than documented at present?

martinfleis commented 1 year ago

I don't think having one from_dataframe method is a good idea. It would have a ton of conditional keywords and would be messy and hard to understand.

I think that constructors like from_triangulation could easily accept both GeoPandas objects and an array of coords.

or do we take a generic w from input W.from_dataframe(gdf)

and define a conversion to_kernel, to_voronoi, etc?

No. Once you create W you have the adjacency. From there, it shouldn't matter if that table was created based on contiguity, kernel or using manually encoded weights. You don't have any information that could be used to do conversion to voronoi at that point and you should not have imho.

I'm not sure I properly understand the problem though.

ljwolf commented 1 year ago

@ljwolf question on the signature of delaunay

Should the first arg be an array or a geoseries/gdf?

The input should be able to be either an n,2 array of coordinates, a geoseries of POINT, or a numpy array with POINT shapely geometries. The input coordinates should be validated using:

https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_utils.py#L25-L29

and the output of that is used for the computation.

So, everywhere where we use _validate_geometry_input() in _triangulation, for example:

https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L77-L79

geoms should be coordinates

ljwolf commented 1 year ago

I don't think having one from_dataframe method is a good idea. It would have a ton of conditional keywords and would be messy and hard to understand.

I think that constructors like from_triangulation could easily accept both GeoPandas objects and an array of coords.

or do we take a generic w from input W.from_dataframe(gdf) and define a conversion to_kernel, to_voronoi, etc?

No. Once you create W you have the adjacency. From there, it shouldn't matter if that table was created based on contiguity, kernel or using manually encoded weights. You don't have any information that could be used to do conversion to voronoi at that point and you should not have imho.

I'm not sure I properly understand the problem though.

My preferred option would be:

  1. the constructor functions (kernel(), vertex_set_intersection(), rook(), delaunay(), gabriel()) take GeoSeries or arrays of coordinates, and return plain W objects. So, calls look like
    • my_w = kernel(my_gdf.geometry, k=3)
    • my_w = delaunay(my_coordinates).
  2. the W class itself uses from_* to read/convert existing objects into a W structure. So, calls look like:
    • W.from_networkx(nx_graph)
    • W.from_file('./columbus.gal')
    • W.from_sparse(my_coo)
    • W.from_adjacency(my_df, focal_col='here', neighbor_col='there', weight_col='strength')
    • W.from_arrays(my_df.here, my_df.there, my_df.strength)

No W.from_dataframe(), so that there's no confusion about what "dataframe" means.

So, instead of

W.from_dataframe(gdf, 'contiguity', queen=True)

We document/recommend

w = weights.contiguity.rook(gdf)

martinfleis commented 1 year ago

We document/recommend

w = weights.contiguity.rook(gdf)

Mmm, not sure I am fan of that. The solution with from_contiguity and similar is elegant in a sense that you always create W with some W.from* constructor and you don't have fish in the documentation and different sub modules how to create one.

What is the issue with this API design? I'm still not getting the original problem with that that needs solving.

ljwolf commented 1 year ago

We document/recommend w = weights.contiguity.rook(gdf)

Mmm, not sure I am fan of that. The solution with from_contiguity and similar is elegant in a sense that you always create W with some W.from* constructor and you don't have fish in the documentation and different sub modules how to create one.

What about pulling them up into the higher-level namespace? weights.rook(gdf)/weights.kernel(gdf)?

What is the issue with this API design? I'm still not getting the original problem with that that needs solving.

From @ljwolf https://github.com/pysal/libpysal/issues/534#issuecomment-1644268455

I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type?

From @knaaptime https://github.com/pysal/libpysal/issues/534#issuecomment-1644278338

yeah, completely agree. You dont do from_kernel, because your input data isnt "a kernel".

knaaptime commented 1 year ago

My preferred option would be:

  1. the constructor functions (kernel(), vertex_set_intersection(), rook(), delaunay(), gabriel()) take GeoSeries or arrays of coordinates, and return plain W objects. So, calls look like
  • my_w = kernel(my_gdf.geometry, k=3)
  • my_w = delaunay(my_coordinates).
  1. the W class itself uses from_* to read/convert existing objects into a W structure. So, calls look like:
  • W.from_networkx(nx_graph)
  • W.from_file('./columbus.gal')
  • W.from_sparse(my_coo)
  • W.from_adjacency(my_df, focal_col='here', neighbor_col='there', weight_col='strength')
  • W.from_arrays(my_df.here, my_df.there, my_df.strength)

No W.from_dataframe(), so that there's no confusion about what "dataframe" means.

So, instead of

W.from_dataframe(gdf, 'contiguity', queen=True)

We document/recommend

w = weights.contiguity.rook(gdf)

i get it. So we have constructor functions rather than classmethod constructors when you need to generate the encoding, and _from classmethods when you already have the encoding laid out in an existing data structure. Thats a change of pace, but it makes sense to me

What about pulling them up into the higher-level namespace? weights.rook(gdf)/weights.kernel(gdf)?

maybe this. I do think it would make things easier to do from libpysal import weights and be able to tab complete to see there's a rook function, etc. That makes it a little clearer rook is definitely a weights builder

knaaptime commented 1 year ago

What is the issue with this API design? I'm still not getting the original problem with that that needs solving.

@martinfleis if we still need to "hook kernel (and triangulation) constructors up to the base class", what does that mean? whats a the intended signature for instantiating a kernel-based W from a dataframe? whats the signature for instantiating a kernel-based W from a precomputed distance matrix? (since we agree _from_kernel doesnt make sense)?

ljwolf commented 1 year ago

kernel weight from a precomputed distance matrix

That should be

weights.kernel(D, metric="precomputed", kernel="gaussian")

martinfleis commented 1 year ago

I originally assumed that it means from_kernel and from_triangulation methods. But if you don't think that it is a good idea let's figure out a better option.

I suggest we try to ensure we have the internal functions that create those adjacency tables finished first and the we can discuss the public API during a call on Friday during the next sprint? (Note that I have only my phone with me until Sunday so I haven't checked the actual state of the code)

martinfleis commented 1 year ago

This is what I had in mind.

    @classmethod
    def from_kernel(
        cls,
        data,
        bandwidth=None,
        metric="euclidean",
        kernel="gaussian",
        k=None,
        p=2,
    ):
        """_summary_

        Parameters
        ----------
        data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
            geometries over which to compute a kernel. If a geopandas object with Point
            geoemtry is provided, the .geometry attribute is used. If a numpy.ndarray
            with shapely geoemtry is used, then the coordinates are extracted and used.
            If a numpy.ndarray of a shape (2,n) is used, it is assumed to contain x, y
            coordinates. If metric="precomputed", data is assumed to contain a
            precomputed distance metric.
        bandwidth : float (default: None)
            distance to use in the kernel computation. Should be on the same scale as
            the input coordinates.
        metric : string or callable (default: 'euclidean')
            distance function to apply over the input coordinates. Supported options
            depend on whether or not scikit-learn is installed. If so, then any
            distance function supported by scikit-learn is supported here. Otherwise,
            only euclidean, minkowski, and manhattan/cityblock distances are admitted.
        kernel : string or callable (default: 'gaussian')
            kernel function to apply over the distance matrix computed by `metric`.
            The following kernels are supported:
                - triangular:
                - parabolic:
                - gaussian:
                - bisquare:
                - cosine:
                - boxcar/discrete: all distances less than `bandwidth` are 1, and all
                    other distances are 0
                - identity/None : do nothing, weight similarity based on raw distance
                - callable : a user-defined function that takes the distance vector and
                    the bandwidth and returns the kernel: kernel(distances, bandwidth)
        k : int (default: None)
            number of nearest neighbors used to truncate the kernel. This is assumed
            to be constant across samples. If None, no truncation is conduted.
        ids : numpy.narray (default: None)
            ids to use for each sample in coordinates. Generally, construction functions
            that are accessed via W.from_kernel() will set this automatically from
            the index of the input. Do not use this argument directly unless you intend
            to set the indices separately from your input data. Otherwise, use
            data.set_index(ids) to ensure ordering is respected. If None, then the index
            from the input coordinates will be used.
        p : int (default: 2)
            parameter for minkowski metric, ignored if metric != "minkowski".

        Returns
        -------
        W
            libpysal.weights.experimental.W
        """

This in theory supports also DistanceBand and KNN but it may be worth exposing those as separate constructors for the sake of user friendliness. We can pipe those through this internally.

    @classmethod
    def from_triangulation(
        cls,
        data,
        method="delaunay",
        bandwidth=np.inf,
        kernel="boxcar",
        clip="extent",
        contiguity_type="rook",
    ):
        """_summary_

        Parameters
        ----------
        data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
            geometries containing locations to compute the
            delaunay triangulation. If a geopandas object with Point
            geoemtry is provided, the .geometry attribute is used. If a numpy.ndarray
            with shapely geoemtry is used, then the coordinates are extracted and used.
            If a numpy.ndarray of a shape (2,n) is used, it is assumed to contain x, y
            coordinates.
        method : str, (default "delaunay")
            method of extracting the weights from triangulation. Supports:
                - delaunay
                - gabriel
                - relative_neighborhood
                - voronoi
        bandwidth : _type_, optional
            distance to use in the kernel computation. Should be on the same scale as
            the input coordinates, by default numpy.inf
        kernel : str, optional
            kernel function to use in order to weight the output graph. See
            :meth:`W.from_kernel` for details. By default "boxcar"
        clip :str (default: 'bbox')
            An overloaded option about how to clip the voronoi cells passed to
            cg.voronoi_frames() when method="voronoi. Ignored otherwise.
            Default is ``'extent'``. Options are as follows.

            * ``'none'``/``None`` -- No clip is applied. Voronoi cells may be
                arbitrarily larger that the source map. Note that this may lead to
                cells that are many orders of magnitude larger in extent than the
                original map. Not recommended.
            * ``'bbox'``/``'extent'``/``'bounding box'`` -- Clip the voronoi cells to
                the bounding box of the input points.
            * ``'chull``/``'convex hull'`` -- Clip the voronoi cells to the convex hull
                of the input points.
            * ``'ashape'``/``'ahull'`` -- Clip the voronoi cells to the tightest hull
                that contains all points (e.g. the smallest alphashape, using
                ``libpysal.cg.alpha_shape_auto``).
            * Polygon -- Clip to an arbitrary Polygon.
        contiguity_type : str, optional
            What kind of contiguity to apply to the voronoi diagram when ethod="voronoi.
            Ignored otherwise. Supports "rook" and "queen", by default "rook".

        Returns
        -------
        W
            libpysal.weights.experimental.W
        """

@ljwolf I think that for voronoi, we don't need to expose rook and queen constructors and can go with the vertex method only because the diagram is always topologically precise.

Was there any discussion on these last time that concluded that this is not an optimal API?

knaaptime commented 1 year ago

Was there any discussion on these last time that concluded that this is not an optimal API?

yeah, lets discuss the tradeoffs on friday. Given the initial layout in the experimental branch, what you sketched above was my original understanding too. But I think we decided the from_* nomenclature is a bit misleading. When you do from_kernel above, you pass the function a geodataframe, not a kernel. The same with triangulation; the first argument is either a geometry object or a list of coordinates, then an intermediate kernel/triangulation is constructed immediately before giving you a weights object. So it's not really weights object from_kernel, it's kernel from_dataframe to weights object.

I can see arguments both ways. It feels natural to do .from_kernel in line with from_contiguity, but the semantics of that aren't actually accurate

martinfleis commented 1 year ago

Got it. Let's pick it up on Friday. We may keep the signature and just change the names to something more fitting eventually.

ljwolf commented 1 year ago

So, on the sprint call, we considered two following APIs:

all constructors "live" on the W

In this example, the core design principle is:

W is self-contained. Most of what you need to do can be done only with the W object, not requiring the weights namespace.

The W.build_...() idiom constructs a W under a known graph construction rule, such as:

W.build_rook(geodataframe.geometry) # calls weights._contiguity.rook(geodataframe.geometry)
W.build_kernel(coordinate_array) # calls weights._kernel.kernel(coordinate_array)

The W.from_...() idiom constructs a W from a specific datastructure

W.from_file("columbus.gal")
W.from_arrays(head, tail, weight)
W.from_sparse(my_coo_array)

To make good on this design decision, we should also implement a .higher_order(), a .lag(), etc. This is a pandas-style API, rather than expecting users to switch between functions in weights and methods on W. This allows for method-chaining, such as:

W.build_kernel(geoseries).transform('r').lag(geoseries.home_value)

all constructors "live" in a non-object namespace

In this example, the core design principle is

W is only a results container. Functions in weights operate on top of W objects and generally construct or mutate them.

We would have to implement a construct namespace, so that:

weights.construct.contiguity(geodataframe.geometry) # calls weights._contiguity.rook(geodataframe.geometry) by default
# or the constructor actually lives at `weights.construct.contiguity`
weights.construct.kernel(coordinate_array) # calls weights._kernel.kernel(coordinate_array)

An ingest (convert?) namespace takes a dataframe and "ingests" it to a W:

weights.ingest.file("columbus.gal")
weights.ingest.arrays(head, tail, weight)
weights.ingest.sparse(my_coo_array)

and any of our functions in weights act on W objects to build (or mutate) W objects. This is a numpy-style API.

@knaaptime @martinfleis @ljwolf @rongboxu agree on the method strategy, option 1.

ljwolf commented 1 year ago

On the coincident points issue identified in Delaunay, we need to be consistent about how the new implementation deals with coincident points. Naively, Qhull implements a bunch of options. but we need to define a consistent way to deal with coincident points in KNN, too.

ljwolf commented 1 year ago

For handling coincident points, we discussed the following options:

jitter input upon coincidence

randomly permute/jitter the input points if there is a coincident point

construct a clique

reduce the dataset to unique locations, construct the weight, and then introduce a "clique" into the graph connecting all coincident points together and also to other neighbours according to the construction rule.

raise every time that there is a coincident point

detect that there is a coincident point and raise an error if that is the case.

@knaaptime @sjsrey @martinfleis @TaylorOshan @rongboxu @ljwolf agree that the implementation should generally follow the "raise" strategy.

All weights constructors should take a "coincident" argument that can take three values:

sjsrey commented 1 year ago

@sjsrey @knaaptime @ljwolf @martinfleis agreed to rename W to Graph and move the namespace.

ljwolf commented 1 year ago

I've just realised: for islands, the nonzero property will count them as a nonzero, since self.sparse.nnz counts "explicit" zeros as nonzero. This might throw off calculations in join counts, since a self-join is always going to contribute to positive autocorrelation (other cases, like Getis-Ord and Moran, we need it)...

We might want to think about using n_edges/n_nodes and then define nonzero (number of non-self loops) as self.n_edges - len(self.isolates)

martinfleis commented 1 year ago

n_edges is just len(self._adjacency) no?

ljwolf commented 1 year ago

yep

knaaptime commented 1 year ago

id ordering not respected when roundtripping through sparse

https://gist.github.com/knaaptime/d17b2d233c663aeec3e0e95487910560

martinfleis commented 1 year ago

@knaaptime regarding this roundtripping - the distance band with isolates or without isolates works fine. The Graphs are equal, the sorting within a single focal does not matter. Sparse is also equal based on both (original and roundtripped).

The issue with KNN has nothing to do with sorting or roundtripping but with duplicated point locations. KNN builder drops 4 observations. That will be partially solved by the application of #548 to KNN. I will also add length checks to raise when the ids != length of the sparse.

martinfleis commented 1 year ago

More generally on sorting - I believe that the only sorting that matters is the position of the first observation of each focal. We only need to ensure that the arrays are sorted in a way that ensures that focal.unique() is equal to the index of original observations from which the Graph is being built. That ensures that any conversion depending on the order is correct since it will either depend on _id2i based on .unique() or on factorize, resulting in the same.

Edit: This works under an assumption that every observation is present in focal, either with a neighbor or as a self-loop with 0.

knaaptime commented 1 year ago

More generally on sorting - I believe that the only sorting that matters is the position of the first observation of each focal. We only need to ensure that the arrays are sorted in a way that ensures that focal.unique() is equal to the index of original observations from which the Graph is being built.

exactly. We always need to respect the index of the original input, which means sometimes we are forced to sort. That's where i ended up in #513 too, so i just used a reindex on those ids at both levels, which is the same as keeping id2i around. Groupby ops will break that ordering, so we need to be able to reindex by the focal ordering arbitrarily (which is why i let sort_neighbors stick around). That way 'the position of the first observation of each focal' is always respected

I was going directly from adj to sparse, so had to keep track of the column indices as well otherwise you hit the pivot issue

ljwolf commented 1 year ago

I think that, insofar as we have used the term, "sorting" refers to re-ordering the weights into an arbitrary order with respect to the input data. Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data. I don't think column ordering/tail ordering can be guaranteed unless ids are provided?

knaaptime commented 1 year ago

Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data.

isnt this what im saying?

I don't think column ordering/tail ordering can be guaranteed unless ids are provided?

i dont think im following. If no ids are provided, then the ordering of the columns is the range index of the rows? (if no ids are provided, then the index is the id?).

if we always know that both rows and colums of sparse are ordered in the canonical sorting given by the rows of the input, isnt that ideal?

knaaptime commented 1 year ago

if we always sort both levels of the adjlist by the order given in the input data, then its in the same conceptual structure as a sparse matrix with both rows and cols arranged i through j. That way, when you pivot from table to sparse everything is in the right place without needing to specify row/col labels

martinfleis commented 1 year ago

Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data.

If we create _id2i on __init__, then the adjacency can be ordered as random as it can get. sparse does not depend directly on the order of adjacency but on the _id2i mapping. The only other thing apart from unique_ids that is used to create _id2i that currently assumes some sorting is to_W which can be refactored. Then with stuff like cliques, we take the original index, create _id2i and then just simply dump cliques at the end of the adjacency, no sorting needed.

This means that we either expose this on a subset of constructors (like from_arrays) or always derive it from given data, posing some assumptions on how the input data should look like.

knaaptime commented 1 year ago

right we can always create id2i from the input data, so the adjlist should always respect that order. In the current structure, the adjlist is sometimes random, but sparse is always ordered according to id2i. I think thats inconsistent and its better to just say id2i is the sort order. And i think its much easier to rely on pandas indices for handling indexing rather than rolling from scratch

we're not posing any assumptions on the input data on my implementation, we're just saying the adjlist order should match the sparse order

knaaptime commented 1 year ago

sparse does not depend directly on the order of adjacency but on the _id2i mapping.

so, since id2i is static because it needs to match the input order, it means sparse is static, so why is adjlist arbitrary? That just creates one more index we need to map between. Lets just use a single index, since its the only one that matters (which was the original idea of using a sparse multiindex series)

martinfleis commented 1 year ago

So you would enforce sorting of the adjacency and drop id2i, since we can infer it from the order?

knaaptime commented 1 year ago

sparse does not depend directly on the order of adjacency but on the _id2i mapping.

yeah, i think this is my point

So you would enforce sorting of the adjacency and drop id2i, since we can infer it from the order?

i think so, but maybe we stash it for convenience. I think id2i forces us to consider that there is an inherent order to the data

The whole idea of using a data frame as the central structure is to avoid doing the mapping and indexing ourselves. Why in the world would we let the adjlist be random when we're using the indices to provide structure? if id2i is structured, and sparse is structured, but adjlist is random (at the neighbor level), then how is that the 'canonical' representation of the graph? Lets just get it into the correct shape once, and let adjacency be the actual central construct.

The only other thing apart from unique_ids that is used to create _id2i that currently assumes some sorting is to_W which can be refactored.

and i think maybe this is the crux of it. We're still struggling with whether the graph is ordered or not. Since the ordering of the rows matters (needs to match input order for things like lag), then the graph is ordered at one level. If we are forced to recognize that order, then lets actually recognize it. We already have a formally-ordered output in sparse--and that order matters, so it's not true that to_W is the only thing that depends on ordering...

Since formal order matters in sparse, why allow an arbitrarily-ordered one in adjacency?. I think our cognitive dissonance is showing if we say order matters for one representation, but the 'canonical data structure' at the center of the graph is arbitrary.

martinfleis commented 1 year ago

I think that this boils down to what "no sorting" means. It can mean "no sorting is expected from adjacency" or "no sorting happens during internal operations". The former requires id2i and sorting when casting to sparse, the latter requires sorting on initialisation. Neither is fully "no sorting" but the latter may be preferable.

martinfleis commented 1 year ago

Also, sorting on init can not be just sort_values(['focal', 'neighbor']) but sort using the original order of the index of geometries.

knaaptime commented 1 year ago

I think that this boils down to what "no sorting" means. It can mean "no sorting is expected from adjacency" or "no sorting happens during internal operations". The former requires id2i and sorting when casting to sparse, the latter requires sorting on initialisation. Neither is fully "no sorting" but the latter may be preferable.

yes, well said. I think we're sorting no matter what, so 'explicit is better than implicit' and i'd prefer to do it on init (and say we don't sort after)

Also, sorting on init can not be just sort_values(['focal', 'neighbor']) but sort using the original order of the index of geometries.

so on init, i would do something like this. Set the index as ['focal','neighbor'], then reindex both levels by geometry index and reset the index

...and luckily, this is what raised the issue 😁. So instead of having this happen only in from_adjacency, i'd have it in the init

martinfleis commented 1 year ago

One thing this doesn't solve is the existence of id2i, because for any other index than RangeIndex(0, n), we still need, even on the fly, mapping from ids to integers when going to sparse. That trick with pandas sparse you used before is applicable only for sparse.matrix, not sparse.array, so I'm not sure we want to depend on that for now.

knaaptime commented 1 year ago

ah snap. bummer. but i think thats ok... lets keep id2i as a private attribute and use that as our 'reindexer'?

so, without changing much code: id2i is the order of the graph, so all representations can be expected to follow that structure. Since i was using a multiindexed series instead of a dataframe, the first level of the index defines the order of the graph

martinfleis commented 1 year ago

Probably It is also used as inverse i2id in stuff that is computed on sparse, like asymmetry. We could do them on the fly based on the sorted index but then a private cached property as we have now is probably a better solution.