Closed ljwolf closed 4 months ago
Things we have right now: from_arrays, from_sparse, from_old_w (for ease of testing), from_weights_dict, from_dicts (all channeled through from_arrays), and from_contiguity.

What needs to be done:
Where we're blocked by scipy.sparse development
My suggestion would be to focus on testing right now. The code base is becoming large, and it should be thoroughly covered before we start moving forward.
I also think that we should start making PRs into the geography branch rather than committing directly, to ensure proper review.
I will try to finish scipy/scipy#18544, which should unblock higher order. We will need to also identify why connected components fails in the csgraph side, but that should be doable tomorrow.
If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?
csgraph does seem to generally require matrix for now.
If necessary, we can build on top of the matrix classes for now and migrate once the arrays are feature complete?
Maybe as on-the-fly conversion when necessary?
We are currently not sure if an "adaptive" bandwidth parameter is allowed in the kernel() function.

The kernel() function may also need to implement a "threshold" option that lets you keep only points closer than the threshold as pairs.
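A minimal numpy-only sketch of what such a threshold (and a per-observation "adaptive" bandwidth) could look like; kernel_pairs and its signature are hypothetical, not the libpysal API:

```python
import numpy as np

def kernel_pairs(coordinates, bandwidth, threshold=None):
    """Hypothetical sketch: (i, j, weight) triplets for a gaussian kernel graph."""
    coordinates = np.asarray(coordinates, dtype=float)
    n = len(coordinates)
    # pairwise euclidean distances
    diff = coordinates[:, None, :] - coordinates[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    keep = ~np.eye(n, dtype=bool)  # drop self-pairs
    if threshold is not None:
        keep &= d < threshold  # keep only pairs closer than the threshold
    i, j = np.nonzero(keep)
    # "adaptive" bandwidth: allow one bandwidth per focal observation
    bw = np.broadcast_to(np.asarray(bandwidth, dtype=float), (n,))
    w = np.exp(-0.5 * (d[i, j] / bw[i]) ** 2)  # gaussian kernel
    return i, j, w
```

With a scalar bandwidth this behaves like a fixed kernel; passing an array of length n makes it adaptive per focal point.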
Not going to be able to complete scipy/scipy#18544 today, but will aim to next sprint.
TLDR:
I'm trying to wrap my head around the idea of input data structures versus desired output as first-class citizens for constructing $W$s
so, after our call, i think the idea is to always treat the input data structure as primary. Right now, we have a from_contiguity constructor, but i'm sure I heard Levi say we won't have such a thing in the new structure (and analogously we won't have from_kernel). So, while we need to "hook kernel constructors to the base class", that doesn't mean creating a from_kernel class method that's analogous to from_contiguity. Instead, the idea is that from_contiguity will become private, and we will create a larger class method that's from_dataframe?

So like, currently, the kernel functions take an argument called coordinates which gets passed to a geometry validator. That means the input must be a geodataframe, not, say, a distance matrix. So even though we want to allow distance=precomputed, that won't work because L109 is forcing an array of geoms.

so if we want to create a kernel-based W, and we are treating input data structure as first class, then i need to do something like W.from_dataframe(**kernel_args?), because we want to avoid W.from_kernel(**input_data_args?)? if that's the case, then the primary purpose of W.from_dataframe() is to dispatch to helper functions that generate the correct data structure for different types of weights? Then it will end up subsuming from_contiguity, and to create a queen weight, i'd do something like W.from_dataframe(gdf, 'contiguity', queen=True)? Conversely, if i needed a kernel from precomputed distances i'd do something like W.from_csc(distance_matrix, 'kernel', kernel_function='gaussian')?, whereas if i needed it built on the fly with euclidean distances i'd do W.from_dataframe(gdf, 'kernel', kernel_function='gaussian'). This all applies to triangulation as well.
The alternative would be to leave desired output as first class, which would mean leaving from_contiguity intact, then adding analogous from_kernel, from_triangulation, etc.
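For illustration, the dispatch idea being debated could look roughly like this; W, from_dataframe, and the private builders here are hypothetical stand-ins, not actual libpysal code:

```python
# Hypothetical sketch of the "input data structure as first class" design.
class W:
    def __init__(self, adjacency):
        self.adjacency = adjacency  # e.g. a list of (focal, neighbor, weight)

    @classmethod
    def from_dataframe(cls, gdf, kind, **kwargs):
        # dispatch on the requested weight type; each private builder knows
        # how to turn the dataframe into an adjacency for that type
        builders = {
            "contiguity": cls._from_contiguity,
            "kernel": cls._from_kernel,
        }
        if kind not in builders:
            raise ValueError(f"unknown weight type: {kind!r}")
        return builders[kind](gdf, **kwargs)

    @classmethod
    def _from_contiguity(cls, gdf, queen=True):
        # stub: would call the (private) contiguity builder on gdf.geometry
        return cls(adjacency=[])

    @classmethod
    def _from_kernel(cls, gdf, kernel_function="gaussian"):
        # stub: would compute distances from gdf.geometry, then apply the kernel
        return cls(adjacency=[])
```

The cost of this shape is visible in the stubs: from_dataframe has to accept and forward the union of every subtype's keywords.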
I did not recall that we had implemented some of them already. That's my misunderstanding. Sorry!
If from_contiguity is already there, run with that and add them for now.
From a design angle, I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type? I hoped to avoid typing like this in favor of the construction functions outputting plain W objects, since we get into tricky inheritance consistency issues (like, is a kernel-weighted voronoi an instance of rook, kernel, Delaunay, or all?). But, we may need to chat about this next sprint?
From a design angle, I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type?
yeah, completely agree. You don't do from_kernel, because your input data isn't "a kernel".
so the issue is we want a generic W class, and it should be instantiated explicitly from different data structures. But there are also genuinely different subtypes of W, as they mean different things and require different computations to create them. Before, we had subclasses for each type of W, and they inherited the same constructors for different input formats. I agree we want to avoid weird inheritance, and subclassing W more generally, but then the only way i can think to handle the combination of w_type * data_structure is to slightly overload the arguments for data-structure-based class methods, like
W.from_dataframe(gdf, 'contiguity', queen=True)
that gets messy because it needs to accept all the kwargs for constructing all subtypes of W from that data structure, but it's conceptually straightforward (and the converse of our previous design, where w-subtype was first-class and inherited constructors for the shared input data types)
or do we take a generic w from input W.from_dataframe(gdf) and define a conversion to_kernel, to_voronoi, etc?
Maybe “from_dataframe” should be the constructor for an arbitrary adjacency table?
From: eli knaap. Sent: Thursday, July 20, 2023 6:12:03 PM. Subject: Re: [pysal/libpysal] Weights sprint planning (Issue #534)
or do we take a generic w from input W.from_dataframe(gdf) and define a conversion to_kernel, to_voronoi, etc?
@ljwolf question on the signature of delaunay. Should the first arg be an array or a geoseries/gdf? https://github.com/pysal/libpysal/blob/88fb4b618840c36beb3dfff1e033a6de515733c9/libpysal/weights/experimental/_triangulation.py#L25-L36

as later on, geoms is passed for validation. Not sure if the docs are to be updated or the check requires some preprocessing, since geoms is undefined as currently implemented. This made me think the first argument is intended to be something other than documented at present?
I don't think having one from_dataframe method is a good idea. It would have a ton of conditional keywords and would be messy and hard to understand.

I think that constructors like from_triangulation could easily accept both GeoPandas objects and an array of coords.
or do we take a generic w from input W.from_dataframe(gdf) and define a conversion to_kernel, to_voronoi, etc?
No. Once you create W you have the adjacency. From there, it shouldn't matter if that table was created based on contiguity, kernel or using manually encoded weights. You don't have any information that could be used to do conversion to voronoi at that point and you should not have imho.
I'm not sure I properly understand the problem though.
@ljwolf question on the signature of delaunay. Should the first arg be an array or a geoseries/gdf?

The input should be able to be either an (n, 2) array of coordinates, a geoseries of POINT, or a numpy array with POINT shapely geometries. The input coordinates should be validated using the shared helper, and the output of that is used for the computation. So, everywhere we use _validate_geometry_input() in _triangulation, for example, geoms should be coordinates.
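A rough stand-in for the plain-array part of that validation logic (the real helper is _validate_geometry_input; the shapely/geopandas branches are elided here):

```python
import numpy as np

def validate_coordinates(data):
    """Sketch only: coerce input to an (n, 2) float array of x, y coordinates."""
    arr = np.asarray(data)
    if arr.dtype == object:
        # likely shapely geometries; the real validator would extract the
        # point coordinates here (e.g. via shapely.get_coordinates)
        raise NotImplementedError("geometry extraction elided in this sketch")
    arr = arr.astype(float)
    if arr.ndim != 2 or arr.shape[1] != 2:
        raise ValueError(f"expected an (n, 2) array, got shape {arr.shape}")
    return arr
```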
My preferred option would be:

- the constructor functions (kernel(), vertex_set_intersection(), rook(), delaunay(), gabriel()) take GeoSeries or arrays of coordinates, and return plain W objects. So, calls look like my_w = kernel(my_gdf.geometry, k=3) or my_w = delaunay(my_coordinates).
- the W class itself uses from_* to read/convert existing objects into a W structure. So, calls look like:
  W.from_networkx(nx_graph)
  W.from_file('./columbus.gal')
  W.from_sparse(my_coo)
  W.from_adjacency(my_df, focal_col='here', neighbor_col='there', weight_col='strength')
  W.from_arrays(my_df.here, my_df.there, my_df.strength)
- No W.from_dataframe(), so that there's no confusion about what "dataframe" means.
So, instead of W.from_dataframe(gdf, 'contiguity', queen=True), we document/recommend w = weights.contiguity.rook(gdf).
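A toy sketch of the two idioms in this proposal, with stand-in names: builder functions compute the graph and return a plain W, while W.from_* classmethods only wrap an existing encoding:

```python
import numpy as np

class W:
    """Toy result container; stand-in for the proposed plain W."""
    def __init__(self, focal, neighbor, weight):
        self.focal = np.asarray(focal)
        self.neighbor = np.asarray(neighbor)
        self.weight = np.asarray(weight)

    @classmethod
    def from_arrays(cls, focal, neighbor, weight):
        # the encoding already exists in arrays; just wrap it
        return cls(focal, neighbor, weight)

def kernel(coordinates, bandwidth=1.0):
    # the encoding must be *computed*: distances -> kernel -> plain W
    coords = np.asarray(coordinates, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    i, j = np.nonzero(~np.eye(len(coords), dtype=bool))
    return W(i, j, np.exp(-0.5 * (d[i, j] / bandwidth) ** 2))
```

Both paths end in the same plain W, so no subtype hierarchy is needed.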
We document/recommend w = weights.contiguity.rook(gdf)
Mmm, not sure I am a fan of that. The solution with from_contiguity and similar is elegant in the sense that you always create W with some W.from_* constructor, and you don't have to fish in the documentation and different submodules for how to create one.

What is the issue with this API design? I'm still not getting the original problem that needs solving.
We document/recommend w = weights.contiguity.rook(gdf)

Mmm, not sure I am a fan of that. The solution with from_contiguity and similar is elegant in the sense that you always create W with some W.from_* constructor, and you don't have to fish in the documentation and different submodules for how to create one.

What about pulling them up into the higher-level namespace? weights.rook(gdf) / weights.kernel(gdf)?

What is the issue with this API design? I'm still not getting the original problem that needs solving.
From @ljwolf https://github.com/pysal/libpysal/issues/534#issuecomment-1644268455
I feel like "from_kernel()" and the like would be confusing. To me, that implies "kernel" is an input data format, not the structure of the output of the method. "to_kernel()" would imply that, but I think then "kernel" should probably be a type?
From @knaaptime https://github.com/pysal/libpysal/issues/534#issuecomment-1644278338
yeah, completely agree. You dont do from_kernel, because your input data isnt "a kernel".
i get it. So we have constructor functions rather than classmethod constructors when you need to generate the encoding, and from_* classmethods when you already have the encoding laid out in an existing data structure. That's a change of pace, but it makes sense to me.
What about pulling them up into the higher-level namespace? weights.rook(gdf)/weights.kernel(gdf)?

maybe this. I do think it would make things easier to do from libpysal import weights and be able to tab-complete to see there's a rook function, etc. That makes it a little clearer that rook is definitely a weights builder.
What is the issue with this API design? I'm still not getting the original problem with that that needs solving.
@martinfleis if we still need to "hook kernel (and triangulation) constructors up to the base class", what does that mean? What's the intended signature for instantiating a kernel-based W from a dataframe? What's the signature for instantiating a kernel-based W from a precomputed distance matrix (since we agree from_kernel doesn't make sense)?
kernel weight from a precomputed distance matrix
That should be weights.kernel(D, metric="precomputed", kernel="gaussian")
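A sketch of what metric="precomputed" could mean in practice: the first argument is treated as an already-computed distance matrix, so only the kernel step runs. kernel_from_distances is a hypothetical helper, not the proposed signature:

```python
import numpy as np

def kernel_from_distances(D, bandwidth=1.0, kernel="gaussian"):
    """Sketch: apply a kernel to a precomputed (n, n) distance matrix."""
    D = np.asarray(D, dtype=float)
    kernels = {
        "gaussian": lambda u: np.exp(-0.5 * u ** 2),
        "boxcar": lambda u: (u < 1).astype(float),
    }
    W = kernels[kernel](D / bandwidth)  # u = distance / bandwidth
    np.fill_diagonal(W, 0)  # no self-weights
    return W
```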
I originally assumed that it means from_kernel and from_triangulation methods. But if you don't think that is a good idea, let's figure out a better option.

I suggest we try to ensure that the internal functions that create those adjacency tables are finished first, and then we can discuss the public API during a call on Friday during the next sprint? (Note that I have only my phone with me until Sunday, so I haven't checked the actual state of the code.)
This is what I had in mind.
@classmethod
def from_kernel(
    cls,
    data,
    bandwidth=None,
    metric="euclidean",
    kernel="gaussian",
    k=None,
    ids=None,
    p=2,
):
"""Construct a kernel-based W from coordinates or precomputed distances.
Parameters
----------
data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
geometries over which to compute a kernel. If a geopandas object with Point
geometry is provided, the .geometry attribute is used. If a numpy.ndarray
with shapely geometry is used, then the coordinates are extracted and used.
If a numpy.ndarray of a shape (n, 2) is used, it is assumed to contain x, y
coordinates. If metric="precomputed", data is assumed to contain a
precomputed distance matrix.
bandwidth : float (default: None)
distance to use in the kernel computation. Should be on the same scale as
the input coordinates.
metric : string or callable (default: 'euclidean')
distance function to apply over the input coordinates. Supported options
depend on whether or not scikit-learn is installed. If so, then any
distance function supported by scikit-learn is supported here. Otherwise,
only euclidean, minkowski, and manhattan/cityblock distances are admitted.
kernel : string or callable (default: 'gaussian')
kernel function to apply over the distance matrix computed by `metric`.
The following kernels are supported:
- triangular
- parabolic
- gaussian
- bisquare
- cosine
- boxcar/discrete: all distances less than `bandwidth` are 1, and all
other distances are 0
- identity/None : do nothing, weight similarity based on raw distance
- callable : a user-defined function that takes the distance vector and
the bandwidth and returns the kernel: kernel(distances, bandwidth)
k : int (default: None)
number of nearest neighbors used to truncate the kernel. This is assumed
to be constant across samples. If None, no truncation is conducted.
ids : numpy.ndarray (default: None)
ids to use for each sample in coordinates. Generally, construction functions
that are accessed via W.from_kernel() will set this automatically from
the index of the input. Do not use this argument directly unless you intend
to set the indices separately from your input data. Otherwise, use
data.set_index(ids) to ensure ordering is respected. If None, then the index
from the input coordinates will be used.
p : int (default: 2)
parameter for minkowski metric, ignored if metric != "minkowski".
Returns
-------
W
libpysal.weights.experimental.W
"""
In theory this also supports DistanceBand and KNN, but it may be worth exposing those as separate constructors for the sake of user friendliness. We can pipe those through this internally.
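For reference, textbook forms of the kernels named in the docstring, with u = distance / bandwidth; the scaling constants here are the standard ones and may differ from whatever libpysal ends up implementing:

```python
import numpy as np

# Standard textbook kernel forms, evaluated on u = distance / bandwidth
# (distances are nonnegative, so u >= 0; supports truncate at u = 1).
def triangular(u):
    return np.clip(1 - u, 0, None)

def parabolic(u):  # a.k.a. Epanechnikov
    return 0.75 * np.clip(1 - u ** 2, 0, None)

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def bisquare(u):
    return (15 / 16) * np.clip(1 - u ** 2, 0, None) ** 2

def cosine(u):
    return np.where(u <= 1, (np.pi / 4) * np.cos(np.pi * u / 2), 0.0)

def boxcar(u):
    # all distances less than the bandwidth are 1, all others 0
    return (u < 1).astype(float)
```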
@classmethod
def from_triangulation(
    cls,
    data,
    method="delaunay",
    bandwidth=np.inf,
    kernel="boxcar",
    clip="extent",
    contiguity_type="rook",
):
"""Construct a W from a triangulation of input point locations.
Parameters
----------
data : numpy.ndarray, geopandas.GeoSeries, geopandas.GeoDataFrame
geometries containing locations to compute the
delaunay triangulation. If a geopandas object with Point
geometry is provided, the .geometry attribute is used. If a numpy.ndarray
with shapely geometry is used, then the coordinates are extracted and used.
If a numpy.ndarray of a shape (n, 2) is used, it is assumed to contain x, y
coordinates.
method : str, (default "delaunay")
method of extracting the weights from triangulation. Supports:
- delaunay
- gabriel
- relative_neighborhood
- voronoi
bandwidth : float, optional
distance to use in the kernel computation. Should be on the same scale as
the input coordinates, by default numpy.inf
kernel : str, optional
kernel function to use in order to weight the output graph. See
:meth:`W.from_kernel` for details. By default "boxcar"
clip : str (default: 'extent')
An overloaded option about how to clip the voronoi cells passed to
cg.voronoi_frames() when method="voronoi". Ignored otherwise.
Default is ``'extent'``. Options are as follows.
* ``'none'``/``None`` -- No clip is applied. Voronoi cells may be
arbitrarily larger than the source map. Note that this may lead to
cells that are many orders of magnitude larger in extent than the
original map. Not recommended.
* ``'bbox'``/``'extent'``/``'bounding box'`` -- Clip the voronoi cells to
the bounding box of the input points.
* ``'chull'``/``'convex hull'`` -- Clip the voronoi cells to the convex hull
of the input points.
* ``'ashape'``/``'ahull'`` -- Clip the voronoi cells to the tightest hull
that contains all points (e.g. the smallest alphashape, using
``libpysal.cg.alpha_shape_auto``).
* Polygon -- Clip to an arbitrary Polygon.
contiguity_type : str, optional
What kind of contiguity to apply to the voronoi diagram when method="voronoi".
Ignored otherwise. Supports "rook" and "queen", by default "rook".
Returns
-------
W
libpysal.weights.experimental.W
"""
@ljwolf I think that for voronoi, we don't need to expose rook and queen constructors and can go with the vertex method only, because the diagram is always topologically precise.
Was there any discussion on these last time that concluded that this is not an optimal API?
Was there any discussion on these last time that concluded that this is not an optimal API?
yeah, let's discuss the tradeoffs on friday. Given the initial layout in the experimental branch, what you sketched above was my original understanding too. But I think we decided the from_* nomenclature is a bit misleading. When you do from_kernel above, you pass the function a geodataframe, not a kernel. The same with triangulation; the first argument is either a geometry object or a list of coordinates, then an intermediate kernel/triangulation is constructed immediately before giving you a weights object. So it's not really weights object from_kernel, it's kernel from_dataframe to weights object.

I can see arguments both ways. It feels natural to do .from_kernel in line with from_contiguity, but the semantics of that aren't actually accurate.
Got it. Let's pick it up on Friday. We may keep the signature and just change the names to something more fitting eventually.
So, on the sprint call, we considered the following two APIs:

Option 1: W is self-contained. Most of what you need to do can be done with the W object alone, not requiring the weights namespace.
The W.build_...() idiom constructs a W under a known graph construction rule, such as:

W.build_rook(geodataframe.geometry) # calls weights._contiguity.rook(geodataframe.geometry)
W.build_kernel(coordinate_array) # calls weights._kernel.kernel(coordinate_array)

The W.from_...() idiom constructs a W from a specific data structure:

W.from_file("columbus.gal")
W.from_arrays(head, tail, weight)
W.from_sparse(my_coo_array)
To make good on this design decision, we should also implement a .higher_order(), a .lag(), etc. This is a pandas-style API, rather than expecting users to switch between functions in weights and methods on W. This allows for method-chaining, such as:

W.build_kernel(geoseries).transform('r').lag(geoseries.home_value)
Option 2: W is only a results container. Functions in weights operate on top of W objects and generally construct or mutate them.

We would have to implement a construct namespace, so that:

weights.construct.contiguity(geodataframe.geometry) # calls weights._contiguity.rook(geodataframe.geometry) by default
# or the constructor actually lives at `weights.construct.contiguity`
weights.construct.kernel(coordinate_array) # calls weights._kernel.kernel(coordinate_array)
An ingest (convert?) namespace takes a data structure and "ingests" it to a W:

weights.ingest.file("columbus.gal")
weights.ingest.arrays(head, tail, weight)
weights.ingest.sparse(my_coo_array)

and any of our functions in weights act on W objects to build (or mutate) W objects. This is a numpy-style API.
@knaaptime @martinfleis @ljwolf @rongboxu agree on the method strategy, option 1.
On the coincident points issue identified in Delaunay, we need to be consistent about how the new implementation deals with coincident points. Qhull natively implements a bunch of options, but we need to define a consistent way to deal with coincident points in KNN, too.
For handling coincident points, we discussed the following options:

- randomly permute/jitter the input points if there is a coincident point
- reduce the dataset to unique locations, construct the weight, and then introduce a "clique" into the graph connecting all coincident points together and also to other neighbours according to the construction rule
- detect that there is a coincident point and raise an error if that is the case
@knaaptime @sjsrey @martinfleis @TaylorOshan @rongboxu @ljwolf agree that the implementation should generally follow the "raise" strategy.
All weights constructors should take a "coincident" argument that can take three values:

- "raise" (default): raise an error when the coincident points would create an invalid output. This affects all Delaunay/Voronoi-type weights, and KNN when the number of duplicated points is greater than the requested k.
- "clique": compute the weights on unique locations in the data, then induce a "clique" into the graph, connecting all coincident points to one another and to the other values for that location.
- "jitter": add a very small random displacement (should use a fraction of the smallest pairwise distance) to all coordinates.

@sjsrey @knaaptime @ljwolf @martinfleis agreed to rename W to Graph and move the namespace.
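A sketch of what the three strategies could look like on raw coordinates; handle_coincident, its signature, and its return conventions are illustrative only, not the planned implementation:

```python
import numpy as np

def handle_coincident(coordinates, strategy="raise", rng=None):
    """Sketch of the raise / jitter / clique strategies for duplicate points."""
    coords = np.asarray(coordinates, dtype=float)
    uniq, inverse, counts = np.unique(
        coords, axis=0, return_inverse=True, return_counts=True
    )
    if not (counts > 1).any():
        return coords  # no coincident points, nothing to do
    if strategy == "raise":
        raise ValueError("coincident points found; use coincident='jitter' or 'clique'")
    if strategy == "jitter":
        rng = np.random.default_rng(rng)
        diff = coords[:, None, :] - coords[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1))
        scale = d[d > 0].min() / 1000  # fraction of smallest nonzero distance
        return coords + rng.uniform(-scale, scale, coords.shape)
    if strategy == "clique":
        # weights would be built on `uniq`; `inverse` maps each original
        # observation to its unique location so a clique can be induced after
        return uniq, inverse.reshape(-1)
    raise ValueError(f"unknown strategy: {strategy!r}")
```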
I've just realised: for islands, the nonzero property will count them as a nonzero, since self.sparse.nnz counts "explicit" zeros as nonzero. This might throw off calculations in join counts, since a self-join is always going to contribute to positive autocorrelation (in other cases, like Getis-Ord and Moran, we need it)...

We might want to think about using n_edges/n_nodes and then defining nonzero (the number of non-self loops) as self.n_edges - len(self.isolates)
n_edges is just len(self._adjacency), no?
yep
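The nnz pitfall is easy to demonstrate with scipy directly: explicitly stored zeros (e.g. an isolate encoded as a zero-weight self-loop) are counted by .nnz unless they are discounted. Toy example, not libpysal code:

```python
import numpy as np
from scipy import sparse

heads = np.array([0, 1, 2])
tails = np.array([1, 0, 2])
data = np.array([1.0, 1.0, 0.0])  # observation 2: isolate as explicit zero self-loop
g = sparse.coo_array((data, (heads, tails)), shape=(3, 3)).tocsr()

print(g.nnz)  # 3: the explicitly stored zero is counted as "nonzero"
n_edges = g.nnz
nonzero = n_edges - int((g.data == 0).sum())  # one way to discount stored zeros
print(nonzero)  # 2
```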
id ordering not respected when roundtripping through sparse
https://gist.github.com/knaaptime/d17b2d233c663aeec3e0e95487910560
@knaaptime regarding this roundtripping - the distance band with isolates or without isolates works fine. The Graphs are equal, the sorting within a single focal does not matter. Sparse is also equal based on both (original and roundtripped).
The issue with KNN has nothing to do with sorting or roundtripping but with duplicated point locations. KNN builder drops 4 observations. That will be partially solved by the application of #548 to KNN. I will also add length checks to raise when the ids != length of the sparse.
More generally on sorting - I believe that the only sorting that matters is the position of the first observation of each focal. We only need to ensure that the arrays are sorted in a way that ensures that focal.unique() is equal to the index of the original observations from which the Graph is being built. That ensures that any conversion depending on the order is correct, since it will either depend on _id2i based on .unique() or on factorize, resulting in the same.

Edit: This works under the assumption that every observation is present in focal, either with a neighbor or as a self-loop with 0.
More generally on sorting - I believe that the only sorting that matters is the position of the first observation of each focal. We only need to ensure that the arrays are sorted in a way that ensures that focal.unique() is equal to the index of original observations from which the Graph is being built.
exactly. We always need to respect the index of the original input, which means sometimes we are forced to sort. That's where i ended up in #513 too, so i just used a reindex on those ids at both levels, which is the same as keeping id2i around. Groupby ops will break that ordering, so we need to be able to reindex by the focal ordering arbitrarily (which is why i let sort_neighbors stick around). That way 'the position of the first observation of each focal' is always respected.

I was going directly from adj to sparse, so i had to keep track of the column indices as well, otherwise you hit the pivot issue.
I think that, insofar as we have used the term, "sorting" refers to re-ordering the weights into an arbitrary order with respect to the input data. Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data. I don't think column ordering/tail ordering can be guaranteed unless ids are provided?
Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data.
isn't this what i'm saying?
I don't think column ordering/tail ordering can be guaranteed unless ids are provided?
i don't think i'm following. If no ids are provided, then the ordering of the columns is the range index of the rows? (if no ids are provided, then the index is the id?)

if we always know that both rows and columns of sparse are ordered in the canonical sorting given by the rows of the input, isn't that ideal? if we always sort both levels of the adjlist by the order given in the input data, then it's in the same conceptual structure as a sparse matrix with both rows and cols arranged i through j. That way, when you pivot from table to sparse, everything is in the right place without needing to specify row/col labels.
Our goal of "no sorting" should be more clearly defined as: output sparse matrices should always conform to the order of the input data.
If we create _id2i on __init__, then the adjacency can be ordered as randomly as it can get. sparse does not depend directly on the order of adjacency but on the _id2i mapping. The only other thing, apart from unique_ids (which is used to create _id2i), that currently assumes some sorting is to_W, which can be refactored. Then with stuff like cliques, we take the original index, create _id2i, and then just dump the cliques at the end of the adjacency; no sorting needed.

This means that we either expose this on a subset of constructors (like from_arrays) or always derive it from the given data, posing some assumptions on how the input data should look.
right, we can always create id2i from the input data, so the adjlist should always respect that order. In the current structure, the adjlist is sometimes random, but sparse is always ordered according to id2i. I think that's inconsistent and it's better to just say id2i is the sort order. And i think it's much easier to rely on pandas indices for handling indexing rather than rolling it from scratch.

we're not posing any assumptions on the input data in my implementation, we're just saying the adjlist order should match the sparse order.

sparse does not depend directly on the order of adjacency but on the _id2i mapping.

so, since id2i is static because it needs to match the input order, it means sparse is static, so why is the adjlist arbitrary? That just creates one more index we need to map between. Let's just use a single index, since it's the only one that matters (which was the original idea of using a sparse multiindex series).
So you would enforce sorting of the adjacency and drop id2i, since we can infer it from the order?
sparse does not depend directly on the order of adjacency but on the _id2i mapping.
yeah, i think this is my point
So you would enforce sorting of the adjacency and drop id2i, since we can infer it from the order?
i think so, but maybe we stash it for convenience. I think id2i forces us to consider that there is an inherent order to the data
The whole idea of using a data frame as the central structure is to avoid doing the mapping and indexing ourselves. Why in the world would we let the adjlist be random when we're using the indices to provide structure? If id2i is structured, and sparse is structured, but the adjlist is random (at the neighbor level), then how is that the 'canonical' representation of the graph? Let's just get it into the correct shape once, and let adjacency be the actual central construct.
The only other thing apart from unique_ids that is used to create _id2i that currently assumes some sorting is to_W which can be refactored.
and i think maybe this is the crux of it. We're still struggling with whether the graph is ordered or not. Since the ordering of the rows matters (it needs to match the input order for things like lag), the graph is ordered at one level. If we are forced to recognize that order, then let's actually recognize it. We already have a formally-ordered output in sparse, and that order matters, so it's not true that to_W is the only thing that depends on ordering...

Since formal order matters in sparse, why allow an arbitrarily-ordered one in adjacency? I think our cognitive dissonance is showing if we say order matters for one representation, but the 'canonical data structure' at the center of the graph is arbitrary.
I think that this boils down to what "no sorting" means. It can mean "no sorting is expected from adjacency" or "no sorting happens during internal operations". The former requires id2i and sorting when casting to sparse, the latter requires sorting on initialisation. Neither is fully "no sorting" but the latter may be preferable.
Also, sorting on init cannot be just sort_values(['focal', 'neighbor']); it must sort using the original order of the index of the geometries.
I think that this boils down to what "no sorting" means. It can mean "no sorting is expected from adjacency" or "no sorting happens during internal operations". The former requires id2i and sorting when casting to sparse, the latter requires sorting on initialisation. Neither is fully "no sorting" but the latter may be preferable.
yes, well said. I think we're sorting no matter what, so 'explicit is better than implicit' and i'd prefer to do it on init (and say we don't sort after)
Also, sorting on init can not be just sort_values(['focal', 'neighbor']) but sort using the original order of the index of geometries.
so on init, i would do something like this: set the index as ['focal', 'neighbor'], then reindex both levels by the geometry index and reset the index.
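One way to sort by the original geometry order rather than lexically is to make both id columns ordered categoricals; this is a sketch of the idea, not the implementation:

```python
import pandas as pd

geom_index = ["c", "a", "b"]  # order of the input geometries
adjacency = pd.DataFrame(
    {
        "focal":    ["a", "b", "c", "a", "b"],
        "neighbor": ["b", "a", "a", "c", "c"],
        "weight":   [1, 1, 1, 1, 1],
    }
)
# encode both id columns with the *input* order, then sort on that order
order = pd.CategoricalDtype(geom_index, ordered=True)
adjacency = adjacency.astype({"focal": order, "neighbor": order}).sort_values(
    ["focal", "neighbor"], kind="stable"
)
print(list(adjacency["focal"]))  # ['c', 'a', 'a', 'b', 'b']
```

A plain sort_values would have put 'a' first; the categorical encoding makes 'c' (the first input geometry) sort first at both levels.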
...and luckily, this is what raised the issue 😁. So instead of having this happen only in from_adjacency, i'd have it in the init.
One thing this doesn't solve is the existence of id2i, because for any index other than RangeIndex(0, n), we still need, even on the fly, a mapping from ids to integers when going to sparse. That trick with pandas sparse you used before is applicable only to sparse.matrix, not sparse.array, so I'm not sure we want to depend on that for now.
ah snap. bummer. but i think that's ok... let's keep id2i as a private attribute and use it as our 'reindexer'?
so, without changing much code: id2i is the order of the graph, so all representations can be expected to follow that structure. Since i was using a multiindexed series instead of a dataframe, the first level of the index defines the order of the graph.
Probably. It is also used as the inverse i2id in stuff that is computed on sparse, like asymmetry. We could do them on the fly based on the sorted index, but then a private cached property, as we have now, is probably a better solution.
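A minimal sketch of the _id2i idea: a cached id-to-position mapping (with the index itself as the inverse i2id) used when dumping the adjacency to scipy sparse. Names and shapes here are illustrative:

```python
import numpy as np
import pandas as pd
from scipy import sparse

ids = pd.Index(["c", "a", "b"])                   # original input order
id2i = pd.Series(np.arange(len(ids)), index=ids)  # id -> integer position
i2id = ids                                        # position -> id (the inverse)

adjacency = pd.DataFrame(
    {"focal": ["c", "a", "b"], "neighbor": ["a", "b", "c"], "weight": [1.0, 2.0, 3.0]}
)
# map labels to positions, so sparse rows/cols follow the ids order
g = sparse.coo_array(
    (
        adjacency["weight"].to_numpy(),
        (id2i[adjacency["focal"]].to_numpy(), id2i[adjacency["neighbor"]].to_numpy()),
    ),
    shape=(len(ids), len(ids)),
)
# rows/columns follow the ids order (c, a, b) regardless of adjacency order
assert np.array_equal(g.toarray(), [[0, 1, 0], [0, 0, 2], [3, 0, 0]])
```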
I'm starting this thread to begin discussions about the upcoming weights sprint. The current experimental implementation is in the geographs branch, the weights/experimental module. The ROADMAP.md will need updating...