ENH: Graph IO to classic weights file formats

martinfleis commented 3 months ago

WIP and not very well tested (in the sense that I am not certain it is always 1:1 with the weights implementation).

So far, GAL. I am also planning to look at GWT. Is there anything else that is commonly used?

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 98.80952%, with 1 line in your changes missing coverage. Please review.

Project coverage is 85.0%. Comparing base (bcabdbc) to head (f014971). Report is 15 commits behind head on main.

Additional details and impacted files

[![Impacted file tree graph](https://app.codecov.io/gh/pysal/libpysal/pull/698/graphs/tree.svg?width=650&height=150&src=pr&token=wgnkG5Rj0J&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal)](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal)

```diff
@@           Coverage Diff           @@
##            main    #698     +/-   ##
=======================================
- Coverage   85.0%   85.0%   -0.0%
=======================================
  Files        141     145      +4
  Lines      15203   15361    +158
=======================================
+ Hits       12924   13055    +131
- Misses      2279    2306     +27
```

| [Files](https://app.codecov.io/gh/pysal/libpysal/pull/698?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal) | Coverage Δ | |
|---|---|---|
| [libpysal/\_\_init\_\_.py](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&filepath=libpysal%2F__init__.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal#diff-bGlicHlzYWwvX19pbml0X18ucHk=) | `100.0% <100.0%> (ø)` | |
| [libpysal/graph/\_\_init\_\_.py](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&filepath=libpysal%2Fgraph%2F__init__.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal#diff-bGlicHlzYWwvZ3JhcGgvX19pbml0X18ucHk=) | `100.0% <100.0%> (ø)` | |
| [libpysal/graph/\_contiguity.py](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&filepath=libpysal%2Fgraph%2F_contiguity.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal#diff-bGlicHlzYWwvZ3JhcGgvX2NvbnRpZ3VpdHkucHk=) | `98.9% <100.0%> (+<0.1%)` | :arrow_up: |
| [libpysal/graph/\_utils.py](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&filepath=libpysal%2Fgraph%2F_utils.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal#diff-bGlicHlzYWwvZ3JhcGgvX3V0aWxzLnB5) | `95.0% <100.0%> (+<0.1%)` | :arrow_up: |
| [libpysal/graph/base.py](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&filepath=libpysal%2Fgraph%2Fbase.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal#diff-bGlicHlzYWwvZ3JhcGgvYmFzZS5weQ==) | `97.0% <100.0%> (-1.0%)` | :arrow_down: |
| [libpysal/graph/io/\_gwt.py](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&filepath=libpysal%2Fgraph%2Fio%2F_gwt.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal#diff-bGlicHlzYWwvZ3JhcGgvaW8vX2d3dC5weQ==) | `100.0% <100.0%> (ø)` | |
| [libpysal/graph/io/\_parquet.py](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&filepath=libpysal%2Fgraph%2Fio%2F_parquet.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal#diff-bGlicHlzYWwvZ3JhcGgvaW8vX3BhcnF1ZXQucHk=) | `84.0% <ø> (ø)` | |
| [libpysal/graph/tests/test\_base.py](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&filepath=libpysal%2Fgraph%2Ftests%2Ftest_base.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal#diff-bGlicHlzYWwvZ3JhcGgvdGVzdHMvdGVzdF9iYXNlLnB5) | `100.0% <100.0%> (ø)` | |
| [libpysal/graph/io/\_gal.py](https://app.codecov.io/gh/pysal/libpysal/pull/698?src=pr&el=tree&filepath=libpysal%2Fgraph%2Fio%2F_gal.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal#diff-bGlicHlzYWwvZ3JhcGgvaW8vX2dhbC5weQ==) | `96.2% <96.2%> (ø)` | |

... and [2 files with indirect coverage changes](https://app.codecov.io/gh/pysal/libpysal/pull/698/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pysal)

martinfleis commented 3 months ago

Can someone with a bit of historical knowledge (@serge, @levi?) help me understand the treatment of headers here? GeoDa's User Guide from 2003 states for GAL that

> When a Key Variable is specified, that line contains four values: 0 (reserved for future use), the number of observations (100), the name of the shape file (SIDS) and the variable name for the Key Variable (FIPSNO). When sequence numbers are used to label the observations, the header line only contains the number of observations

Our existing GAL writing uses only the number of observations in the header, but it writes the actual IDs of observations, not the sequence numbers (which I interpret as the positional index, i.e. iloc).
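For reference, a minimal sketch of that behaviour, assuming the usual GAL body layout (one line with the observation ID and its neighbor count, followed by one line listing the neighbor IDs). The `write_gal` helper and the toy data are hypothetical and only illustrate the layout; this is not the libpysal code:

```python
import io


def write_gal(neighbors, buffer):
    """Write a {id: [neighbor ids]} mapping in a GAL-like layout (illustrative only)."""
    # Header: only the number of observations, no key variable information.
    buffer.write(f"{len(neighbors)}\n")
    for focal, neighs in neighbors.items():
        # One line with the actual observation ID and its neighbor count...
        buffer.write(f"{focal} {len(neighs)}\n")
        # ...followed by one line listing the actual neighbor IDs.
        buffer.write(" ".join(str(n) for n in neighs) + "\n")


buf = io.StringIO()
write_gal({"a": ["b"], "b": ["a", "c"], "c": ["b"]}, buf)
gal_text = buf.getvalue()
print(gal_text)
```

Note that the header says nothing about the IDs being `a`, `b`, `c` rather than `0`, `1`, `2`, which is exactly the ambiguity in question.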

With GWT, the definition is practically the same:

> When a Key Variable has been specified, the header line is as in Figure 126, for k-nearest neighbors of order 4 in the Columbus data set. It contains four items: 0 (for future use), the number of observations (49), the name of the shape file (COLUMBUS) and the Key Variable (POLYID). When no Key Variable is specified, but sequence numbers are used, the header line consists only of the number of observations.

But our GWT implementation does not write just the number of observations, as in the GAL case, but `0 n_obs Unknown Unknown`, and again it uses the actual indices.
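To make the two header variants concrete, here is a rough sketch of how a reader could tell them apart, based only on the User Guide text quoted above. `parse_header` and the returned dictionary keys are hypothetical names, not anything from libpysal or GeoDa:

```python
def parse_header(line):
    """Parse a GAL/GWT header line as described in the GeoDa User Guide (sketch)."""
    parts = line.split()
    if len(parts) == 1:
        # Sequence numbers label the observations: only n_obs is given.
        return {"n_obs": int(parts[0]), "shapefile": None, "key_variable": None}
    if len(parts) == 4:
        # Key Variable specified: 0 (reserved), n_obs, shape file name, key variable.
        _reserved, n_obs, shapefile, key_variable = parts
        return {"n_obs": int(n_obs), "shapefile": shapefile, "key_variable": key_variable}
    raise ValueError(f"Unexpected header with {len(parts)} items: {line!r}")


guide_gal = parse_header("0 100 SIDS FIPSNO")  # example from the guide
sequence_only = parse_header("49")  # header with n_obs only
current_gwt = parse_header("0 49 Unknown Unknown")  # what our GWT writer emits
```

Under this reading, a header with only `n_obs` would imply positional IDs, while the `Unknown Unknown` variant at least signals that the IDs come from some key variable.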

Given there is apparently no other documentation of these file formats, what are the correct headers?

If we should assume that, with a header consisting of the number of observations only, the IDs are positional indices, then what GAL is currently doing is wrong and we should do what GWT is doing in both. Though it makes little sense to write that we don't know something.

Any clue what the header should look like for maximum compatibility? Does anyone have spdep ready to check what they're doing?

sjsrey commented 2 months ago

Here is how spdep reads gwt files and gal files.

martinfleis commented 1 month ago

This should be ready for review now. Interestingly, it has uncovered a bug in our conversion from dicts to arrays (and adjacency), where the tooling was not able to process self-weights of 1 and always considered focal == neighbor as an isolate, giving it 0. That should be fixed now.
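A simplified illustration of the distinction, assuming isolates are encoded as a zero-weight self-link (as the description above implies). `to_adjacency` and the toy dicts are hypothetical and much simpler than the actual conversion code:

```python
def to_adjacency(neighbors, weights):
    """Flatten neighbors/weights dicts into (focal, neighbor, weight) rows (sketch)."""
    rows = []
    for focal, neighs in neighbors.items():
        if not neighs:
            # Isolate: no neighbors at all, encoded as a zero-weight self-link.
            rows.append((focal, focal, 0))
            continue
        for neighbor, weight in zip(neighs, weights[focal]):
            # A self-weight (focal == neighbor) keeps its own weight
            # and must not be mistaken for an isolate.
            rows.append((focal, neighbor, weight))
    return rows


neighbors = {"a": ["a", "b"], "b": ["a"], "c": []}
weights = {"a": [1, 1], "b": [1], "c": []}
adjacency = to_adjacency(neighbors, weights)
```

Here `a` has a genuine self-weight of 1, while `c` is an isolate; only the weight tells the two `focal == neighbor` rows apart.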

martinfleis commented 1 month ago

Also, regarding my questions above: it seems that spdep allows only integer (positional) IDs, and given that there is no documentation of either of those file formats whatsoever, I tried to ensure that the graph IO matches the output of the weights IO, so we are consistent with ourselves.

ljwolf commented 1 month ago

looks fine to me! good catch on the self-weight.

We need to be consistent about that, since esda will require an overhaul once it's done. Those statistics, especially the local ones, ignore self-weight effects.

pysal / libpysal

ENH: Graph IO to classic weights file formats #698
