`nodes._status()` -- `'new'` in `pandas.Series` always False due to evaluation

jGaboardi commented 1 week ago

Following #83 (and @martinfleis's explanation there) I want to triple check the correctness of the 'new' in ... condition in nodes._status(). Currently this will never evaluate to True since it is passed in as a pandas.Series:

import pandas, sgeop

# this should print `new` but does not
print(sgeop.nodes._status(pandas.Series(["one", "new", "two"])))
# "changed"

# due to how `pandas.Series` is evaluated
print("new" in pandas.Series(["one", "new", "two"]))
# False

# so we need to evaluate the values of the Series
print("new" in pandas.Series(["one", "new", "two"]).values)
# True

For the small test case of Apalachicola, FL the results of '_status' change as follows:

current main -- 'new' in x:

.["_status"].value_counts()
# original    549
# changed     125
# new          95
# Name: count, dtype: int64

updated condition -- 'new' in x.values:

.["_status"].value_counts()
# original    549
# new         209
# changed      11
# Name: count, dtype: int64

This of course affects the full scale FUA testing for '_status' labeling.

xref:

92

The plot here highlights the difference where:

thinnest black lines are the original input
medium lines are the updated conditions
thickest lines are the original conditions

_status_comparison

It does appear that the results from the original condition are the most correct where lines that are extended are not marked as entirely new. So I am pretty sure our current implementation is accurate, but we should be absolutely sure.

Data:

gh93_status_comparison.zip

martinfleis commented 1 week ago

How come the sum of value counts above is different?

jGaboardi commented 1 week ago

How come the sum of value counts above is different?

In [1]: (549 + 125 + 95) == (549 + 209 + 11)
Out[1]: True

In [2]: (549 + 125 + 95)
Out[2]: 769

In [3]: (549 + 209 + 11)
Out[3]: 769

They are equivalent, no?

martinfleis commented 1 week ago

I can't count....

jGaboardi commented 1 week ago

So if x is passed in as a list then the condition can potentially evaluate as True (otherwise the condition will always evaluate to False no matter if 'new' is present). But after combing through the code base I am not finding anywhere we are doing that. @martinfleis Any more thoughts off the top of your head here?

While this does not have a direct impact on geometries being generated, it does have a meaningful impact on geometry labeling, which then could affect the algorithm (from my understanding).

I'd say we need to figure this out before I continue with testing/refactor and transfer functionality to simplification.core.

martinfleis commented 1 week ago

No idea...

jGaboardi commented 1 week ago

But after combing through the code base I am not finding anywhere we are doing that.

Confirmed programmatically there are no instances of passing in a list

jGaboardi commented 1 week ago

So I guess we need to determine if our logic of "new" vs. "changed" is sound. Currently:

def _status(x: pd.Series) -> str:
    """Determine the status of edge line(s)."""
    if len(x) == 1:
        return x.iloc[0]
    if "new" in x:
        # This logic is here just to be safe. It will be hit if we create a new line
        # and in a subsequent step extend it which is not what normally happens.
        # All the new bits are caught likely by the first ``if``.
        return "new"
    return "changed"

... a linestring geometry is considered "new" only if len(x) == 1 and "new" is that value. Since the second conditional is flawed.

So:

Is a line "new" if partially new or entirely new?
I think "new" only if entirely new, elsewise "changed".
Therefore, I think our current labeling is correct and we can simply remove that final conditional

@martinfleis @anastassiavybornova Do yall concur with this logic?

martinfleis commented 1 week ago

I think that's right, yes.

uscuni / sgeop

`nodes._status()` -- `'new'` in `pandas.Series` always False due to evaluation #93

92