pysal / libpysal

Core components of Python Spatial Analysis Library
http://pysal.org/libpysal

remap_ids significantly slows down for weights with a large number of neighbors #433

Open barguzin opened 3 years ago

barguzin commented 3 years ago

I am working with CA census tracts, but any shapefile with more than 2000 rows would suffice. The problem is with the libpysal.weights.util.block_weights function, which appears to run indefinitely once the dataframe exceeds a certain size. I first tried running the function on my laptop (see specs above). My first guess was that it could be related to RAM or CPU usage, so I also tried running it on a server (256GB RAM, 32 CPU cores, CentOS). In both cases the computer seems to go into an infinite computation loop at around 1500-1600 rows.

Here is the reproducible code demo:

wget 'https://www2.census.gov/geo/tiger/TIGER2019/TRACT/tl_2019_06_tract.zip'; unzip tl_2019_06_tract.zip -d tracts

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import geopandas as gpd 
import libpysal
import os

# assign each row a random class label in [0, n)
def get_random_classes(n, var, dataframe):
    '''
    Add a column named `var` to `dataframe` (modified in place),
    filled with random integer class labels in [0, n).
    '''
    dataframe[var] = np.random.randint(n, size=dataframe.shape[0])

tracts = gpd.read_file('tracts/tl_2019_06_tract.shp')
print(tracts.shape)

get_random_classes(n = 3, var = 'clusters', dataframe = tracts)

# create spatial weights matrix
# W = libpysal.weights.Queen.from_dataframe(tracts)

sam1 = tracts.sample(n=2000) # change sample size here

w_bl_sam1 = libpysal.weights.util.block_weights(
    sam1['clusters'].values, 
    ids=sam1.index
)

martinfleis commented 3 years ago

Thanks for the report!

Yeah, I can confirm the behaviour. The issue is not in block_weights itself but in remap_ids, which is called under the hood to map ids=sam1.index onto the weights. This loop becomes dead slow in a case like this, where there are a lot of neighbours per observation.

https://github.com/pysal/libpysal/blob/2cc0c7c87467f03d9bc356b217398535390a952e/libpysal/weights/weights.py#L754-L760

The script eventually finishes, so it is not an infinite loop, but it is super slow.
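
To get a feel for how many neighbour entries such a loop has to touch, here is a rough, standard-library-only sketch. It assumes block weights make every pair of observations with the same class label neighbours of each other; `count_block_links` is a hypothetical helper written for this illustration, not part of libpysal.

```python
from collections import Counter
import random


def count_block_links(labels):
    """Count directed neighbour links when all observations sharing a
    label are neighbours of each other (hypothetical helper)."""
    sizes = Counter(labels)
    # A block with c members contributes c * (c - 1) directed links.
    return sum(size * (size - 1) for size in sizes.values())


random.seed(0)
# Mirrors the report: 2000 sampled rows, 3 random classes.
labels = [random.randrange(3) for _ in range(2000)]
links = count_block_links(labels)
print("directed links:", links)
print("average neighbours per observation:", links / len(labels))
```

With 2000 rows split over 3 classes there are at least about 1.3 million directed links (the minimum occurs when the classes are equally sized), so any per-neighbour cost in pure Python is multiplied by a very large number.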

In this specific case, I'd try to avoid passing ids=sam1.index and rely on positional indexing.
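
As a rough illustration of what relying on positional indexing could look like, the sketch below keeps neighbours keyed by position (0..n-1) and stores the original index labels in a separate list that is consulted only when a label is actually needed. It uses only the standard library; `block_neighbors_positional`, `index_labels`, and the toy values are assumptions made for this example and are not libpysal API. In libpysal terms, the suggestion amounts to calling block_weights with only the class array and leaving ids at its default.

```python
from collections import defaultdict


def block_neighbors_positional(labels):
    """Map each position to the positions of all other observations
    that share its label (hypothetical helper, not libpysal API)."""
    members = defaultdict(list)
    for pos, label in enumerate(labels):
        members[label].append(pos)
    return {
        pos: [other for other in members[label] if other != pos]
        for pos, label in enumerate(labels)
    }


# Toy stand-ins for list(sam1.index) and sam1['clusters'].values.
index_labels = [4021, 17, 903, 56]
clusters = [0, 1, 0, 0]

neighbors = block_neighbors_positional(clusters)

# Translate positions to original labels only where needed,
# instead of remapping every id in the weights object.
first_row_neighbor_labels = [index_labels[p] for p in neighbors[0]]
print(first_row_neighbor_labels)  # [903, 56]
```

The trade-off is that the weights no longer carry the dataframe index, so the order of rows must stay fixed between building the weights and looking up labels.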