oturns / geosnap

The Geospatial Neighborhood Analysis Package
https://oturns.github.io/geosnap-guide
BSD 3-Clause "New" or "Revised" License

Harmonize returning nan for intensive variables #249

Closed sjsrey closed 3 years ago

sjsrey commented 3 years ago
import geosnap
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 17133.59entries/s]
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 7020.93entries/s]
geosnap.__version__
'0.3.2'
sd = geosnap.Community.from_census(county_fips='06073')
/home/serge/anaconda3/envs/geosnapdev/lib/python3.7/site-packages/pyproj/crs/crs.py:53: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  return _prepare_from_string(" ".join(pjargs))
sd.gdf.head()
geoid n_mexican_pop n_cuban_pop n_puerto_rican_pop n_total_housing_units n_vacant_housing_units n_occupied_housing_units n_owner_occupied_housing_units n_renter_occupied_housing_units n_white_persons ... p_irish_born_pop p_italian_born_pop p_poverty_rate_children p_poverty_rate_hispanic p_russian_born_pop p_scandanavian_born_pop p_scandanavian_pop n_total_pop_sample p_female_labor_force p_black_persons
5188 06073019000 5094.0 2.0 5.0 2326.0 184.0 2142.0 1769.0 373.0 4898911.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5189 06073018700 30479.0 72.0 611.0 4819.0 50.0 4769.0 250.0 4519.0 215361577.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5205 06073019101 2497.0 0.0 9.0 1497.0 154.0 1343.0 858.0 485.0 16221375.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5206 06073020901 4839.0 7.0 5.0 3042.0 995.0 2047.0 1430.0 617.0 4416241.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5207 06073021000 2337.0 0.0 15.0 2457.0 1154.0 1303.0 979.0 324.0 2302215.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 195 columns

extensive = ['n_total_pop', 'n_total_housing_units', 'n_vacant_housing_units']
intensive = ['median_household_income', 'p_poverty_rate']
sd.gdf['median_household_income'] = sd.gdf['median_household_income'].fillna(0)
sd.gdf['median_household_income'].isnull().sum()
0
sd.gdf['p_poverty_rate'] = sd.gdf['p_poverty_rate'].fillna(0)
sd.gdf['p_poverty_rate'].isnull().sum()
0
sd_2010 = sd.harmonize(2010,extensive_variables=extensive,
                      intensive_variables=intensive)
/home/serge/anaconda3/envs/geosnapdev/lib/python3.7/site-packages/tobler/area_weighted/area_weighted.py:249: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
/home/serge/anaconda3/envs/geosnapdev/lib/python3.7/site-packages/tobler/area_weighted/area_weighted.py:249: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
/home/serge/anaconda3/envs/geosnapdev/lib/python3.7/site-packages/tobler/area_weighted/area_weighted.py:249: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
sd_2010.gdf.head()
geoid geometry median_household_income n_total_housing_units n_total_pop n_vacant_housing_units p_poverty_rate year
0 06073014901 POLYGON ((-117.01957 32.76373, -117.01562 32.7... NaN 1876.000000 4156.000000 131.00000 NaN 2010
1 06073000300 POLYGON ((-117.16864 32.74897, -117.16602 32.7... NaN 3046.000000 4629.000000 397.00000 NaN 2010
2 06073000800 POLYGON ((-117.14632 32.74842, -117.14250 32.7... NaN 2702.000000 3964.000000 253.00000 NaN 2010
3 06073002201 POLYGON ((-117.11577 32.75522, -117.11362 32.7... NaN 1321.000000 3989.000000 76.00000 NaN 2010
4 06073018509 POLYGON ((-117.37213 33.20012, -117.36902 33.2... NaN 1700.999899 5325.999683 171.99999 NaN 2010
knaaptime commented 3 years ago

that's probably related to this recent fix in tobler. I'll investigate

sjsrey commented 3 years ago

Tobler seems ok. When I use the 1990 and 2000 dataframes, the interpolation works for the intensive variables (with tobler directly, not through geosnap).

image

knaaptime commented 3 years ago

i think this should be resolved with the newest fix to tobler, but i need to double check

knaaptime commented 3 years ago

i can confirm this is resolved with the latest dev version of tobler

sjsrey commented 3 years ago
import geosnap
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 14779.08entries/s]
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 18657.94entries/s]
/usr/local/anaconda3/envs/pysal/lib/python3.7/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
geosnap.__version__
'0.3.2'
sd = geosnap.Community.from_census(county_fips='06073')
/usr/local/anaconda3/envs/pysal/lib/python3.7/site-packages/pyproj/crs/crs.py:53: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  return _prepare_from_string(" ".join(pjargs))
import tobler
tobler.__version__
'0.3.1'
from tobler.area_weighted import area_interpolate
extensive = ['n_total_pop', 'n_total_housing_units', 'n_vacant_housing_units', 'n_black_persons', 'n_hispanic_persons' ]
intensive = ['median_household_income']
gdfs = [sd.gdf[sd.gdf.year==year] for year in [1990,2000,2010]]
sd1990, sd2000, sd2010 = gdfs
extensive = ["n_total_pop"]
intensive = ['median_household_income']
sd19902010 = area_interpolate(sd1990, sd2010, extensive_variables=extensive,
                intensive_variables=intensive)
/Users/serge/Dropbox/p/pysal/src/subpackages/tobler/tobler/area_weighted/area_weighted.py:253: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
sd19902010.shape
(627, 3)
sd1990.shape
(438, 195)
sd2010.shape
(627, 195)
sd19902010.head()
n_total_pop median_household_income geometry
0 3780.045461 30027.580506 POLYGON ((-117.01957 32.76373, -117.01562 32.7...
1 4109.068579 27928.654437 POLYGON ((-117.16864 32.74897, -117.16602 32.7...
2 3778.913865 25588.558303 POLYGON ((-117.14632 32.74842, -117.14250 32.7...
3 3127.204323 16784.034521 POLYGON ((-117.11577 32.75522, -117.11362 32.7...
4 4431.874144 28626.872160 POLYGON ((-117.37213 33.20012, -117.36902 33.2...
sd20002010 = area_interpolate(sd2000, sd2010, extensive_variables=extensive,
                intensive_variables=intensive)
/Users/serge/Dropbox/p/pysal/src/subpackages/tobler/tobler/area_weighted/area_weighted.py:253: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
sd20002010.head()
n_total_pop median_household_income geometry
0 3882.981322 40175.734975 POLYGON ((-117.01957 32.76373, -117.01562 32.7...
1 4258.772452 37318.052239 POLYGON ((-117.16864 32.74897, -117.16602 32.7...
2 4039.971454 35394.196027 POLYGON ((-117.14632 32.74842, -117.14250 32.7...
3 3873.042164 20704.730669 POLYGON ((-117.11577 32.75522, -117.11362 32.7...
4 5899.830878 36428.660886 POLYGON ((-117.37213 33.20012, -117.36902 33.2...
sd20002010.n_total_pop.sum()
2817686.995202498
sd2000.n_total_pop.sum()
2817687.0
# now with geosnap
sd_2010_gs = sd.harmonize(2010, extensive_variables=extensive,
                         intensive_variables=intensive)
/Users/serge/Dropbox/p/pysal/src/subpackages/tobler/tobler/area_weighted/area_weighted.py:253: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
/Users/serge/Dropbox/p/pysal/src/subpackages/tobler/tobler/area_weighted/area_weighted.py:253: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
/Users/serge/Dropbox/p/pysal/src/subpackages/tobler/tobler/area_weighted/area_weighted.py:253: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
/Users/serge/Dropbox/p/pysal/src/subpackages/tobler/tobler/util/util.py:28: UserWarning: nan values in variable: median_household_income, replacing with 0
  warn(f"nan values in variable: {column}, replacing with 0")
sd_2010_gs.gdf.head()
geoid geometry median_household_income n_total_pop year
0 06073014901 POLYGON ((-117.01957 32.76373, -117.01562 32.7... NaN 4156.000000 2010
1 06073000300 POLYGON ((-117.16864 32.74897, -117.16602 32.7... NaN 4629.000000 2010
2 06073000800 POLYGON ((-117.14632 32.74842, -117.14250 32.7... NaN 3964.000000 2010
3 06073002201 POLYGON ((-117.11577 32.75522, -117.11362 32.7... NaN 3989.000000 2010
4 06073018509 POLYGON ((-117.37213 33.20012, -117.36902 33.2... NaN 5325.999683 2010
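
For reference, the difference between extensive and intensive variables in area-weighted interpolation can be sketched with a toy overlap table. This is a minimal illustration with made-up areas and values, not tobler's actual implementation; the function names and all numbers are hypothetical.

```python
# Toy area-weighted interpolation: 2 source zones -> 3 target zones.
# overlap[i][j] = intersection area of source i and target j (made-up numbers).
overlap = [
    [2.0, 2.0, 0.0],  # source 0 (area 4) overlaps targets 0 and 1
    [0.0, 1.0, 3.0],  # source 1 (area 4) overlaps targets 1 and 2
]


def interpolate_extensive(values, overlap):
    """Split each source total across targets by share of the SOURCE area."""
    out = [0.0] * len(overlap[0])
    for i, row in enumerate(overlap):
        source_area = sum(row)
        for j, area in enumerate(row):
            out[j] += values[i] * area / source_area
    return out


def interpolate_intensive(values, overlap):
    """Average source values in each target, weighted by share of the TARGET area."""
    out = []
    for j in range(len(overlap[0])):
        target_area = sum(row[j] for row in overlap)
        out.append(
            sum(values[i] * overlap[i][j] / target_area for i in range(len(overlap)))
        )
    return out


population = [1000.0, 2000.0]  # extensive: counts
income = [30000.0, 50000.0]    # intensive: medians, rates

pop_t = interpolate_extensive(population, overlap)
inc_t = interpolate_intensive(income, overlap)
print(pop_t, sum(pop_t))
print(inc_t)
```

The extensive result keeps the source total, which is the same property checked above by comparing `sd20002010.n_total_pop.sum()` with `sd2000.n_total_pop.sum()`. The intensive result is a weighted average, so it always stays within the range of the source values.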
knaaptime commented 3 years ago

I can't reproduce this locally. With the latest development version of tobler (master on pysal/tobler) installed in my geosnap environment, I get the following:

import geosnap
/Users/knaaptime/Dropbox/projects/geosnap/geosnap/_data.py:123: UserWarning: Unable to locate local census data. Streaming instead.
If you plan to use census data repeatedly you can store it locally with the io.store_census function for better performance
  "Unable to locate local census data. Streaming instead.\n"
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 7327.58entries/s]
Loading manifest: 100%|██████████| 5/5 [00:00<00:00, 5726.79entries/s]
sd = geosnap.Community.from_census(county_fips='06073')
/Users/knaaptime/anaconda3/envs/geosnap/lib/python3.7/site-packages/pyproj/crs/crs.py:53: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  return _prepare_from_string(" ".join(pjargs))
extensive = ['n_total_pop', 'n_total_housing_units', 'n_vacant_housing_units']
intensive = ['median_household_income', 'p_poverty_rate']
sd_2010 = sd.harmonize(2010,extensive_variables=extensive,
                      intensive_variables=intensive)
/Users/knaaptime/anaconda3/envs/geosnap/lib/python3.7/site-packages/tobler-0.3.1-py3.7.egg/tobler/area_weighted/area_weighted.py:253: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
/Users/knaaptime/anaconda3/envs/geosnap/lib/python3.7/site-packages/tobler-0.3.1-py3.7.egg/tobler/util/util.py:28: UserWarning: nan values in variable: p_poverty_rate, replacing with 0
  warn(f"nan values in variable: {column}, replacing with 0")
/Users/knaaptime/anaconda3/envs/geosnap/lib/python3.7/site-packages/tobler-0.3.1-py3.7.egg/tobler/area_weighted/area_weighted.py:253: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
/Users/knaaptime/anaconda3/envs/geosnap/lib/python3.7/site-packages/tobler-0.3.1-py3.7.egg/tobler/util/util.py:28: UserWarning: nan values in variable: p_poverty_rate, replacing with 0
  warn(f"nan values in variable: {column}, replacing with 0")
/Users/knaaptime/anaconda3/envs/geosnap/lib/python3.7/site-packages/tobler-0.3.1-py3.7.egg/tobler/area_weighted/area_weighted.py:253: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  den = source_df["geometry"].area.values
/Users/knaaptime/anaconda3/envs/geosnap/lib/python3.7/site-packages/tobler-0.3.1-py3.7.egg/tobler/util/util.py:28: UserWarning: nan values in variable: median_household_income, replacing with 0
  warn(f"nan values in variable: {column}, replacing with 0")
/Users/knaaptime/anaconda3/envs/geosnap/lib/python3.7/site-packages/tobler-0.3.1-py3.7.egg/tobler/util/util.py:28: UserWarning: nan values in variable: p_poverty_rate, replacing with 0
  warn(f"nan values in variable: {column}, replacing with 0")
sd_2010.gdf.plot('n_vacant_housing_units')
<AxesSubplot:>

[plot: n_vacant_housing_units]

sd_2010.gdf.plot('p_poverty_rate')
<AxesSubplot:>

[plot: p_poverty_rate]

sd_2010.gdf.plot('median_household_income').plot()
[]

[plot: median_household_income]
sjsrey commented 3 years ago

I did a clean clone of both geosnap and tobler for this. Could there be something in your geosnap that is not upstream?

knaaptime commented 3 years ago

no i just installed it from master

cd geosnap; conda env create -f environment.yml
conda activate geosnap; python setup.py install
pip uninstall tobler -y  # uninstall conda version first 
cd ../tobler
python setup.py install  # install current master
sjsrey commented 3 years ago

This is in a clean clone of geosnap after conda env create -f environment.yml

(base) ~/D/g/g/s/geosnap ❯❯❯ conda activate geosnap
(geosnap) ~/D/g/g/s/geosnap ❯❯❯ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 9, in <module>
    with open("README.md", encoding="utf8") as file:
TypeError: 'encoding' is an invalid keyword argument for this function
sjsrey commented 3 years ago

What version of Python do you have locally (since it isn't spec'd in the environment.yml file)?

knaaptime commented 3 years ago
/Users/knaaptime/Dropbox/projects/geosnap master* ⇡ 19s
geosnap ❯ ipython
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:37:09)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.
sjsrey commented 3 years ago

yah but there is no ipython in that environment.yml

knaaptime commented 3 years ago

i had to install ipykernel manually so that jupyter would see it

sjsrey commented 3 years ago

After starting completely over, as in fresh clones into new directories, and repeating these steps:

cd geosnap; conda env create -f environment.yml
conda activate geosnap; python setup.py install
pip uninstall tobler -y  # uninstall conda version first 
cd ../tobler
python setup.py install  # install current master

I'm still getting the nan for the intensive variables when using geosnap but not tobler.

sjsrey commented 3 years ago

It turns out, plotting works even with nan values. So can you check the head to see if you are getting nans?
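
A quick way to check this without relying on a plot is to count the missing values directly. The sketch below uses a small made-up DataFrame standing in for the harmonized `gdf`; the column names follow the ones used above, but every value is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a harmonized long-form table (values are invented).
df = pd.DataFrame(
    {
        "year": [1990, 1990, 2000, 2000, 2010, 2010],
        "n_total_pop": [3780.0, 4109.1, 3883.0, 4258.8, 4156.0, 4629.0],
        "median_household_income": [
            30027.6, 27928.7, 40175.7, 37318.1, np.nan, np.nan,
        ],
    }
)

# Total number of missing values in each column.
nan_counts = df.isna().sum()

# Missing values per year show which slice of the table is affected.
nan_by_year = df["median_household_income"].isna().groupby(df["year"]).sum()

print(nan_counts)
print(nan_by_year)
```

Breaking the count down by `year` helps when a table has "lots of NaNs but lots of values", since `head()` only shows the first few rows.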

sjsrey commented 3 years ago
Screen Shot 2020-08-13 at 3 42 29 PM
sjsrey commented 3 years ago
Screen Shot 2020-08-13 at 3 50 29 PM

Is something getting duplicated?

knaaptime commented 3 years ago

was just going through the same thing. I have lots of NaNs but also lots of values

need to look into the harmonize code closer

knaaptime commented 3 years ago

was also wondering how the tests could be passing

knaaptime commented 3 years ago

i think i see what's going on

sjsrey commented 3 years ago

My hunch is that the problem is in the geosnap harmonization: tobler warns that it encountered NaNs and replaced them with 0s, so the NaNs we are seeing here are likely coming from some operation in the harmonize method.
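
One generic way NaNs can show up after an interpolation step that itself returned clean values is pandas index alignment when the results are attached to another table. The sketch below only illustrates that mechanism with invented data. It is an assumption about the kind of operation that could be responsible, not a diagnosis of geosnap's `harmonize` code.

```python
import pandas as pd

# Target frame that keeps its original row labels (invented values).
target = pd.DataFrame(
    {"geoid": ["a", "b", "c"], "n_total_pop": [10.0, 20.0, 30.0]},
    index=[5188, 5189, 5205],
)

# Hypothetical interpolated values that come back with a fresh 0..n-1 index.
interpolated = pd.Series([100.0, 200.0, 300.0], name="median_household_income")

# Column assignment aligns on index labels; no labels match, so every row is NaN.
misaligned = target.copy()
misaligned["median_household_income"] = interpolated

# Assigning the raw values bypasses alignment and keeps the numbers.
aligned = target.copy()
aligned["median_household_income"] = interpolated.to_numpy()

print(misaligned)
print(aligned)
```

If something like this is happening, comparing the index of the interpolated output with the index of the frame it is attached to should show the mismatch.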