Explore MissingValue approaches

GoogleCodeExporter commented 8 years ago

This is related to issue 114 and how Missing Values are handled in Python.

The problem:
The current approach of raising a warning and setting Nulls to None is clunky 
at best. Warnings pollute test output when missing values are expected (ArcGIS 
DBF weights). The Python syntax for catching warnings is unfriendly and 
confusing. On top of that, Null values are perfectly valid in many cases: most 
databases allow for missing values.
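For reference, here is a minimal sketch of what catching such a warning looks like with the standard `warnings` module. The `read_with_missing` function is a hypothetical stand-in for a reader that warns on missing values; it is not pysal's actual API.

```python
import warnings


def read_with_missing():
    # Hypothetical reader: warns once and substitutes None for the missing value.
    warnings.warn("Missing value encountered", RuntimeWarning)
    return [1, 2, None, 3]


# Record warnings instead of letting them print to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    values = read_with_missing()

# Each recorded entry carries the warning category and message.
messages = [str(w.message) for w in caught]
```

Every caller that expects missing values has to repeat this boilerplate, which is the unfriendliness described above.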

The default replacement of None causes all kinds of problems too.

>>> import numpy as np
>>> X = np.array([1,2,None,3])
>>> X*2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

This forces users to check for missing values before every computation, or 
risk getting cryptic errors back from pysal's methods.

Suggestions:
1. Disable the warnings
The warnings seem redundant and inappropriate. Missing values are allowed at 
the database level, and an error will be raised anyway if computations are 
performed on None values (this no longer holds if suggestion 2 is followed).

2. Replace missing values with float('nan')
Sticking None into an array of floats is a bit odd, since None is not a valid 
float and can't be treated as such.
>>> 5+None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
Not-A-Number, on the other hand, is a valid float and can be used in calculations.
>>> 5+float('nan')
nan

However, nan can produce some strange results if you don't look for it.
>>> x = float('nan')
>>> x!=x #this is how you test for nan
True
>>> max([x,1])
nan
>>> max([1,x])
1
>>> #why? because,
>>> x>=1
False
>>> x<=1
False
>>> np.nanmax([x,1])
1.0
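
A minimal sketch of how suggestion 2 could look with NumPy, assuming the values are read into a plain Python list first. The variable names are illustrative only and do not come from pysal.

```python
import numpy as np

raw = [1, 2, None, 3]

# Requesting a float dtype makes NumPy store None as nan
# instead of building an object array.
X = np.array(raw, dtype=float)

doubled = X * 2          # arithmetic works; nan propagates to the result
missing = np.isnan(X)    # boolean mask marking the missing entries
total = np.nansum(X)     # sum that skips nan
largest = np.nanmax(X)   # max that skips nan
```

The nan-aware functions have to be chosen explicitly; the plain `X.sum()` or `X.max()` would return nan here.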

Original issue reported on code.google.com by schmi...@gmail.com on 7 Dec 2011 at 7:55

GoogleCodeExporter commented 8 years ago

This has been a recurrent question when I present GeoDaSpace, particularly to 
economists. Many packages just completely remove the full row for all the 
computations, but I think that's trickier in the spatial case.

Would it be worth taking a look at masked arrays?

http://docs.scipy.org/doc/numpy/reference/maskedarray.html
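
As a rough sketch of the masked-array idea, assuming missing values have already been converted to nan (the names below are illustrative):

```python
import numpy as np

X = np.array([1.0, 2.0, np.nan, 3.0])

# Mask every nan (and inf) entry; the underlying data is left untouched.
masked = np.ma.masked_invalid(X)

mean = masked.mean()      # computed over the unmasked entries only
n_valid = masked.count()  # number of unmasked entries
doubled = masked * 2      # the mask carries through arithmetic

# Replace masked entries with a sentinel when a plain ndarray is needed.
filled = doubled.filled(-1)
```

Unlike dropping the full row, the array keeps its original length, so positions still line up with the observations in a spatial weights object.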

Original comment by dreamessence on 13 Dec 2011 at 3:57

GoogleCodeExporter commented 8 years ago

Original comment by schmi...@gmail.com on 30 Jan 2012 at 2:59

Added labels: Milestone-1.4

GoogleCodeExporter commented 8 years ago

Go with suggestion 1. Leaving None values in place lets users identify 
problems, because a traceback will be raised when math is attempted on them.

Original comment by schmi...@gmail.com on 1 May 2012 at 9:10

GoogleCodeExporter commented 8 years ago

Warning disabled in r1267.

Original comment by schmi...@gmail.com on 12 Jul 2012 at 9:28

Changed state: Fixed
