topepo / AmesHousing

Ames IA Housing data from De Cock (2011)

Missing Geo Codes #2

Open dmi3kno opened 6 years ago

dmi3kno commented 6 years ago

I think geo-location matching is an awesome feature, as it allows cross-referencing this dataset against census data and other cool things. I tried to find some more matches using the Property Search Form and found that in certain cases the authorities adjust the PID slightly to accommodate a new version of the record (at least that is what I understood to be happening).

In a couple of instances I was able to match the observations in the dataset to the records using Lot Area, Total Living Area, Year Built and other variables. The Comparables Search Form was useful for that.

Here's the result of my efforts:

library(tibble)  # provides tribble()

tribble(~PID,         ~newPID,      ~Longitude, ~Latitude,
        "0904351040", "0904351045", -93.657573, 42.025255,
        "0535300120", "0535300125", -93.620016, 42.040931,
        "0902401130", "0902401135", -93.610044, 42.028556,
        "0902477120", "0902477125", -93.60520,  42.02223,  # 116 BORNE AVE
        "0902477130", NA,           NA,         NA,
        "0906226090", "0906226090", -93.679109, 42.031623,
        "0908154040", "0908154045", -93.676608, 42.017026,
        "0909129100", "0909129105", -93.652027, 42.019713,
        "0912251110", "0912251115", -93.58979,  42.01829,  # 412 FREEL DR  414
        "0914465040", "0914465043", -93.607252, 41.996672,
        "0902103150", "0902103145", -93.620050, 42.032611,
        "0902401120", "0902401125", -93.610008, 42.028397,
        "0904101170", NA,           -93.65616,  42.03112,  # 1003 HYLAND AVE
        "0909201110", "0909201115", -93.64663,  42.01946,  # 319 LYNN AVE
        "0916253320", "0916256880", -93.647403, 42.001694,
        "0916477060", "0916477065", -93.647199, 41.993986,
        "0916252170", "0916256455", -93.64621,  42.00124,  # 2412 HAMILTON DR
        "0916325040", "0916325045", -93.651077, 41.999840
        )

Longitude and Latitude were geocoded using Google Maps. I hope you will find these useful and include them in the next version of the package.

topepo commented 6 years ago

Thanks for finding these. It took a while to get decent coordinates for a few dozen houses.

Please double check that these were merged in correctly. I changed the PIDs in the geo data set but not the raw; make_ames does those corrections.
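
A minimal sketch of what such a PID correction step could look like before a join, written in base R so it runs without extra packages. This is not the actual make_ames() code: the data frames `raw` and `pid_fix`, the `PID_fixed` column, the made-up PID "0000000001" and the `Lot_Area` values are all assumptions for illustration. Only the two old/new PID pairs are copied from the table above.

```r
# Hypothetical stand-in for the raw data: two PIDs from the table above plus
# one made-up PID that needs no correction. Lot_Area values are invented.
raw <- data.frame(
  PID      = c("0904351040", "0535300120", "0000000001"),
  Lot_Area = c(9000, 8500, 10000),
  stringsAsFactors = FALSE
)

# Old-to-new PID lookup (pairs copied from the table above).
pid_fix <- data.frame(
  PID    = c("0904351040", "0535300120"),
  newPID = c("0904351045", "0535300125"),
  stringsAsFactors = FALSE
)

# match() gives the row of pid_fix for each raw PID, or NA when there is none.
idx <- match(raw$PID, pid_fix$PID)

# Use the corrected PID where one exists; otherwise keep the original.
raw$PID_fixed <- ifelse(is.na(idx), raw$PID, pid_fix$newPID[idx])

print(raw)
```

Keeping the corrected key in a separate column leaves the original PID untouched, so the change stays visible and reversible.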

Did you notice any erroneous values in the data when you worked with it on Kaggle? I see two properties that are always (wildly) mis-predicted in the demo models that I've run.

I'll target sending this to CRAN at the end of the week, so let me know if you see anything wrong with the updates for these two issues before then.

dmi3kno commented 6 years ago

I will have a look at the merged data later today. Bruce Hoppe tipped me off about the data download link on the Ames Assessor's Office website, which makes it possible to download 20k+ records at once.

The houses that I most struggled with were also the ones Prof De Cock described in the notes:

There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations) before assigning it to students.
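
The recommendation in the note can be sketched as follows, in base R on toy data. The column names `Gr_Liv_Area` and `Sale_Price` are assumed to follow the naming of make_ames() output and may need adjusting; every value below is invented and none comes from the Ames data.

```r
# Toy data: values are invented for illustration, not taken from the Ames data.
homes <- data.frame(
  Gr_Liv_Area = c(1500, 2200, 4500, 5600, 1800),
  Sale_Price  = c(180000, 250000, 750000, 160000, 200000)
)

# Flag houses with more than 4000 square feet of living area.
homes$large <- homes$Gr_Liv_Area > 4000

# Keep only houses at or below the 4000 square foot cutoff.
homes_trimmed <- homes[!homes$large, ]

# The suggested diagnostic plot: sale price against living area,
# with the flagged houses in red. Drawn only in an interactive session.
if (interactive()) {
  plot(homes$Gr_Liv_Area, homes$Sale_Price,
       xlab = "Gr Liv Area (sq ft)", ylab = "Sale Price",
       col = ifelse(homes$large, "red", "black"), pch = 19)
}

print(homes_trimmed)
```

Flagging before filtering keeps the unusual rows available for inspection, which matters if they are to be investigated rather than discarded.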

I did a little bit of research on Kaggle regarding those five houses (three "partial sale" houses in Edwards and two upscale in Northridge):

In fact, all three of the Edwards properties were sold within three months by the same company. The lots are located next to each other on Cochrane Pkwy, Ames, IA. I guess the story was that the developer was either in trouble or wanted to boost sales, so they were signing sales contracts on half-finished houses in a new development area at a deep discount. Those few houses were likely the first to be built in the whole residential block that followed.

Two of the upscale houses in Northridge (sold at 745 and 755 thousand, respectively) are located next to each other. The houses are, of course, eye candy (at least to my taste). They are outliers with regard to size in their own neighborhood!

If you decide to drop these five properties, I will feel no sorrow. Everything else is kind of precious and I would make an effort to keep it.

topepo commented 6 years ago

Those are the ones. I'll keep them in since they are a good example of data investigation (hopefully students will be able to identify them).

I will have a look at the merged data later today. Bruce Hoppe tipped me on the data download link on Ames Assessors Office website which makes it possible to download 20k+ records at once.

That would have saved me some time!

dmi3kno commented 6 years ago

The join seems fine. I geo-coded the remainder of the data points.

The following locations are interpolated from neighboring PIDs and verified on Google.

tribble(~PID,         ~newPID, ~Longitude, ~Latitude,
        "0902477120", NA,      -93.605207, 42.023218, # empty lot, 505 E Lincoln Way, Ames, IA
        "0902477130", NA,      -93.605421, 42.023222, # empty lot, 509 E Lincoln Way, Ames, IA
        "0912251110", NA,      -93.588227, 42.018300, # empty lot, 412 Freel Drive, Ames, IA
        "0904101170", NA,      -93.657031, 42.031281, # empty lot, 1015 N Hyland Ave
        "0909201110", NA,      -93.647024, 42.019272  # now condos, 2309 Knapp St, Ames
        )

topepo commented 6 years ago

Pretty impressive! Just added them.

dmi3kno commented 5 years ago

I started using AmesHousing and find it slightly inconvenient that there is a function make_ames() but no make_ames_geo(). In particular, ames_raw can no longer be joined to ames_geo outside of the make_ames() function because of unmatched PIDs. I understand that you added a "comma" after the PID to differentiate between "original" and "modified" lon-lat pairs. Not only does it make the PID field slightly inconsistent and ugly, but it also raises new questions.
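
A quick way to see which keys fail to join is to compare the two PID sets directly. The sketch below uses base R on hypothetical vectors: `raw_pids` and `geo_pids` are stand-ins rather than the real ames_raw and ames_geo columns, and "0000000001" is a made-up PID.

```r
# Hypothetical PID vectors standing in for the keys of the two data sets.
raw_pids <- c("0904351040", "0535300120", "0000000001")
geo_pids <- c("0904351045", "0535300125", "0000000001")

# PIDs present in the raw data but absent from the geo data.
unmatched_in_raw <- setdiff(raw_pids, geo_pids)

# PIDs present in the geo data but absent from the raw data.
unmatched_in_geo <- setdiff(geo_pids, raw_pids)

print(unmatched_in_raw)
print(unmatched_in_geo)
```

Running both directions of the comparison shows whether the mismatch comes from renamed keys (as here) or from records missing on one side only.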

1) Where does the "original" lon-lat data come from? It looks like plain geocoding from addresses found by PID (but then you should have been missing 0916403040, 0911175410 and 0923125030).
2) Is there a better way to record data provenance? I think a script such as make_ames() is a nice way of being explicit about the changes.
3) What prevents us from recording the original street address? It can easily be reverse-geocoded, but I think it makes perfect sense to store a valid mailing address, which should be easy to find by the PID.
4) Would you consider including more spatial data in the package for pretty printing? Some ideas: