qdread / dasypop

Code accompanying Scientific Data paper

Issues raised by Nick Tate #2

Open qdread opened 1 year ago

qdread commented 1 year ago

I recently got contacted by Dr Nick Tate of the University of Leicester. He is trying to use our dataset and code for teaching purposes (woo hoo!). He raised a couple of issues, which made me want to fix a few things.

@khondula if you have any thoughts about any of this, please let me know!

1. Inconsistency about road classification in NLCD impervious surface descriptor

Nick noticed that there is an inconsistency in the code vis-a-vis the comments in dasypop_methods.md. Line 188 of that document is a comment saying:

```r
# Reclassify: keep classes 1-6 (non-road) and drop 7-14 (road)
```

Then Line 189 reclassifies 1-6 as 1 and 7-14 as NA, and Line 196 sets all non-NA pixels to 0. That makes it look like non-road pixels are being removed. Actually, though, if you look at the NLCD2016 impervious descriptor metadata, you can see that it is the comment that is wrong, not the code: 1-6 are roads and 7-14 are not roads, so the code is correct. When I was looking into this, I remembered that it was wrong at first but that I had corrected it at some point by judiciously adding a single ! on line 196. That explains why it is ass-backward that we make road into "non-NA" and then remove "non-NA." The code would be less confusing if we had made road NA and then removed all NAs, but it works the same either way. I will edit the comment in the code.
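
For anyone following along, here is a minimal sketch of the two equivalent formulations. This is not the actual repo code; the use of the terra package and the object/file names are my assumptions.

```r
library(terra)

# Hypothetical file name for the NLCD 2016 impervious surface descriptor raster
imp_desc <- rast("nlcd_2016_impervious_descriptor.tif")

# What the code effectively does now: roads (codes 1-6) become 1 (non-NA) and
# everything else becomes NA; the "!" downstream then flips the test so the
# non-NA (road) pixels are the ones zeroed out of the impervious surface layer.
rcl_current <- cbind(is = 1:14, becomes = c(rep(1, 6), rep(NA, 8)))
road_is_nonNA <- classify(imp_desc, rcl_current)

# Equivalent but less confusing: make roads NA directly, then mask, so that
# road pixels are dropped from the impervious surface layer.
rcl_clearer <- cbind(is = 1:14, becomes = c(rep(NA, 6), rep(1, 8)))
nonroad_mask <- classify(imp_desc, rcl_clearer)
# imp_surface <- mask(imp_surface, nonroad_mask)
```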

One thing that I did just notice, though, is that if you read the metadata closely, it looks like 7 and 8 are also roads, and 11 and 12 are energy production sites that probably don't have people living in them either. To be honest, it seems like we should have masked those classes out just as we did for the road ones. Any thoughts???
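
If we did decide to mask those classes too, it would only change the "becomes" column in the sketch above, i.e. treat only 9 and 10 (buildings) as potentially populated. Again, this is just a hedged illustration of the idea, not a decision or the current behavior:

```r
# Hypothetical stricter variant: only codes 9 and 10 (buildings) are kept as
# potentially populated; 1-8 and 11-14 are all masked out.
rcl_strict <- cbind(is = 1:14, becomes = c(rep(NA, 8), 1, 1, rep(NA, 4)))
populated_mask <- classify(imp_desc, rcl_strict)
```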

2. Accessibility of the correct version of NLCD2016

I remember that we discussed whether we should include the timestamped 2019-04-05 release of NLCD2016 with our data on FigShare so that people could reproduce our results exactly. Well, I thought that's what we had decided to do. In fact, the documentation in dasypop_methods.md states as much on Lines 99-100, and I quote:

The following Bash code unzips the two NLCD rasters (impervious surface and impervious surface descriptor) that are archived with the data. The MRLC website no longer hosts the version of NLCD 2016 (2019-04-05) that was used to generate our data product. Because of this we archived the zipped files with the data. Copy the two .zip archives into the folder input_data/NLCD within the current working directory before running the Bash code below.

OK cool, but on FigShare those timestamped NLCD2016 files are nowhere to be found. Apparently, there is a newer release of NLCD2016 that came out in 2021 with new impervious surface codes. Thus our current code fails on the newer NLCD2016 currently on the MRLC site. MRLC doesn't host older versions. Luckily, I found a site hosting the 2019-04-05 release so I could point Nick to it.

But my question is, did I just have a brain fart and forget to put those NLCD impervious files on FigShare, or did we decide not to put them there and I forgot about it? I feel like it's the former. Anyway, I think a short-term solution is to point people to the ScienceBase site where the files are currently hosted, but really we should just put the 2019-04-05 NLCD impervious files onto FigShare. Sound good?

khondula commented 1 year ago
  1. agree, seems like only classes 9 and 10 in the impervious descriptor layer make sense to keep!

  2. I don't remember either but makes sense to me. could make a new gh release with a note about the change in codes?

qdread commented 1 year ago

@khondula thanks for taking a look! I will certainly make a new release that fixes the incorrect documentation and, at minimum, adds a note saying the new NLCD2016 release won't work with our code and pointing people toward where the 2019 version is hosted. Eventually I will put a copy of it on FigShare too.

As for actually redoing the dasy with everything except 9 and 10 removed, I am not sure it would be worth it. Maybe just make a note in the docs? (lazy way out)

khondula commented 1 year ago

true... is there any easy way to tell how much of the area (what fraction) those categories account for?

qdread commented 1 year ago

I'll look into that.

qdread commented 1 year ago

In 471ff91 I made a note about it potentially being better to remove other codes besides just 1-6. Once we make a final decision about what to do about the availability of the April 2019 NLCD release and about the possibly suboptimal way we excluded roads, I can make a new GitHub release, which should automatically update the "latest" DOI on Zenodo.

qdread commented 1 year ago

I ran `gdalinfo -hist` on the impervious descriptor raster image. Here is the output. As a reminder, 1-6 are the ones we originally called roads, 7 and 8 are unpaved roads, 9 and 10 are buildings (so we think most probably populated), and 11 and 12 are things like well pads. 13 and 14 do not appear to be used.


value | count
--- | ---
0 | 8466589065
1 | 3351751
2 | 9466704
3 | 14079487
4 | 35218304
5 | 64296957
6 | 169216155
7 | 1595
8 | 35079747
9 | 129783671
10 | 48409500
11 | 13018
12 | 2358094
13 | 0
14 | 0

The sum of codes 7, 8, 11, and 12 is ~37 million pixels (no pixels are 13 or 14), with the vast majority of those being code 8 (~35 million). The sum of codes 9 and 10, which are the pixels most likely to be populated, is ~178 million. So the sum of 7-12, which we originally assumed were populated, is ~215 million. That means that if we went from saying 7 through 12 are populated to saying only 9 and 10 are populated, the number of populated pixels among which we distribute population would decrease by about 17%. Not a negligible number, in my thinking. @khondula I am inclined to just leave this as a cautionary note instead of trying to correct or modify the publication, as "all models are wrong," but it does seem like a pretty meaningful difference.
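
For reference, the percentages above can be reproduced from the histogram counts in a few lines. This is just a back-of-the-envelope check in R, with counts copied from the `gdalinfo` output; the variable names are only for illustration.

```r
# Pixel counts for descriptor codes 1-14, from the gdalinfo -hist output above
counts <- c(3351751, 9466704, 14079487, 35218304, 64296957, 169216155,
            1595, 35079747, 129783671, 48409500, 13018, 2358094, 0, 0)
names(counts) <- 1:14

buildings   <- sum(counts[c("9", "10")])             # ~178 million pixels
other_nonrd <- sum(counts[c("7", "8", "11", "12")])  # ~37 million pixels
all_nonroad <- buildings + other_nonrd               # ~215 million pixels

# Fraction of currently-"populated" pixels that would be dropped if only
# codes 9 and 10 were treated as populated
round(100 * other_nonrd / all_nonroad, 1)  # ~17.4%
```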

khondula commented 1 year ago

Thanks for checking on that @qdread ! That does seem high enough that it could affect things, though it's hard to say without looking into it more. Maybe it could go in a "known issues" section in the readme?