rBatt / trawl

Analysis of scientific trawl surveys of bottom-dwelling marine organisms

Cleaning temperature data #22

Closed mpinsky closed 8 years ago

mpinsky commented 9 years ago

I have spent some time looking for outlier surface and bottom temperature values in each region that may be mistakes. There are some that I had not caught in the code for the 2013 Science paper. My latest cleaning code (from my range projection project) is below. It may be useful?

wcann$surftemp = NA # field is not collected, apparently (or was not provided)

# Newfoundland: temperatures are stored in tenths of a degree (one decimal place). Values >= 900 encode negatives (e.g. 905 means -0.5)
i = newf$surftemp >= 900 & !is.na(newf$surftemp)
newf$surftemp[i] = -(newf$surftemp[i] - 900)/10
i = newf$surftemp < 900 & newf$surftemp > 0 & !is.na(newf$surftemp)
newf$surftemp[i] = newf$surftemp[i]/10
i = newf$bottemp >= 900 & !is.na(newf$bottemp)
newf$bottemp[i] = -(newf$bottemp[i] - 900)/10
i = newf$bottemp < 900 & newf$bottemp > 0 & !is.na(newf$bottemp)
newf$bottemp[i] = newf$bottemp[i]/10

# Fix -9999 to NA for SST and BT
ai$BOT_TEMP[ai$BOT_TEMP==-9999] = NA
ai$SURF_TEMP[ai$SURF_TEMP==-9999] = NA
ebs$BOT_TEMP[ebs$BOT_TEMP==-9999] = NA
ebs$SURF_TEMP[ebs$SURF_TEMP==-9999] = NA
goa$BOT_TEMP[goa$BOT_TEMP==-9999] = NA
goa$SURF_TEMP[goa$SURF_TEMP==-9999] = NA

# The SST entries on Scotian Shelf in 2010 and 2011 appear suspect. There are very few (as opposed to >1000 in previous years) and are only 0 or 1. There are no entries in 2009.
scot$SURFACE_TEMPERATURE[scot$year %in% c(2009, 2010, 2011)] = NA

# Turn 0 values in GoMex to NA. These are outliers (way too cold) and must be mistakes.
i = which(gmex$TEMP_SSURF == 0)
gmex$TEMP_SSURF[i] = NA
i = which(gmex$TEMP_BOT == 0)
gmex$TEMP_BOT[i] = NA

# 0 values in ai and goa in July are much lower than the other values and seem suspect
ai$SURF_TEMP[ai$month == 7 & ai$SURF_TEMP==0] = NA
goa$SURF_TEMP[goa$month == 7 & goa$SURF_TEMP==0] = NA
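
The Newfoundland block above repeats the same two-step decoding for surftemp and bottemp. As a minimal sketch, the decoding could be wrapped in one function. The name `decode_newf_temp` is hypothetical, and the sketch assumes the encoding is exactly what the code above implies: values of 900 or more are negative tenths of a degree, and values between 0 and 900 are positive tenths.

```r
# Hypothetical helper; assumes the Newfoundland encoding described above.
decode_newf_temp <- function(x) {
  neg <- !is.na(x) & x >= 900          # coded negatives, e.g. 905 -> -0.5
  pos <- !is.na(x) & x > 0 & x < 900   # coded positives, e.g. 35 -> 3.5
  out <- x
  out[neg] <- -(x[neg] - 900) / 10
  out[pos] <- x[pos] / 10
  out                                  # NA and 0 pass through unchanged
}

decoded <- decode_newf_temp(c(905, 35, NA, 0, 900))
```

It would then be applied once per column, e.g. `newf$surftemp = decode_newf_temp(newf$surftemp)`, and likewise for `newf$bottemp`.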

rBatt commented 9 years ago

Nice, thanks.

Are these issues that need to be fixed in the website code?

If so, could you create an issue there and link to the lines of the code that need the change?

If you can make these changes to the website code yourself, could you link that commit (commit of corrections to website code) in a comment on this issue (issue of cleaning temperatures in trawl repo)?

If you don't make an issue that links to line numbers or do the commit (either of which would allow me to see what pieces of code were changed), could you tell me which corrections are the new ones?

Also, as a general approach, rather than specifying the year, month, etc. where an error exists, is there logic that can be applied more generally? I.e., is there something specific about the temperature value itself that is flawed? E.g., any temperature below a certain value could be set to NA, or any value that is way too cold for its region could be set to NA (e.g., if data is a data.table of trawl values, data[region=="gmex" & stemp < 5, stemp:=NA]).
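
A rough sketch of what such a value-based rule could look like in base R follows. The column names `region` and `stemp` come from the example above, but the function name `flag_too_cold` and the threshold values are placeholders for illustration, not vetted limits.

```r
# Hypothetical per-region lower bounds in degrees C; placeholder values only.
min_stemp <- c(gmex = 5, ai = -2)

# Set surface temperatures below the region's lower bound to NA.
# Regions without an entry in `limits` are left untouched.
flag_too_cold <- function(dat, limits) {
  lim <- limits[as.character(dat$region)]   # NA where a region has no rule
  bad <- !is.na(dat$stemp) & !is.na(lim) & dat$stemp < lim
  dat$stemp[bad] <- NA
  dat
}

trawl <- data.frame(
  region = c("gmex", "gmex", "ai", "scot"),
  stemp  = c(0, 24.5, 0, 0)
)
cleaned <- flag_too_cold(trawl, min_stemp)
```

With these placeholder limits only the gmex zero is set to NA. The ai zero survives because 0 is not below -2, so a pure threshold would not catch a value that is only suspect for a particular month.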

If it comes down to having a collection of manually identified errors, we should format them into a 2D structure, save them as a .csv or .txt file, then write code to update the object based on the contents of that file. That way we have a single file that explicitly states the manual corrections we're making (easier to track), and the code becomes less bloated.
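
As a minimal sketch of the corrections-file idea, assuming a layout with one rule per row: the column names (`region`, `column`, `year`, `bad_value`, `reason`), the function name `apply_corrections`, and the example rows are all hypothetical. The file is inlined as text here so the example is self-contained; in practice it would be read from a .csv on disk.

```r
# Hypothetical corrections table; NA in year or bad_value means "any".
corrections <- read.csv(text = "region,column,year,bad_value,reason
gmex,stemp,NA,0,zero is far too cold for the region
scot,stemp,2010,NA,very few entries and only 0 or 1
scot,stemp,2011,NA,very few entries and only 0 or 1", stringsAsFactors = FALSE)

# Apply each rule in turn, setting matching values to NA.
apply_corrections <- function(dat, corrections) {
  for (k in seq_len(nrow(corrections))) {
    rule <- corrections[k, ]
    hit <- dat$region == rule$region
    if (!is.na(rule$year)) hit <- hit & dat$year == rule$year
    if (!is.na(rule$bad_value)) {
      vals <- dat[[rule$column]]
      hit <- hit & !is.na(vals) & vals == rule$bad_value
    }
    dat[[rule$column]][which(hit)] <- NA   # which() drops any NA matches
  }
  dat
}

trawl <- data.frame(
  region = c("gmex", "gmex", "scot", "scot"),
  year   = c(2005, 2005, 2008, 2010),
  stemp  = c(0, 24.5, 11.2, 1)
)
cleaned <- apply_corrections(trawl, corrections)
```

The `reason` column is not used by the code; it is there so the file documents why each correction exists.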

Or, at the least, we could have a separate R script that executes some of the cleaning.

mpinsky commented 9 years ago

The OceanAdapt code doesn't deal with temperature (yet).

I don't believe there is any specific logic that could be used universally.

In reply to Ryan Batt's comment of Mon, Feb 2, 2015.

rBatt commented 9 years ago

@mpinsky I have not yet implemented these fixes, and they could be related to the low temperature values in #30. I see in your code that some of those fixes involve changing 0's in gmex to NA's.

I haven't gotten around to these yet because there isn't always a simple 1-1 comparison between our code.

I'll need to add this to the master list of data verifications that need to happen (along with taxonomic ID's changing)

rBatt commented 8 years ago

see #36