lgsmith295 / pcrecon

Principal Component Regression analysis for tree ring data
MIT License

move file name handling from load_crns to parse_json #27

Closed djhocking closed 4 years ago

djhocking commented 4 years ago

Move file name handling from load_crns to parse_json, because it's too complicated to parse it from the highly inconsistent file naming system in the ITRDB. Hopefully the metadata in the json files are sufficient to do this. I broke the load_crns function when trying to do it there, and made it convoluted and unclear in the process.

nnnagle commented 4 years ago

Putting this here in case it helps. I had a script that downloaded all crns for North America. It:

- defines a function that reads the crn header
- reads in all files with a .crn extension from a directory
- filters out exotic crns (latewood, etc.)
- loops through the files, processing the header and extracting the lat/lon
- throws away anything not formatted correctly (5% iirc)

IIRC (this was a few years ago), I found inconsistencies between the metadata and the crn header and decided to go with the crn header.

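library(dplyr)  # provides %>%, filter() and mutate() used below

# Read the fixed-width header of one .crn file.
# Returns a one-row tibble, or NULL if dplR cannot read the file.
read.crn.head <- function(fname){
  header <- readLines(fname, n=4)
  crn <- try(dplR::read.crn(fname))
  # inherits() is the robust way to test for a failed try()
  if(inherits(crn, 'try-error')) return(NULL)
  # data_frame() is deprecated; tibble() is its replacement
  tibble::tibble(site_id=substr(header[[1]],1,6),
                 site_name=substr(header[[1]], 10, 61),
                 species_code=substr(header[[1]], 61, 65),
                 elev = substr(header[[2]], 42,46),
                 lat_lon=substr(header[[2]], 48,57),
                 years=substr(header[[2]], 68, 76),
                 first=row.names(crn)[1],
                 last=tail(row.names(crn),1),
                 fname=fname)
}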

# Get list of files in itrdb folder
files <- list.files('~/Dropbox/git_root/climate-bayes/data/itrdb', full.names=TRUE)
# Filter file names to be "standard" chronologies.  No exotics like "latewood", "arstan" etc.
files <- files[grepl('[[:digit:]]\\.crn', files)]
# Feed files through my read.crn.head
temp <- lapply(files,read.crn.head)
# rbind that into a data.frame
temp <- do.call(rbind, temp)
# filter that data.frame to have valid lat-lon strings and valid year ranges
# (row names are character, so convert the last year to numeric before comparing)
df <- temp %>% filter(grepl(lat_lon, pattern='[[:digit:]]{4}-[[:digit:]]{5}')) %>% filter(as.numeric(last)<2015)
# Create properly formatted lat/lon coordinates (convert from ddmm / dddmm to decimal degrees)
# Note: lon comes out unsigned; negate it if western longitudes should be negative
df <- df %>% mutate(lat=as.numeric(substr(lat_lon, 1, 2))+as.numeric(substr(lat_lon, 3, 4))/60,
                    lon=as.numeric(substr(lat_lon, 6, 8))+as.numeric(substr(lat_lon, 9, 10))/60)

djhocking commented 4 years ago

Thanks, this is all really something special. We hadn't thought about parsing the header and had been going to the metadata. It's nice to see this as an example. I had something like what you did:

# Filter file names to be "standard" chronologies.  No exotics like "latewood", "arstan" etc.
files <- files[grepl('[[:digit:]]\\.crn', files)]

But there were some standard chronologies that Laura wanted that were just tn.crn with no numbers, so those would get skipped. Now, looking at the header info, there are standardization and chronology type codes that we could use. Then all files could be parsed and subsequently filtered. Although maybe this is all a bit of a waste if people are really only interested in the standard chronologies for these big automated projects.

It would be interesting to look more closely at the inconsistencies between the metadata and the header at some point. If it weren't so terrible to work with, someone should do an audit of the data and those inconsistencies (and also make a real database with QA/QC at some point).

lgsmith295 commented 4 years ago

FWIW, the large-scale seasonal precipitation atlas that just came out did include earlywood widths in the analysis since they were looking for datasets with winter/early spring precip signals. And the Idaho lab is working on a big blue light reflectance (cell density) dataset network for temperature recons. So I think being able to parse and filter these is useful.

djhocking commented 4 years ago

@lgsmith295 - that's a good point

Strategy:

  1. Vector of location names (e.g. nm537) from spatial filtering = crns
  2. Vector of all filenames in the directory = files
  3. Filter to filenames containing the location names (e.g. nm537.crn and nm537r.crn)
  4. Parse those headers into a dataframe including info about "exotics"
  5. Filter by measurement types and chronology types #23
  6. Pass the resulting list of filenames to the read_crn function

This should get all the readable and parseable data from the directory with a single chronology and measurement type. Then if the user wants to add another measurement type, they could run the function again and combine data sets. Alternatively, we could add the ability to have multiple types within a call and pass those to the filter (e.g. type_m = c("early", "late"), filter(measure %in% type_m)).
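
A rough sketch of steps 3, 5 and 6 in base R, in case it helps to see the shape of it. Everything here is a made-up stand-in: the file names, the `parsed` table (which takes the place of real header parsing in step 4), and the `measure` / `chron` labels are assumptions for illustration, not what the package currently does.

```r
# Hypothetical inputs: site codes from spatial filtering and a directory listing.
crns  <- c("nm537", "az082")
files <- c("nm537.crn", "nm537r.crn", "az082e.crn", "az0820.crn",
           "co511.crn", "notes.txt")

# Step 3: keep files named <site code>[one optional letter].crn.
# Anchoring the pattern stops "az082" from also matching "az0820.crn".
pattern <- paste0("^(", paste(crns, collapse = "|"), ")[a-z]?\\.crn$")
keep <- files[grepl(pattern, files, ignore.case = TRUE)]

# Step 4 stand-in: a hand-written table in place of real header parsing.
parsed <- data.frame(
  fname   = keep,
  measure = c("ring_width", "ring_width", "early"),
  chron   = c("standard", "residual", "standard"),
  stringsAsFactors = FALSE
)

# Step 5: allow several measurement types in one call.
type_m <- c("early", "late")
selected <- parsed[parsed$measure %in% type_m & parsed$chron == "standard", ]

# Step 6: these file names would then be passed on to read_crn().
selected$fname
```

Taking `type_m` as a vector from the start means the single-type case is just a vector of length one, so the same filter covers both options.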

djhocking commented 4 years ago

Worse than I realized. The gap on the second header line (between columns 57 and 67), where _R is written when the file is a residual chronology, is not officially assigned to anything in the ITRDB documentation. But it is frequently filled with all sorts of unidentified things. The various numbers are presumably standard chronologies. There is apparently no systematic way to get the chronology type from the header info. I need to go back to the super well organized and consistent metadata (drips sarcasm) after 2 days of working on this and thinking I finally had it working.

a part of soul dies. begins sobbing

djhocking commented 4 years ago

I've become convinced that there is no way to globally, cleanly parse the measurement type and chronology type from filenames, file headers, or metadata. The metadata actually might be the least clean option. I think that filenames might be the best way to get most of the info. But if a file has both a chronology alternative and a measurement alternative, then only one shows up in the name (e.g. latewood residual chronology). The measurement type seems of most interest, so the focus should be on getting the standard chronologies for each measurement type. It also seems the least problematic in the header data (the metadata are terrible: neither consistently computer readable nor easily human readable).

New Strategy: parse by filename and check in header and warn or throw error if they are in true conflict (as opposed to just missing in one piece - then just warn or message). Use the last 5 characters to check the ending (e.g. [2 alpha][digits]**[1 alpha][.crn]) and check that the whole thing before the .crn has more than 2 digits. This will only work in the US but that will have to do for now. Separate parsers can be made in the future upon request.
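
A minimal sketch of what that could look like in base R. Both helpers are hypothetical names, not existing package functions. The pattern assumes US-style names of two letters, three or more digits, and at most one trailing letter, which is one reading of the rule above. The sketch only warns on a true conflict; swapping `warning()` for `stop()` would give the stricter behaviour.

```r
# Hypothetical helper: split a US-style ITRDB file name into its parts.
parse_crn_filename <- function(fname) {
  base <- tolower(basename(fname))
  # two letters, three or more digits, at most one trailing letter, then .crn
  pattern <- "^([a-z]{2})([0-9]{3,})([a-z]?)\\.crn$"
  m <- regmatches(base, regexec(pattern, base))[[1]]
  if (length(m) == 0) return(NULL)  # name does not fit the assumed scheme
  list(site_id = paste0(m[2], m[3]), suffix = m[4])
}

# Hypothetical helper: compare a code taken from the file name with the
# one found in the header. Inputs are single strings ("" or NA if absent).
check_sources <- function(from_name, from_header) {
  missing_one <- is.na(from_name) || is.na(from_header) ||
    !nzchar(from_name) || !nzchar(from_header)
  if (missing_one) {
    message("value present in only one source")
    return("missing")
  }
  if (!identical(toupper(from_name), toupper(from_header))) {
    warning("file name and header disagree: ", from_name, " vs ", from_header)
    return("conflict")
  }
  "ok"
}
```

With this, `parse_crn_filename("nm537r.crn")` gives site_id "nm537" and suffix "r", while a name like tn.crn returns NULL and would need separate handling.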

djhocking commented 4 years ago

Just a dump of the metadata code using rjson rather than jsonlite which got a bit more info. Probably not useful since the metadata are so terrible but at least this will be somewhere if we need to find it again:

files <- list.files(dir)

# remove manifest
files <- files[which(files != "manifest.json")]
files <- files[stringr::str_detect(files, "\\.json$")] # only json files
# files <- files[!(stringr::str_detect(files, "-noaa"))] # deal with noaa separately?

n_files <- length(files)

# set up empty item to make into a dataframe
df_meta <- NULL

i <- 1 # index of the file to inspect (this was run interactively, one file at a time)
df <- rjson::fromJSON(file = file.path(dir, files[i]))

df$studyCode
df$earliestYearCE
df$mostRecentYearCE
df$site$siteName
df$site[[1]]$geo$geometry$coordinates
df$site[[1]]$geo$properties$southernmostLatitude
df$site[[1]]$geo$properties$northernmostLatitude
df$site[[1]]$geo$properties$westernmostLongitude
df$site[[1]]$geo$properties$easternmostLongitude
df$site[[1]]$geo$properties$minElevationMeters
df$site[[1]]$geo$properties$maxElevationMeters
df$site[[1]]$paleoData[[1]]$species[[1]]$speciesCode
df$site[[1]]$paleoData[[1]]$species[[1]]$scientificName
df$site[[1]]$paleoData[[1]]$species[[1]]$commonName
df$site[[1]]$paleoData[[1]]$dataFile[[1]]$urlDescription
df$site[[1]]$paleoData[[1]]$dataFile[[1]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[1]]$variables
df$site[[1]]$paleoData[[1]]$dataFile[[2]]$urlDescription
df$site[[1]]$paleoData[[1]]$dataFile[[2]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[2]]$variables
df$site[[1]]$paleoData[[1]]$dataFile[[2]]$variables[[1]]$cvShortName

length(df$site[[1]]$paleoData[[1]]$dataFile)
df$site[[1]]$paleoData[[1]]$dataFile[[1]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[2]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[3]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[4]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[5]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[6]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[7]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[8]]$linkText
df$site[[1]]$paleoData[[1]]$dataFile[[9]]$linkText

df$site[[1]]$paleoData[[1]]$dataFile[[1]]$variables[[1]]$cvDataType # subscript out of bounds
df$site[[1]]$paleoData[[1]]$dataFile[[9]]$variables[[1]]$cvDataType
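
If this ever gets picked up again, the repeated `dataFile[[k]]` lines and the subscript error could probably be avoided by looping over the data files and guarding against empty `variables`. A sketch under assumptions: the `df` below is a tiny made-up list that only mimics the nesting seen above, not real ITRDB metadata.

```r
# Hypothetical stand-in for the list returned by rjson::fromJSON();
# only the nesting used above is reproduced, and the values are made up.
df <- list(site = list(list(paleoData = list(list(dataFile = list(
  list(linkText = "a.crn", variables = list(list(cvDataType = "x"))),
  list(linkText = "b.crn", variables = list())
))))))

data_files <- df$site[[1]]$paleoData[[1]]$dataFile

# One linkText per data file, however many there are.
link_text <- vapply(data_files, function(f) {
  if (is.null(f$linkText)) NA_character_ else f$linkText
}, character(1))

# cvDataType of the first variable, or NA when a file lists no variables.
data_type <- vapply(data_files, function(f) {
  if (length(f$variables) == 0) return(NA_character_)
  v <- f$variables[[1]]$cvDataType
  if (is.null(v)) NA_character_ else v
}, character(1))
```

Checking `length(f$variables)` before indexing is what keeps a file with no variables from raising the subscript error.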