lter / soilHarmonization

Homogenize LTER Soil Organic Matter Working Group data and notes
https://lter.github.io/soilHarmonization/

import and parsing errors can distort data #40

Open srearl opened 4 years ago

srearl commented 4 years ago

When reading data, import errors can silently change data contents. This is best explained with an example; we will use CDR_E141.

Among others, two treatment variables are encoded in the data file e141_Plant aboveground biomass data: Water Treatment and Temp Treatment. Because so few of the >65K records contain data in those columns, the read_sheet function guesses their type as logical, then sets the records where data do exist to missing because those values do not conform to that type. As a result, the treatment variables (tx_L3, tx_L6) are present in the homogenized output (e141_Plant aboveground biomass data_HMGZD) but their data are omitted.
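The failure mode can be reproduced outside of Google Sheets. A minimal sketch using readr (whose column-type guessing from the first `guess_max` rows mirrors the googlesheets4 behavior described above; the column names and row counts here are invented for illustration):

```r
library(readr)

# a CSV where the treatment column is empty for the first 1200 rows
# and only carries data after row 1200
csv <- paste(
  c(
    "id,treatment",
    paste0(seq_len(1200), ","),   # rows 1-1200: treatment is empty
    paste0(1201:1500, ",warmed")  # rows 1201-1500: treatment has data
  ),
  collapse = "\n"
)

# guessing from only the first 1000 rows: the all-missing column is
# typed as logical, and the later 'warmed' values fail to parse (NA)
d_default <- read_csv(I(csv), guess_max = 1000, show_col_types = FALSE)

# guessing from all rows types the column as character and keeps the data
d_full <- read_csv(I(csv), guess_max = Inf, show_col_types = FALSE)
```

With the default-style guessing, `d_default$treatment` is logical and entirely NA; with `guess_max = Inf`, `d_full$treatment` is character and the 300 `warmed` values survive.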

@wwieder

wwieder commented 4 years ago

do you have any idea how often this is happening, @srearl? If it's isolated, could we add a value to the missing values in these columns (e.g. 'NA', which is the control identifier)? This wouldn't be terrible to do manually for a few datasets, but it is less desirable if it's happening frequently.

srearl commented 4 years ago

I think for any file where there is not a value in the first 1000 rows of data (the default import setting), so I suspect this is quite rare. I can rehomog CDR_E141. I can think about a way to look through the data to see if and how many cases there are with 1000 leading missing values, but can you picture that kind of data structure occurring very often?

wwieder commented 4 years ago

I don't think this happens often; I'd be surprised if we have many datasets that are >1000 rows long. Is there a simple way to flag this when aggregating hmgz files? It may still take a manual look at particular datasets, but then we won't be surprised by this issue in the future.

Would a workaround be to just add NA to the first few open rows, so the rest are actually read in?
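A check along these lines could run when aggregating homogenized files. A minimal sketch (the helper name `flag_suspect_columns` and the example table are hypothetical, not part of the package): a column that was read as logical but contains no data at all is the signature left behind when type guessing silently coerces sparse text values to NA.

```r
# flag columns that were typed as logical but are entirely NA --
# a likely sign that sparse values were dropped during import
flag_suspect_columns <- function(df) {
  suspect <- vapply(
    df,
    function(x) is.logical(x) && all(is.na(x)),
    logical(1)
  )
  names(df)[suspect]
}

# example: a homogenized table whose tx_L3 column was emptied by coercion
hmgzd <- data.frame(
  biomass = c(1.2, 3.4),
  tx_L3   = as.logical(c(NA, NA))
)
flag_suspect_columns(hmgzd)  # "tx_L3"
```

This would over-flag legitimately all-empty columns, so flagged files would still need the manual look described above.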

srearl commented 4 years ago

I can change the import settings to rehomog these CDR data and address the problem for these data (and yet another tarball), but changing the import settings to capture this kind of structure is hugely memory intensive, so it is not a good default generally. For the future, yeah, I think we could add a flag for this or, at least, note it in the documentation.

wwieder commented 4 years ago

This seems fair, Stevan. All of the CDR data sets are massive. Do we need to do this for multiple files of their data?

srearl commented 4 years ago

CDR 141 rehomogenized with a temporary adjustment to the guess_max param of sheet_download:

sheet_download <- function(fileId, skipRows, missingValueCode) {

  # default to not skipping any rows
  if (missing(skipRows)) {
    skipRows <- 0
  }

  # default missing value code
  if (missing(missingValueCode)) {
    missingValueCode <- "NA"
  }

  # guess_max = Inf guesses column types from all rows rather than
  # only the first 1000, at the cost of considerably more memory
  dataFile <- googlesheets4::read_sheet(
    ss = fileId,
    skip = skipRows,
    na = missingValueCode,
    guess_max = Inf
  )

  return(dataFile)

}