rascully / Stream-Monitoring-Data-Exchange-Specifications

Description of the data exchange specifications for in-stream tributary habitat data.
4 stars 1 forks source link

Resolve "NA" string/text within numeric columns #41

Open edlabo opened 1 year ago

edlabo commented 1 year ago

Issue: "NA" string/text values appear within columns that should be strictly numeric data types, which results in columns being read by software as text or string rather than numeric.

Issue location(s): Analysis Ready data table @ https://github.com/rascully/Stream-Monitoring-Data-Exchange-Specifications/blob/master/DataIntegrationExample/data/AnalysisStreamHabitatMonitoringMetricDataset.csv ---> e.g., in the "Sin", "AvgBFWDRatio", "BFHeight", "Grad", etc. numeric-containing columns contain text/string "NA" values.

Location(s) in code: possibly from Line 356 below which seems to replace all blank cells/entries (indicated by '') to string 'NA': https://github.com/rascully/Stream-Monitoring-Data-Exchange-Specifications/blob/4b595a31b0e8273e405b683e28309e2b68119a5e/DataIntegrationExample/R/CombineData.R#L351-L356

and possibly something here, though I think this is doing something else/is just removing values of -99 or Inf from the data. https://github.com/rascully/Stream-Monitoring-Data-Exchange-Specifications/blob/4b595a31b0e8273e405b683e28309e2b68119a5e/DataIntegrationExample/R/CombineData.R#L511-L515

Action to take: Consider replacing the "NA" string/text values with a numeric-coded NA value, such as -9999 or some other value that will not cause the entire column to be read in as text instead of numbers. Whatever value is chosen will be defined in the metadata so users understand.

Notes: there are cases where it is ok to fill in "NA" over blank values, for string/text columns, it's just the numeric columns I think we need to be more judicious with filling in NAs.

edlabo commented 1 year ago

Something to think about when making this change, I think, is that replacing the "NA" values with a numeric -9999 (or whatever is chosen) will then result in the measurementValues column of the MeasurementOrFact table containing a bunch of nonvalid -9999 values which will clutter it up a bit.

think about whether we would potentially leave the -9999's in the MeasurementOrFact table or alternatively, add some type of filter to the code that generates the MeasurementOrFact table to filter out the -9999 from the file? not totally sure about how to go about this or what the best code method is, but something to consider.