massbays-tech / MassWateR

R package for working with Massachusetts surface water quality data
https://massbays-tech.github.io/MassWateR
Creative Commons Zero v1.0 Universal
11 stars 3 forks source link

Text times sent to WQX poorly #53

Closed ben-wetherill closed 1 year ago

ben-wetherill commented 1 year ago

If the Activity Start Time in the Results File is a text field instead of time format, the tabMWRwqx() function doesn't correctly pass times that are even hours. For example, if the time is entered as "09:00", tabMWRwqx() passes "09". This will not work in the WQX upload. When the time is in time format as 09:00 AM, it correctly passes as 09:00. Non-even hours are currently handled correctly ("09:24" is passed as 09:24).

While fixing this issue, we should also make sure that all times are passed to WQX in 24-hour format. If the Results file, has "09:00", it should be assumed to be 9:00 AM and we pass 09:00. If the results file has 09:00 PM, then we should pass 21:00.

fawda123 commented 1 year ago

@ben-wetherill this should be fixed now. Dealing with date and time formats importing directly from Excel is tough since Excel stores those values using really strange conventions. This is especially challenging because there is no "time" class in R, only "datetime". Either way, it should be resolved.

ben-wetherill commented 1 year ago

I now get this error if I format a time as text. Use attached file.

> resdat <- readMWRresults(respth)
Running checks on results data...

    Checking column names... OK
    Checking all required columns are present... OK
    Checking valid Activity Types... OK
    Checking Activity Start Date formats... OK
    Checking depth data present... OK
    Checking for non-numeric values in Activity Depth/Height Measure... OK
    Checking Activity Depth/Height Unit... OK
    Checking Activity Relative Depth Name formats... OK
    Checking values in Activity Depth/Height Measure > 1 m / 3.3 ft... OK
    Checking Characteristic Name formats... OK
    Checking Result Values... OK
    Checking QC Reference Values... OK
    Checking for missing entries for Result Unit... OK
    Checking if more than one unit per Characteristic Name... OK
    Checking acceptable units for each entry in Characteristic Name... OK

All checks passed!
Error in `dplyr::mutate()`:
ℹ In argument: `hr = sprintf("%02d", hr)`.
Caused by error in `sprintf()`:
! invalid format '%02d'; use format %f, %e, %g or %a for numeric objects
Run `rlang::last_trace()` to see where the error occurred.

Sample_Results_noQCRef.xlsx

fawda123 commented 1 year ago

Ah shoot, dealing with mixed formats in the same Excel column is a pain. The times are read in as proportions from 0 - 1, except the single text entry which is read as is. I think I got it fixed though. I added some more tests to verify all these strange formats are handled correctly.

library(MassWateR)
library(testthat)

respth <- system.file('extdata/ExampleResults.xlsx', package = 'MassWateR')
resdat <- suppressWarnings(readxl::read_excel(respth, na = c('NA', 'na', ''), guess_max = Inf)) %>% 
  dplyr::mutate_if(function(x) !lubridate::is.POSIXct(x), as.character)

reschk <- resdat[1:10,]
reschk$`Activity Start Time` <- c(0.1, 0.5, '08:23:00', '16:22:00', '21:00:00PM', '09:15:00PM', '9:21', '07:56AM', '07:56 PM', '07:56PM')
frmchk <- formMWRresults(reschk)
result <- frmchk$`Activity Start Time`
expected <- c("02:24", "12:00", "08:23", "16:22", "21:00", "21:15", "09:21", "07:56", "19:56", "19:56")
expect_equal(result, expected)

In the above example, the entries here: c(0.1, 0.5, '08:23:00', '16:22:00', '21:00:00PM', '09:15:00PM', '9:21', '07:56AM', '07:56 PM', '07:56PM')

Should convert to these entries: c("02:24", "12:00", "08:23", "16:22", "21:00", "21:15", "09:21", "07:56", "19:56", "19:56")

ben-wetherill commented 1 year ago

I'm running into other issues now (see below).
After looking at this more, I'm starting to wonder if maybe we shouldn't try to handle text. I noticed that the Activity Start Date field does not accept text at all. It returns an error in the readMWRresults() function, and in our instructions we specifically say it should be formatted as Date in Excel. We also have the same note in our instructions for Activity Start Time that it should be formatted as Time in Excel. I'm leaning toward saying we should return an error if times are formatted as text. What do you think?

Here are the issues I found:

  1. The date output has now changed. It used to be "2022-09-11", now it is "2022-09-11 00:00:00 UTC". It still works in WQX, but this is confusing to show the time of 00:00:00 when the time is actually in another column. Can you put it back to just the date?
  2. 24-hour dates in Date format are not correctly captured. Example: 18:50 in date format becomes 06:50 in WQX with no PM indication. See row 56 in the attached file. This needs to be fixed.
  3. In date format, 8:00 became 07:60. See row 501. This needs to be fixed.
  4. In text format, "09:00" became 08:35. See row 58.
  5. In text format, "07:25:15: became 06:15. See row 60.
  6. In text format, "10:14 AM" became 10:19. See row 61.
  7. In text format, "18:45" became 07:45. See row 63.

Sample_Results_times.xlsx

fawda123 commented 1 year ago

@ben-wetherill sorry about that, I think these remaining issues are fixed now.

  1. This is fixed, not sure why that was happening. It doesn't do this for the example file in the package.
  2. I don't see this error on my end, all of the 24 hour dates are correct. See the attached wqx output.
  3. Fixed, although again I'm not sure why this happened for this particular record and not others.
    4 - 7: Are you sure you're reading the row numbers correctly? If you're looking at resdat with View(), the row indices start at one for the first row and not the header like in Excel, e.g., row 58 in Excel is 57 in R, etc. These times are correct on my end. The wqx output is attached.

wqxtab.xlsx

ben-wetherill commented 1 year ago

Marcus, I'm sorry about that. I was forgetting to update my fset when I was testing, so when I made changes to the results file they were not being evaluated. Most of these issues were invalid. All of them look good now.

However, there is a new issue. All of the records that do not have a time are returning "NA:NA" in the WQX output file. Don't forget about the logic for QA times. They should have a fictitious time that is unique for each set of duplicate/references, starting with "00:01".

For your records, attached is the current WQX Output Design document. Note that I made one correction in the document. In the Logic worksheet, in the Activity ID logic, I had forgotten to update the formula after we added the Record ID User Supplied field. It is now updated. I think your formula is working correctly, but this is just for records. WQX_Output_Design_7-9-23.xlsx

And remember, I'm fine if we decide to just not accept text times.

fawda123 commented 1 year ago

Okay, the NA:NA entries are fixed now. Apologies as that was introduced with the new time handling for text inputs. I'd rather handle text inputs appropriately as opposed to returning an error. This shouldn't be an issue if users are following the input guide, but it does make the inputs more flexible if text times are included.

ben-wetherill commented 1 year ago

This looks good now. Thank you.