mkoohafkan / cder

R Interface to the CDEC Web API
https://mkoohafkan.github.io/cder/

Parsing problem, and problem with problems() #9

Closed scantle closed 1 year ago

scantle commented 1 year ago

Very useful package. First issue encountered today: cdec_query() gave me a warning about a parsing issue:

> sgn <- cdec_query('SGN', 20, start.date="1990-10-01", end.date = "2022-09-30")
Warning message:
One or more parsing issues, see `problems()` for details 

I had some trouble following up on the warning, too:

> problems()
Error in problems() : could not find function "problems"

Loading readr before calling cdec_query() allowed me to see the "problem":

> problems()
# A tibble: 2 × 5
    row   col expected actual file                                                         
  <int> <int> <chr>    <chr>  <chr>                                                        
1  6237     7 a double BRT    C:/Users/lelan/AppData/Local/Temp/RtmpgFxiLf/file50d44d0b4d40
2 31687     7 a double BRT    C:/Users/lelan/AppData/Local/Temp/RtmpgFxiLf/file50d44d0b4d40

However, that doesn't tell me which data I'm missing.

  1. Perhaps the SGN gauge example can help you improve the parser
  2. Might need some sort of handler for when readr issues a warning
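
A rough sketch of what such a handler could look like, using base R's withCallingHandlers(). Here parse_values() is a hypothetical stand-in for whatever parsing step raises the warning; it is not a cder or readr function, and this is not how cder handles warnings.

```r
# Hypothetical stand-in for a parsing step that warns on non-numeric input.
parse_values <- function(x) as.numeric(x)

# Collect warning messages instead of letting them print and get lost.
caught <- character()

values <- withCallingHandlers(
  parse_values(c("2", "BRT", "2")),
  warning = function(w) {
    # Record the message, then silence the warning and carry on.
    caught <<- c(caught, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

values  # the non-numeric entry comes back as NA
caught  # the recorded warning text, available for later reporting
```

The handler leaves the parsed result intact and only records what went wrong, so the caller can decide afterwards whether to report it, write it to a file, or stop.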

Thanks!

mkoohafkan commented 1 year ago

Should be easy to:

  1. re-export readr::problems() so that you can review that output without loading readr
  2. write the results to a file on error so that you can manually review the issue
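
For reference, problems() can also be reached without attaching readr by namespacing the call. A minimal sketch, assuming readr 2.x is installed. The column names are made up for illustration (the rows are copied from this thread), and whether the object returned by cdec_query() still carries the parsing problems is an assumption that is not checked here.

```r
# Rows copied from this thread; the header names are hypothetical.
sample_csv <- I("station,dur,sensor,type,date,obs_date,value,flag,units
SGN,E,20,FLOW,20211009 0330,20211009 0330,2, ,CFS
SGN,E,20,FLOW,20211009 0345,20211009 0345,BRT, ,CFS
SGN,E,20,FLOW,20211009 0400,20211009 0400,2, ,CFS")

# Force the value column to double so that "BRT" registers as a parsing problem.
# suppressWarnings() only hides the console warning; the problems are still recorded.
parsed <- suppressWarnings(readr::read_csv(
  sample_csv,
  col_types = readr::cols(
    value = readr::col_double(),
    .default = readr::col_character()
  )
))

# No library(readr) needed: call problems() through the namespace.
readr::problems(parsed)
```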

Your specific problem is actually an issue with how whoever maintains the SGN station is writing output. Lines 6230:6240 are below, note that problems() signaled an issue with line 6237:

SGN,E,20,FLOW,20211009 0315,20211009 0315,2, ,CFS
SGN,E,20,FLOW,20211009 0330,20211009 0330,2, ,CFS
SGN,E,20,FLOW,20211009 0345,20211009 0345,BRT, ,CFS     # <-- problem
SGN,E,20,FLOW,20211009 0400,20211009 0400,2, ,CFS
SGN,E,20,FLOW,20211009 0415,20211009 0415,2, ,CFS
SGN,E,20,FLOW,20211009 0430,20211009 0430,2, ,CFS

The other line is the same issue. Lines 31685:31690 are below, note that "---" is recognized by the parser as missing data and is replaced with NA:

SGN,E,20,FLOW,20220702 0530,20220702 0530,6, ,CFS
SGN,E,20,FLOW,20220702 0545,20220702 0545,---, ,CFS
SGN,E,20,FLOW,20220702 0600,20220702 0600,BRT, ,CFS    # <-- problem
SGN,E,20,FLOW,20220702 0615,20220702 0615,6, ,CFS
SGN,E,20,FLOW,20220702 0630,20220702 0630,6, ,CFS
SGN,E,20,FLOW,20220702 0645,20220702 0645,6, ,CFS

As you can see, they wrote "BRT" to the numeric column, which obviously will cause issues for the parser (and dataframes in general). If you look at the query page you'll see their note at the bottom:

BRT and ART signify discharge at stage below or above available rating table

They should be putting those flags in the "data flag" field instead of the "value" field (at least, that's where I would put it). I don't think cder should try to handle this problem directly or try to guess what the data producer "meant" since these codes are station-dependent. I will look into readr::read_csv() a bit more and see if I can capture those bad lines in some way rather than dropping them altogether (which is what currently happens).
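
One possible way to keep those rows, sketched under assumptions: read every column as text, convert the value column afterwards, and park anything non-numeric in a separate column. The column names and the value_text column are hypothetical, and this is not necessarily what cder does.

```r
# Rows copied from this thread (no header line in the raw output).
lines <- I("SGN,E,20,FLOW,20220702 0530,20220702 0530,6, ,CFS
SGN,E,20,FLOW,20220702 0545,20220702 0545,---, ,CFS
SGN,E,20,FLOW,20220702 0600,20220702 0600,BRT, ,CFS
SGN,E,20,FLOW,20220702 0615,20220702 0615,6, ,CFS")

# Hypothetical column names, for illustration only.
col_names <- c("station", "dur", "sensor", "type", "date",
               "obs_date", "value", "flag", "units")

# Read everything as character so nothing is rejected by the parser;
# "---" is still treated as missing.
raw <- readr::read_csv(
  lines,
  col_names = col_names,
  col_types = readr::cols(.default = readr::col_character()),
  na = c("", "---")
)

# Convert to numeric; codes such as "BRT" become NA here.
value_num <- suppressWarnings(as.numeric(raw$value))

# Keep the original text wherever conversion failed on a non-missing entry.
raw$value_text <- ifelse(is.na(value_num) & !is.na(raw$value),
                         raw$value, NA_character_)
raw$value <- value_num

raw[, c("obs_date", "value", "value_text")]
```

This keeps the numeric column clean while leaving station-specific codes visible, without guessing what they mean.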

I also recommend reaching out to the SGN data managers and letting them know of the issue. Some data managers on CDEC are very responsive, others... less so.

scantle commented 1 year ago

FYI, I followed up with CDEC and they confirmed the various flags are intended to appear in the value field. From their FAQ:

Some of the data is missing, and in the place of a numerical value there is either "ART","BRT", or "--". What do they stand for? "ART" stands for Above Rating Table, "BRT" stands for Below Rating Table, and "--" stands for missing value...

Which, again, implies the value column is to be used for flags. I guess the DATA_FLAG column is ornamental? I agree it would be much easier (for those of us on this end) if the value column was purely numeric.

Your latest changes, however, have made it so my code works as intended. Thanks!