harphub / harpIO

IO functions for HARP
https://harphub.github.io/harpIO/

read_forecast() fails to write sqlite file with UNIQUE constraint failure #74

Closed mpartio closed 3 years ago

mpartio commented 3 years ago

For some very strange reason read_forecast() is unable to write fctable SQLite files; it fails with "UNIQUE constraint failed".

A stripped-down example with just one MEPS member; the original task extracts data for all members. The latest harp code from GitHub is used.

vfldMEPS_prodmbr000202103220000.zip

$ cat vfld2sqlite.R 
library(harpIO)

read_forecast(
  start_date      = "202103220000",
  end_date        = "202103222100",
  fcst_model      = "MEPS_prod",
  parameter       = NULL,
  by              = "3h",
  lead_time       = 0,
  members         = 0,
  lags            = 0,
  file_path       = "/tmp/",
  file_template   = "vfld{eps_model}mbr{MBR3}{YYYY}{MM}{DD}{HH}{LDT2}",
  output_file_opts = sqlite_opts( path = "/tmp/" , template = "fctable" )
)
$ Rscript vfld2sqlite.R 
Reading /tmp//vfldMEPS_prodmbr000202103220000
Doing height correction for 2m temperature.
Opening connection to: /tmp//MEPS_prod/2021/03/FCTABLE_CCtot_202103_00.sqlite
Writing data
<Rcpp::exception: UNIQUE constraint failed: FC.fcdate, FC.leadtime, FC.SID>
[1] "FAILURE dbSendQuery 1 / 20"
[1] "UNIQUE constraint failed: FC.fcdate, FC.leadtime, FC.SID"
<Rcpp::exception: UNIQUE constraint failed: FC.fcdate, FC.leadtime, FC.SID>
[1] "FAILURE dbSendQuery 2 / 20"
[1] "UNIQUE constraint failed: FC.fcdate, FC.leadtime, FC.SID"
andrew-MET commented 3 years ago

I cannot reproduce the problem. This works for me:

read_forecast(
  2021032200, 
  2021032200, 
  "MEPS_prod", 
  NULL, 
  lead_time        = 0, 
  members          = 0, 
  file_path        = file.path(Sys.getenv("HOME"), "data/harp/vfld"), 
  file_template    = "vfld_eps", 
  output_file_opts = sqlite_opts(path = tempdir())
)

(Note: for file_path I'm just using my $HOME/data/harp/vfld directory, and the "vfld_eps" template is shorthand for "{file_path}/{fcst_model}/vfld{fcst_model}mbr{MBR3}{YYYY}{MM}{DD}{HH}{LDT2}".)

Do you get the same problem if you remove the sqlite file and start again? It could be that repeated attempts to get it working messed something up in the file.
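
For example, to remove the partially written file from the log above before rerunning (path copied from your output; adjust if yours differs):

unlink("/tmp/MEPS_prod/2021/03/FCTABLE_CCtot_202103_00.sqlite")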

To double check it worked I tested:

read_point_forecast(
  2021032200, 
  2021032200, 
  "MEPS_prod", 
  "EPS", 
  "T2m", 
  lead_time = 0, 
  members   = 0, 
  file_path = tempdir()
)

which also worked fine.

mpartio commented 3 years ago

Thank you for confirming that it works in your environment. I have a suspicion that this could be something MetCoOp-related, but the problem is driving me nuts.

I rebooted the server in question and now I can also run the example without problems. However, the same problem appears systematically if I try to extract from 3 members simultaneously, whereas extracting from 1 or 2 members works. Running the exact same script on another server works just fine (with the same source files).

I have to then dig deeper into the environment and see if something there is somehow restricting the write.

andrew-MET commented 3 years ago

You could check the data by putting a breakpoint in the code:

In write_fctable_to_sqlite.R, just before line 90, you could add:

browser() 

And rebuild the package with devtools::load_all()

Then when you run (interactively) you will enter the debugger and you can check if there are in fact duplicates in the data - which is what the error message implies.

library(dplyr)
data %>% group_by(!!!syms(primary_key)) %>% summarise(count = n()) %>% filter(count > 1)

This will give you any duplicated rows.

Q exits debug mode.

You could also try changing index_constraint to "none" in the call to db_clean_and_write() and see what happens.
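
As an additional check (my own suggestion, not something harp does for you), you could inspect the SQLite file directly from R with DBI/RSQLite and look for rows that already violate the intended key:

library(DBI)

# Path taken from the log output above - adjust as needed
con <- dbConnect(RSQLite::SQLite(), "/tmp/MEPS_prod/2021/03/FCTABLE_CCtot_202103_00.sqlite")

# Any rows sharing the same fcdate / leadtime / SID would break the unique index
dbGetQuery(
  con,
  "SELECT fcdate, leadtime, SID, COUNT(*) AS n
     FROM FC
    GROUP BY fcdate, leadtime, SID
   HAVING COUNT(*) > 1"
)

dbDisconnect(con)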

mpartio commented 3 years ago

Thanks for the debugging tips. I will try those.

I installed harp in a "vanilla" CentOS 7 container and I still get the same error :thinking:

mpartio commented 3 years ago

Alright, after some debugging I think I may have found the problem, and it's in neither the code nor the environment.

I traced the problem down to this SQL clause in sqlite_utils.R:

  SQL_add_index <- paste(
    "CREATE",
    index_constraint,
    "INDEX IF NOT EXISTS",
    index_name,
    "ON",
    index_spec
  )

Which translates to:

CREATE unique INDEX IF NOT EXISTS index_fcdate_leadtime_SID ON FC(fcdate,leadtime,SID)

So it seems the command fails because there are duplicate entries in the SQLite file. I commented out the index creation and took a look at the SQLite file that had no index constraint:

sqlite> .schema fc
CREATE TABLE FC(fcdate INT, leadtime INT, parameter TEXT, SID INT, lat DOUBLE, lon DOUBLE, model_elevation DOUBLE, units TEXT, validdate INT, MEPS_prod_mbr000 DOUBLE, MEPS_prod_mbr001 DOUBLE, MEPS_prod_mbr002 DOUBLE);
sqlite> select fcdate, leadtime, sid, count(*) from fc group by 1,2,3 having count(*) > 1 limit 1;
1616544000|0|1045|2
sqlite> select * from fc where fcdate=1616544000 and leadtime=0 and sid=1045;
1616544000|0|CCtot|1045|69.834|21.884|61.8|oktas|1616544000|8.0|8.0|
1616544000|0|CCtot|1045|69.834|21.884|61.7|oktas|1616544000|||8.0

So why does station 1045 have two entries: one row for members 0 & 1 and another row for member 2?

Tracing the R code in write_fctable_to_sqlite.R, I can see that the data table is "pivoted" so that the values of the "member" column become columns themselves; this is done with tidyr::spread. Adding a print statement before and after:

  data %>% dplyr::filter(SID == 1045) %>% print()

  data <- data %>%
    tidyr::spread(.data$member, .data$forecast)

  data %>% dplyr::filter(SID == 1045) %>% print()

Output:

# A tibble: 3 x 11
      fcdate leadtime parameter member             SID   lat   lon model_elevation forecast units  validdate
       <int>    <int> <chr>     <chr>            <int> <dbl> <dbl>           <dbl>    <dbl> <chr>      <int>
1 1616544000        0 CCtot     MEPS_prod_mbr000  1045  69.8  21.9            61.8        8 oktas 1616544000
2 1616544000        0 CCtot     MEPS_prod_mbr001  1045  69.8  21.9            61.8        8 oktas 1616544000
3 1616544000        0 CCtot     MEPS_prod_mbr002  1045  69.8  21.9            61.7        8 oktas 1616544000

< PIVOT HAPPENS HERE >

# A tibble: 2 x 12
      fcdate leadtime parameter   SID   lat   lon model_elevation units  validdate MEPS_prod_mbr000 MEPS_prod_mbr001 MEPS_prod_mbr002
       <int>    <int> <chr>     <int> <dbl> <dbl>           <dbl> <chr>      <int>            <dbl>            <dbl>            <dbl>
1 1616544000        0 CCtot      1045  69.8  21.9            61.8 oktas 1616544000                8                8               NA
2 1616544000        0 CCtot      1045  69.8  21.9            61.7 oktas 1616544000               NA               NA                8

So this just confirms what we see in the SQLite file.

I noticed that the model_elevation column has 61.8 for members 0 & 1 and 61.7 for member 2. The difference is already present in the vfld files:

$ grep "01045 " vfldMEPS_prodmbr00*202103240000 | awk '{print $1" "$2" "$3" "$4" "$5}'
vfldMEPS_prodmbr000202103240000: 01045 69.834 21.884 61.8
vfldMEPS_prodmbr001202103240000: 01045 69.834 21.884 61.8
vfldMEPS_prodmbr002202103240000: 01045 69.834 21.884 61.7

Andrew: could this tiny difference in model_elevation (61.8 vs 61.7) mean that spread() doesn't consider the rows equivalent and therefore doesn't merge them correctly?
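
For what it's worth, a minimal standalone reprex (toy values, hypothetical member names) suggests exactly that: spread() keeps rows separate whenever any non-pivoted column differs.

library(tibble)
library(tidyr)

# Toy data: identical key columns, but model_elevation differs for one member
toy <- tribble(
  ~SID, ~model_elevation, ~member,  ~forecast,
  1045,             61.8, "mbr000",         8,
  1045,             61.8, "mbr001",         8,
  1045,             61.7, "mbr002",         8
)

spread(toy, member, forecast)
# Two rows come back instead of one: each distinct model_elevation keeps its own
# row, with NA in the member columns belonging to the other elevation.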

andrew-MET commented 3 years ago

That's exactly what will be causing the problem. Different members are expected to have different elevations in multi-model ensembles, so we have a remove_model_elev switch in sqlite_opts() to leave the model elevation out of the SQLite output. In general it's probably not needed anyway, but it's included by default because it was requested.

I should probably add a more explicit check on model elevation so that it is clear when this is what breaks the unique constraint.
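
A rough sketch of what such a check might look like (a dplyr sketch only, not what's currently in harpIO; data is assumed to have the columns shown in the tibbles above):

library(dplyr)

# Flag station / date / lead time combinations where members disagree on elevation
elev_mismatch <- data %>%
  dplyr::group_by(fcdate, leadtime, SID) %>%
  dplyr::summarise(n_elev = dplyr::n_distinct(model_elevation), .groups = "drop") %>%
  dplyr::filter(n_elev > 1)

if (nrow(elev_mismatch) > 0) {
  warning(
    nrow(elev_mismatch), " fcdate/leadtime/SID combinations have differing ",
    "model_elevation between members - consider sqlite_opts(remove_model_elev = TRUE)."
  )
}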

How this ended up happening with MEPS data in vfld files is another question... are these single precision members perhaps?

mpartio commented 3 years ago

These are all double precision members. Member 2 is compiled and run on a different HPC, which could be relevant here. It looks to me like the change was introduced with cycle 43:

vfld-files for 23.3 00 cycle (cy40):

$ pwd
/nobackup/opdata/verif/ecfadm_vfldMEPS_prod/2021/03
$ grep "01045 " 23/vfldMEPS_prodmbr00{0,1,2}202103230000 | awk '{print $1" "$2" "$3" "$4" "$5}'
23/vfldMEPS_prodmbr000202103230000: 01045 69.834 21.884 63.0
23/vfldMEPS_prodmbr001202103230000: 01045 69.834 21.884 63.0
23/vfldMEPS_prodmbr002202103230000: 01045 69.834 21.884 63.0

vfld-files for 24.3 00 cycle (cy43):

$ grep "01045 " 24/vfldMEPS_prodmbr00{0,1,2}202103240000 | awk '{print $1" "$2" "$3" "$4" "$5}'
24/vfldMEPS_prodmbr000202103240000: 01045 69.834 21.884 61.8
24/vfldMEPS_prodmbr001202103240000: 01045 69.834 21.884 61.8
24/vfldMEPS_prodmbr002202103240000: 01045 69.834 21.884 61.7

@uandrae: any idea why the voima member has a different value for model elevation compared to the cirrus members?

uandrae commented 3 years ago

No, but will investigate.

andrew-MET commented 3 years ago

So, to prevent the problem from occurring, set:

output_file_opts = sqlite_opts(path = "/path/to/FCTABLE", remove_model_elev = TRUE)

This means that model_elevation won't be a column in the SQLite files. In general this shouldn't be a problem, as T2m height corrections have already been done.
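
Applied to the script from the top of this thread, that would be something like (all other arguments unchanged):

library(harpIO)

read_forecast(
  start_date       = "202103220000",
  end_date         = "202103222100",
  fcst_model       = "MEPS_prod",
  parameter        = NULL,
  by               = "3h",
  lead_time        = 0,
  members          = 0,
  lags             = 0,
  file_path        = "/tmp/",
  file_template    = "vfld{eps_model}mbr{MBR3}{YYYY}{MM}{DD}{HH}{LDT2}",
  output_file_opts = sqlite_opts(
    path              = "/tmp/",
    template          = "fctable",
    remove_model_elev = TRUE  # drop model_elevation from the SQLite output
  )
)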

I've opened #76 to more explicitly describe the issue and am closing this one as the rest is outside harp's scope.

uandrae commented 3 years ago

The reason for the difference is whether the input is in GRIB or FA. I haven't nailed down all the details yet, but will create a ticket within MetCoOp to start with.

mpartio commented 3 years ago

Using remove_model_elev = TRUE works nicely, thanks!