nawendt / gribr

GRIB interface for R using ECMWF ecCodes
BSD 3-Clause "New" or "Revised" License
24 stars 0 forks source link

Problem with non unique filename in loop #16

Closed smu closed 3 years ago

smu commented 3 years ago

Thank you for this valuable R package!

I found a possible bug related to the usage of the non unique filenames.

Please consider the following example script:

library(gribr)
library(curl)
library(R.utils)

NOTUNIQUEFILENAME <- 'NOTUNIQUEFILENAME.grb'
if(file.exists(NOTUNIQUEFILENAME)){
    unlink(NOTUNIQUEFILENAME)
}

# download test data
dl_testdata <- function(){
    # download the 00 UTC for today from opendata.dwd.de
    datestr = format(Sys.time(), "%Y%m%d")

    url.ta <- paste0('https://opendata.dwd.de/weather/nwp/icon/grib/00/t_2m/icon_global_icosahedral_single-level_',datestr,'00_000_T_2M.grib2.bz2')
    filename.ta <- 'T_2M.grb.bz2'
    curl_fetch_disk(url.ta, filename.ta)
    bunzip2(filename.ta)

    url.td <- paste0('https://opendata.dwd.de/weather/nwp/icon/grib/00/td_2m/icon_global_icosahedral_single-level_',datestr,'00_000_TD_2M.grib2.bz2')
    filename.td <- 'TD_2M.grb.bz2'
    curl_fetch_disk(url.td, filename.td)
    bunzip2(filename.td)
}

# load forecast
get_fc <- function(varname){

    filename = paste0(varname,'.grb')
    print(filename)
    file.copy(filename, NOTUNIQUEFILENAME)
    gribHandle <- grib_open(NOTUNIQUEFILENAME)
    print(gribHandle)
    print(grib_list(gribHandle))
    #print(paste('varname', varname))
    gribRecord <- grib_select(gribHandle, list(shortName = varname))
    grib_close(gribHandle)
    #print(gribHandle)
    unlink(filename)
    unlink(NOTUNIQUEFILENAME)

    return()

}

dl_testdata()

for(var in c('T_2M','TD_2M')){
    get_fc(var)
}

The first call of get_fc works fine, in the second call, the following error is thrown:

Error in grib_select(gribHandle, list(shortName = varname)) : 
  gribr: no messages matched
Calls: get_fc -> grib_select
Execution halted

It was hard to understand this error, since grib_list(gribHandle) shows, that the variable TD_2M is found in the grib file. A colleague, however, found, that the problem may be related to grib_select.

Therefore, we looked at the system calls using trace -e trace=open,close,read Rscript gribr-bug_example.R

During the first iteration, the call of grib_select opens two new file decriptors on the grib file.

open(".../NOTUNIQUEFILENAME.grb", O_RDONLY) = 10
open("/.../NOTUNIQUEFILENAME.grb", O_RDONLY) = 11

These file descriptors are not closed lateron (e.g. by grib_close). In the second iteration, another file descriptor is opened. Some ot the data, however, is still read from the file descriptor opened during the first iteration and which is still pointing to the grib file opened during the first iteration.

open("/.../NOTUNIQUEFILENAME.grb", O_RDONLY) = 12
read(11, "GRIB\377\377\0\2\0\0\0\0\0Z\0\251\0\0\0\25\1\0N\0\377\23\1\1\7\345\1\10"..., 8192) = 8192
read(11, "z:\27:\35:\377:E:\237:\2009\2539\0369c9M9\221:(9\3319\366:\n9"..., 5890048) = 5890048
read(11, "\31\224\34\224\25\224\253\224\222\224\245\224\262\224\245\224a\224\210\224}\224\251\224\305\224\270\224\251\224g

Maybe a simple f_close is missing in src/grib_select.c?

nawendt commented 3 years ago

Appreciate the report. My quick glance at things makes me think I am doing an unnecessary fopen in grib_select.c. I will be looking into it more just to be sure that my suspicions are correct.

nawendt commented 3 years ago

After looking into it, I do think I make an unnecessary fopen call. For other functions, I simply can pass the file pointer and use that. For grib_select, the eccodes function codes_index_add_file expects a file path string and not just a pointer to a file. The fopen call that is there was due to initializing a codes_handle which does need a file pointer. I did fail to do an fclose for that. The bottom line is that I can totally eliminate the fopen call by just passing the file pointer along with the file path.

After I made that change, I then used your test script to see if anything changed. I also got the same error you did still. However, I do not think the error is related to the file pointer issue that this uncovered. When I looked at the output for grib_list I noticed that the shortName that it output was '2t' and '2d', respectively. When I modified the script to use the correct shortName based on the file name then it functioned as you would expect. Please check and see if you are using the right shortName in your full script. I do need to fix the file descriptor leak anyway, but if you continue to have issues I will have to look elsewhere for what could be happening.

smu commented 3 years ago

Thanks for looking into this bug.

Unfortunatly, I can not follow your comment on the shortName. In my case the shortNames printed by grib_list are T_2M and TD_2M (see full STDOUT output below).

[1] "T_2M.grb"
GRIB FILE:  .../NOTUNIQUEFILENAME.grb 
Status: open
isMultiMessage: FALSE 
  edition centre     date dataType          gridType stepRange
1       2   edzw 20210113       fc unstructured_grid         0
        typeOfLevel level shortName packingType
1 heightAboveGround     2      T_2M grid_simple
.../NOTUNIQUEFILENAME.grb closed
[1] "TD_2M.grb"
GRIB FILE:  .../NOTUNIQUEFILENAME.grb 
Status: open
isMultiMessage: FALSE 
  edition centre     date dataType          gridType stepRange
1       2   edzw 20210113       fc unstructured_grid         0
        typeOfLevel level shortName packingType
1 heightAboveGround     2     TD_2M grid_simple
Error in grib_select(gribHandle, list(shortName = varname)) : 
  gribr: no messages matched
Calls: get_fc -> grib_select
Execution halted

Do you get a different output on your system?

nawendt commented 3 years ago

Yes. I get '2t' and '2d'. In the back of my mind I wondered if we had different grib definition files where yours were locally modified to output that shortName. Perhaps that explains the difference? All the other outputs are the same, however. What confuses me is that I can create a new conda environment with the version of gribr without the file descriptor fix and the test script you gave me works for both files as long as I use the '2d' and '2t' shortName that my system outputs. I would think that it should fail the same way yours does. I think what I will do is push out the file descriptor fix and see if that ends up fixing your issue. Out of curiosity, what gribr and eecodes versions do you have and what OS are you running on?

smu commented 3 years ago

I use gribr version v1.2.2 with eccodes version 2.14.1 on a high-performance computing system (Red Hat Enterprise Linux ComputeNode release 7.7) . gribr was not compiled by myself (and so far I fail to compile the source), but is loaded via the modules system.

nawendt commented 3 years ago

Thanks. Luckily, conda-forge still has eccodes 2.14.1 available. My original tests were done with 2.19.1, which is currently the latest version. Tests with the same version you have did produce the same error where the second grib_select failed even though I made sure the shortName was correct. I looked into the eccodes changelog since 2.14.1 and noticed that they fixed a grib_index_delete issue where files were not actually being closed (see issue ECC-440 here). That sure sounds like the exact issue you are having. I will still have my file descriptor leak fix out today, but I think updating to at least eccodes 2.19.0 will solve your problem. With that in mind, I think I am likely to start requiring at least 2.19.0 for this package as that is a pretty significant problem for processing multiple files with grib_select.

smu commented 3 years ago

Thanks a lot, that will most likely solve this issue. Our sysadmin will recompile the package next week, but I think this bug can be marked as solved.