get_next_msg() failure

jintaglee commented 7 years ago

Hello,

I am trying to read the Bufr messages in a Bufr file that contains IASI data. I call "bfr.get_next_msg()" repeatedly, but for some reason the call fails at the 3000th Bufr message with the following error message:

/home/548/jtl548/da/ops/bufr/bufrexplore/tst.py in ()
      4
      5 for i in range(bfr.num_msgs-1):
----> 6     bfr.get_next_msg()
      7     print bfr.msg_loaded

/projects/access/da/utilities/bufr/pybufr-ecmwf/install/lib/python2.7/site-packages/pybufr_ecmwf/bufr.pyc in get_next_msg(self)
    183         self.bufr_obj.setup_tables(self.table_b_to_use, self.table_c_to_use,
    184                                    self.table_d_to_use, self.tables_dir)
--> 185         self.bufr_obj.decode_data()
    186
    187         #nsub = int(self.bufr_obj.get_num_subsets())

/projects/access/da/utilities/bufr/pybufr-ecmwf/install/lib/python2.7/site-packages/pybufr_ecmwf/bufr_interface_ecmwf.pyc in decode_data(self)
   1023             lines = self.get_fortran_stdout()
   1024             self.display_fortran_stdout(lines)
-> 1025             raise e
   1026         else:
   1027             if self.verbose:

EcmwfBufrLibError: Sorry, call to bufrex failed, reported fortran error(s)

If you would like to have a look I have uploaded a copy of the Bufr file on our anonymous ftp server,

ftp server: ftp.bom.gov.au
user: anonymous
directory: /anon/home/cawcr/10day/jinlee/IASIG_1.bufr

Any help would be greatly appreciated.

Regards,

Jin

jdkloe commented 7 years ago

Hi Jin, thanks for your report. May I conclude from your message that everything works well up to bufr message 2999? This sounds very much like a memory leak. I just downloaded your sample file (feel free to delete it again) and will see whether there is something I can do for you. Jos

jdkloe commented 7 years ago

Hi Jin, I noticed that the bufr messages in this file use local B-table definitions like 055023. These are not present in the default WMO tables provided with the software. Can you provide me with a copy of these tables or a link where I can download them? Thanks, Jos

jintaglee commented 7 years ago

Hi Jos,

Thank you for looking into this.

The local B Table, "B0000000000074011001.TXT" is actually identical to the table, "B0000000000098013001.TXT". It looks like our IT person who looks after Bufr tables simply created a symlink so that B0000000000074011001.TXT -> B0000000000098013001.TXT. I have uploaded a copy of the B Table, "B0000000000074011001.TXT" (and corresponding D Table) to the ftp server for you to have a look at.

To answer your question about everything working well up to and including the message 2999: yes, and when get_next_msg() tries to get the message 3000 it fails.

About your initial thought that I could be experiencing a memory leak: I did the same test on a compute node with 30 GByte of memory and the test failed with the same error message.

Please let me know if I need to do any testing at my end.

Cheers,

Jin

jdkloe commented 7 years ago

Hi Jin,

Thanks for the information about the tables. It seems a rather awkward workaround to me, but it appears to work.

As for the bug you reported: the good news is that I can now reproduce it. The bad news is that I could also establish that the problem is not in the Python code but must be in the underlying Fortran library. Debugging this may take a significant amount of time.

Meanwhile, as a workaround, it is possible to move the reader forward so that it starts at, for example, message 1000. This way I can read message 3000, but now it crashes at message 4000. Reopening the bufr file several times in a Python loop with different start messages is no solution, however, since the Fortran code seems to keep some global variables in memory, and these cause the problem. Instead, you could create a small Python script that calls another Python script to read the file with varying message offsets. I think that would work. The key is that you need to create a new process for each new batch of bufr messages, so that the Fortran code is re-initialised.

To get the actual number of messages in the file and to set the offset you can use:

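from pybufr_ecmwf.bufr import BUFRReader

br = BUFRReader(bufrfile, warn_about_bufr_size=False)
n = br._rbf.get_num_bufr_msgs()
print('num. bufr. msgs = ', n)
# mark message 999 as the last one used, so reading resumes at message 1000
br._rbf.last_used_msg = 999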

Now this loop will start at message 1000:

for msg in br:
    pass  # process msg here
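
The one-process-per-batch workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not tested against the library: the batch size of 2500, the "--worker" flag, and the helper names batch_ranges, read_batch and count_messages are all hypothetical, and it assumes that counting the messages in the driver process (without decoding any) does not trigger the problem. The reader calls are the ones shown just above.

```python
import os
import subprocess
import sys

BATCH_SIZE = 2500  # assumption: any size safely below the ~3000-message failure point


def batch_ranges(num_msgs, batch_size=BATCH_SIZE):
    """Return 1-based, inclusive (first, last) message ranges covering num_msgs."""
    return [(first, min(first + batch_size - 1, num_msgs))
            for first in range(1, num_msgs + 1, batch_size)]


def read_batch(bufrfile, first, last):
    """Worker: read messages first..last inside this (fresh) process."""
    from pybufr_ecmwf.bufr import BUFRReader  # imported lazily, only when reading
    br = BUFRReader(bufrfile, warn_about_bufr_size=False)
    br._rbf.last_used_msg = first - 1  # iteration then starts at message `first`
    for msg_number, msg in enumerate(br, start=first):
        # hypothetical placeholder: process msg here
        if msg_number >= last:
            break


def count_messages(bufrfile):
    """Driver helper: count the messages without decoding any of them."""
    from pybufr_ecmwf.bufr import BUFRReader
    br = BUFRReader(bufrfile, warn_about_bufr_size=False)
    return br._rbf.get_num_bufr_msgs()


def main(argv):
    if argv[0] == "--worker":
        # Child process: argv is ["--worker", bufrfile, first, last].
        read_batch(argv[1], int(argv[2]), int(argv[3]))
        return
    bufrfile = argv[0]
    for first, last in batch_ranges(count_messages(bufrfile)):
        # One new Python process per batch, so the Fortran state starts clean.
        subprocess.check_call([sys.executable, os.path.abspath(__file__),
                               "--worker", bufrfile, str(first), str(last)])


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Saved under any name and run with the bufr file path as its only argument, the script re-invokes itself once per batch. Results would have to be passed back from the workers through files or stdout, since each batch lives in its own process.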

Of course I hope to find the root cause of the problem and will report back here when I do. Meanwhile I'll leave this issue open. Cheers, Jos

jdkloe commented 7 years ago

I found the bug. There was a Fortran global variable in a common block in the ECMWF library code that did not always get initialised before use, because the initialisation came after an if-statement. I coded a workaround and am testing it now. Hope to fix this soon for you.

jdkloe commented 7 years ago

Hi Jin, I just pushed commit #502 to github, and this should solve your problem. On my side it reads the IASI file with 10756 messages without problems now. Let me know if this works for you. Cheers, Jos

jintaglee commented 7 years ago

Hi Jos,

Thank you very much for that fix. I built and installed the newer version of pybufr_ecmwf (version 0.82) and tested my IASI bufr file. It looks like pybufr_ecmwf is now able to read beyond 3000 messages.

One thing I notice is that with this newer version there have been some changes from version 0.81 which is the version I have been working with so far. For example, the attribute 'bufr_obj' of the class, "BUFRReaderBUFRDC" seems to have been removed. Here's a screenshot demonstrating this,

bfr=BUFRReader('/g/data/dp9/jtl548/cylc-run/u-am948/share/cycle/20160529T0000Z/ukv_get_bufr/iasi/IASIG_1.bufr')
bfr.get_next_msg()
bfr.msg_index
1
bfr.bufr_obj.decode_sections_012()

AttributeError Traceback (most recent call last)

in ()
----> 1 bfr.bufr_obj.decode_sections_012()

AttributeError: BUFRReaderBUFRDC instance has no attribute 'bufr_obj'

I have written a module and a number of Python scripts based on pybufr_ecmwf and they probably need modifications. Your advice on how best to proceed from 0.81 to 0.82 would be much appreciated.

Cheers,

Jin

jdkloe commented 7 years ago

Hi Jin,

Thanks for confirming. Yes, there has been some reshuffling of code. The lower-level functions can still be used if you need them, but they are now reached like this:

bfr=BUFRReader('bufrfile')
bfr.get_next_msg()
bfr.msg._bufr_obj.decode_sections_012()

So below the msg object that you get by looping over the BUFRReader instance, there is a private object, _bufr_obj, that contains all the low-level stuff.
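
For scripts that have to run against both 0.81 and 0.82, a small helper along these lines might ease the transition. It is only a sketch: the function name low_level_bufr_obj is hypothetical, and it relies solely on the two attribute layouts mentioned in this thread (bfr.bufr_obj in 0.81, bfr.msg._bufr_obj in 0.82). It should be called after get_next_msg(), so that a message has been loaded.

```python
def low_level_bufr_obj(reader):
    """Return the low-level decoder object for either attribute layout.

    Assumes `reader` is a BUFRReader on which get_next_msg() was already called.
    """
    if hasattr(reader, "bufr_obj"):
        # layout reported for version 0.81
        return reader.bufr_obj
    # layout described for version 0.82: private object below the msg object
    return reader.msg._bufr_obj
```

With it, a call such as low_level_bufr_obj(bfr).decode_sections_012() would not depend on which of the two versions is installed. Since _bufr_obj is private, it may still move again in later releases.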

Of course, in the ideal case you should not need to handle low-level functions yourself. If you have suggestions, I would be happy to add options/methods to the main interface to make the things you need possible.

Best regards,

Jos

jintaglee commented 7 years ago

Hi Jos,

Thank you for explaining the changes. Much appreciated.

Cheers,

Jin

jdkloe / pybufr-ecmwf

get_next_msg() failure #12