jdkloe / pybufr-ecmwf

a python module that allows reading and writing BUFR formatted files, where BUFR stands for Binary Universal Form for the Representation of meteorological data.
Other
29 stars 12 forks source link

Add high level BUFRReader class. #4

Closed cpaulik closed 8 years ago

cpaulik commented 8 years ago

This class is a read only interface that offers iteration over BUFR messages and returns both data and units as dictionaries. It guesses the size of the decoded message which can lead to memory problems if e.g. no BUFR Table is found for a file. So it might be smart to enforce a upper limit at which the guessing stops.

I'm not sure if a dictionary is the best interface. Some people might prefer just a 2D array and use numeric indices to get the data. A dictionary on the other hand leads to code that can be easily understood but might be overly long if the variable names in the BUFR files are very long.

cpaulik commented 8 years ago

I'll work on the Python 3 errors when it is decided if the dict interface is a good one.

jdkloe commented 8 years ago

thanks again for adding this new interface. I think it is important to make the module more convenient to use, so I will accept your PR. I would appreciate it if you could also fix the python3 compatibility. Also thanks for updating the unittests.

However, I have at least 2 issues with your code that should be addressed in a later update. 1) your code introduces a new place where the low level bufrex routine is called. I would prefer to keep this in a single location, to make sure any workarounds needed to get it working will benefit all code that uses it. So I would prefer if the actual decoding is replaced by calling the BUFRInterfaceECMWF class, in pybufr_ecmwf/bufr_interface_ecmwf.py or to calling the BUFRReader reader class in pybufr_ecmwf/bufr.py (I can do that myself in a later update if you don't mind). As you can see in pybufr_ecmwf/bufr.py I also made a small start to use the new eccodes library for decoding, so if you would use pybufr_ecmwf/bufr.py then this new BUFRReader class will be able to use that one as well. 2) you convert the retrieved data to a 2D array. Surely this is convenient, but it is not a generic solutuion. Many BUFR messages/files in the wild use delayed replication, and that makes the retrieved data irregular. To not confuse users, this should be made very clear I think in the documentation in this new class.

cpaulik commented 8 years ago

Thank you for your pointers.

1) I'll gladly use the centralized interface. 2) Do you have any BUFR file with an example of the delayed replication? I have no experience with it.

Just a a side note. There is no reason to merge unfinished code into the master branch before it is done. I can add additional commits to any PR if there are things that have to be fixed. But I'll open a new one for the outstanding changes.

jdkloe commented 8 years ago

I tried to find an example bufr file in my collection, but did not find any with this property. Seems to me there may be some (unwritten?) convention against using this option.... Anyway, the file format standard defines it, and the ECMWF bufrdc library implements it, so I think we should be prepared to handle such files. So I created a little test/example script to generate a bufr file that uses delayed replication, and pushed it to github. See: example_programs/delayed_replication_example.py If you run it manually it will generate a little bufr file (and some matching local bufr tables) with delayed replication, and with different replication counts for each subset.

cpaulik commented 8 years ago

@jdkloe Thanks. In the changes in #5 I actually had to change the test for the iteration to the included file from ERS since the other two complained about using delayed replication when calling get_values_as_2d_array. At the moment I've documented that the iteration over messages only works if it is possible to represent a message as a 2D array. Otherwise the error is raised.