jdkloe / pybufr-ecmwf

a python module that allows reading and writing BUFR formatted files, where BUFR stands for Binary Universal Form for the Representation of meteorological data.
Other
29 stars 12 forks source link

Move message iterator to classes in bufr.py. #5

Closed cpaulik closed 8 years ago

cpaulik commented 8 years ago

This should address your points and integrates the message iteration interface nicely with the already existing classes.

Closes #3

jdkloe commented 8 years ago

thanks for your work on this, looks very nice. Maybe it would be usefull to add some code to actually check if a bufr message can be represented as 2D array, and allow use of the get_values_as_2d_array if this is the case? Then the GOME and GRAS file tests could be included as well. What do you think?

Oh, and just ignore the AppVeyor failed message. This has never worked, since I have not yet found a way to compile the base library on windows. Should be possible in theory though.

cpaulik commented 8 years ago

Yes, a real check would be good. I'll check this next week and integrate it into the get_values_as_2d_array function. My first idea would be to just try and reshape the array. If that fails we could still raise the error.

jdkloe commented 8 years ago

Just looping over the subsets and calling the get_subset_values() method from the BUFRReader class should give you a list of values for each subset. Checking if the length of this list is constant should not be too difficult. Appending them to one long list, converting to a numpy array, and reshaping them may also work, although if datatypes (numbers and text) are mixed in the 1D arrays, you will end up with a 2D numpy array with datatype string (I'm not sure if that would be very usefull). This is a very common situation for many bufr files.

jdkloe commented 8 years ago

small addition: you probably can just take the looping that is inside get_values_as_2d_array method in the BUFRReader class as example, and replace get_values() by a loop over get_value(). The text I mentioned in my previous comment only can occur if you switch on the autoget_cval option to self.bufr_obj.get_value. If that one is set to False any text in the bufr message will be indicated by an index number to the cvals array only, so maybe the reshape is an option after all. Ofcourse there may be a performance penaly if you do a double loop explicitely in python code, but I would say, first lets get this working, and then we can worry about optimisation if necessary.

cpaulik commented 8 years ago

I was now playing around a bit with the GRAS BUFR file and the test data file that you produced. I noticed that the GRAS file actually can not be represented with a 2D array even though it only has one subset. It can however be represented as a dictionary of arrays.

Is there a unique identifier of a field in a BUFR file? I'm asking because I noticed that the cnames are not guaranteed to be unique. For example in the file SXH58EUSR199812162225 the name MISSING PACKED COUNTER occurs three times. So it is not possible to return a dictionary with the names as keys.

jdkloe commented 8 years ago

Unfortunately, no there is no guarantee that names should be unique in a BUFR template. In fact, as soon as you get repeated blocks you will get doubles. This may happen if replication or delayed replication is used, or even if the same B or D descriptor is used multiple times in the same template. Maybe adding a simple counter for each unique name could be a solution? i.e. append some separator character ('@' or similar) and the the counter as a string to the name of the item to make it unique? Then they could be used as dictionary key. Ofcourse retrieving the items then will be more difficult for the user...

cpaulik commented 8 years ago

Hmm, that is unfortunate. Adding a character to make the name unique might be possible for the ERS files but for the GRAS file we get a (1,7210) array for the values and also a 7210 element list for the names. If there is no guarantee that unique names should be grouped then we can not decide anything at this point. I guess we must return the numpy arrays directly and can not use a dictionary since it seems like knowledge about the file is needed to use it further.

jdkloe commented 8 years ago

Yes, I agree some knowledge about the bufr file/template will be needed is many cases before the file can actually be used. So let's stick to a simple numpy array then.

cpaulik commented 8 years ago

The new draft is finished. The messages iterator now returns data, names, units as numpy.ndarray, list, list. If there is delayed replication it iterates over the subsets. So a user only has to check if the data array is 1D or 2D to use it further. In the test cases I only print the shape and len of the returned datasets to keep the expected output files easier to read. If you think that this is not a good enough test we can revert to printing the full data.

In a second commit I improved the performance of getting a 2D array. I think there is no need to loop over all the elements since we can directly reshape and slice the array that was read.

jdkloe commented 8 years ago

thanks, looks good to me, merging your PR. Reporting of the sizes in the unittest expected output is just fine I think. The important thing is that we'll notice that decoding fails in case future modifications break this interface.