meteokid / python-rpn

Python RPN (RPNpy) is a collection of Python modules and scripts developed at RPN for scientific use.
GNU Lesser General Public License v2.1

Possibly useful utility functions? #15

Open neishm opened 5 years ago

neishm commented 5 years ago

I have some optimized routines from the fstd2nc tool which I've found helpful when scanning through many (hundreds of) files on a routine basis, where even a small per-file overhead adds up to a noticeable delay in the script. Some of them might be useful within python-rpn as utility functions. Below is a brief description of them.

all_params

The all_params method extracts all the record parameters at once, and returns a vectorized dictionary of the result. It scrapes the information directly out of the librmn data structures, and avoids the overhead of repeatedly calling fstprm. Example:

In [1]: import os, sys

In [2]: import rpnpy.librmn.all as rmn

In [3]: from fstd2nc.extra import all_params

In [4]: ATM_MODEL_DFILES = os.getenv('ATM_MODEL_DFILES').strip()

In [5]: fileId = rmn.fstopenall(ATM_MODEL_DFILES+'/bcmk', rmn.FST_RO)

In [6]: p = all_params(fileId)

In [7]: print p
{'deet': array([900, 900, 900, ..., 900, 900, 900], dtype=int32), 'typvar': array(['P ', 'P ', 'P ', ..., 'P ', 'P ', 'X '], dtype='|S2'), 'lng': array([3762, 3762, 3762, ..., 3762, 3762,   13], dtype=int32), 'ni': array([200, 200, 200, ..., 200, 200,   1], dtype=int32), 'nj': array([100, 100, 100, ..., 100, 100,   1], dtype=int32), 'nbits': array([12, 12, 12, ..., 12, 12, 32], dtype=int8), 'swa': array([   2335,    6097,    9859, ..., 4040669, 4044431, 4048193],
      dtype=uint32), 'datyp': array([1, 1, 1, ..., 1, 1, 5], dtype=uint8), 'xtra1': array([354514400, 354514400, 354514400, ..., 354525200, 354525200,
       354525200], dtype=uint32), 'xtra2': array([0, 0, 0, ..., 0, 0, 0], dtype=uint32), 'xtra3': array([0, 0, 0, ..., 0, 0, 0], dtype=uint32), 'ip2': array([ 0,  0,  0, ..., 12, 12,  0], dtype=int32), 'ip3': array([0, 0, 0, ..., 0, 0, 0], dtype=int32), 'ip1': array([       0, 97642568, 97738568, ..., 93423264, 93423264, 44140192],
      dtype=int32), 'key': array([      1,    1025,    2049, ..., 2145288, 2146312, 2147336],
      dtype=int32), 'ubc': array([0, 0, 0, ..., 0, 0, 0], dtype=uint16), 'npas': array([ 0,  0,  0, ..., 48, 48, 48], dtype=int32), 'nk': array([1, 1, 1, ..., 1, 1, 1], dtype=int32), 'ig4': array([0, 0, 0, ..., 0, 0, 0], dtype=int32), 'ig3': array([0, 0, 0, ..., 0, 0, 0], dtype=int32), 'ig2': array([   0,    0,    0, ...,    0,    0, 1600], dtype=int32), 'ig1': array([  0,   0,   0, ...,   0,   0, 800], dtype=int32), 'nomvar': array(['P0  ', 'TT  ', 'TT  ', ..., 'UU  ', 'VV  ', 'HY  '], dtype='|S4'), 'datev': array([354514400, 354514400, 354514400, ..., 354525200, 354525200,
       354525200], dtype=int32), 'dateo': array([354514400, 354514400, 354514400, ..., 354514400, 354514400,
       354514400], dtype=int32), 'etiket': array(['G133K80P    ', 'G133K80P    ', 'G133K80P    ', ...,
       'G133K80P    ', 'G133K80P    ', 'G133K80P    '], dtype='|S12'), 'grtyp': array(['G', 'G', 'G', ..., 'G', 'G', 'X'], dtype='|S1'), 'dltf': array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)}
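For reference, the same column-oriented layout could be built (much more slowly) by transposing a list of per-record dictionaries such as fstprm returns. A minimal sketch of that equivalence, using toy dicts in place of real fstprm output:

```python
import numpy as np

def params_to_columns(records):
    # Transpose a list of per-record dicts (one per fstprm call) into a
    # single dict of arrays -- the same layout all_params produces directly.
    keys = records[0].keys()
    return {k: np.array([r[k] for r in records]) for k in keys}

# Toy records standing in for fstprm output:
recs = [{'nomvar': 'P0  ', 'ni': 200}, {'nomvar': 'TT  ', 'ni': 200}]
cols = params_to_columns(recs)
```

The real all_params avoids even this loop by reading the parameters straight out of the librmn structures.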

This can be combined with pandas to get a convenient table of parameters:

In [8]: import pandas as pd

In [9]: p = pd.DataFrame(p)

In [10]: print p
          dateo      datev  datyp  deet  dltf  ...   typvar ubc      xtra1  xtra2  xtra3
0     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
1     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
2     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
3     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
4     354514400  354514400      1   900     0  ...       P    0  354514400      0      0
...         ...        ...    ...   ...   ...  ...      ...  ..        ...    ...    ...
2783  354514400  354525200      1   900     0  ...       P    0  354525200      0      0
2784  354514400  354525200      1   900     0  ...       P    0  354525200      0      0
2785  354514400  354525200      1   900     0  ...       P    0  354525200      0      0
2786  354514400  354525200      1   900     0  ...       P    0  354525200      0      0
2787  354514400  354525200      5   900     0  ...       X    0  354525200      0      0

[2788 rows x 28 columns]

Having the parameters in this pandas.DataFrame structure provides a more powerful tool for analysing the data. For instance, the pivot method can quickly organize the records into multidimensional time/level structures. Building the table this way is also about 20x faster than looping over fstprm:

In [11]: %timeit map(rmn.fstprm, rmn.fstinl(fileId))
10 loops, best of 3: 84.3 ms per loop

In [12]: %timeit pd.DataFrame(all_params(fileId))
100 loops, best of 3: 4.46 ms per loop

80ms isn't much, but it adds up if you're scanning over hundreds (or thousands) of files.

maybeFST

The maybeFST function is a more compact version of isFST which avoids any librmn calls such as c_wkoffit. That library function can incur some overhead since it's testing for many different formats, not just FST.
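The idea can be sketched with a cheap magic-bytes check. This is only an illustration, assuming the 'STDR' signature that XDF-based standard files carry at byte offset 12; the function name and exact check here are not necessarily what fstd2nc uses:

```python
def maybe_fst(filename):
    # Cheap candidate test: read only the first 16 bytes and look for the
    # 'STDR' signature (assumed to sit at offset 12 in standard FST files),
    # instead of calling c_wkoffit, which probes many file formats.
    with open(filename, 'rb') as f:
        magic = f.read(16)
    return len(magic) >= 16 and magic[12:16] == b'STDR'
```

A check like this can give false positives, so it is only a pre-filter: files that pass would still be opened through librmn as usual.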

Combined, maybeFST and all_params let a user very quickly scan over many hundreds of files and get a snapshot of all the records inside.

stamp2datetime

The stamp2datetime function converts an array of RPN date stamps into datetime objects. Useful in conjunction with all_params to get date information quickly, e.g.:

In [13]: from fstd2nc.mixins.dates import stamp2datetime

In [14]: print stamp2datetime(p['datev'])
['2009-04-27T00:00:00.000000000' '2009-04-27T00:00:00.000000000'
 '2009-04-27T00:00:00.000000000' ... '2009-04-27T12:00:00.000000000'
 '2009-04-27T12:00:00.000000000' '2009-04-27T12:00:00.000000000']

In [15]: p['date'] = stamp2datetime(p['datev'])

In [16]: print p
          dateo      datev  datyp  deet         ...              xtra1 xtra2 xtra3                date
0     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
1     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
2     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
3     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
4     354514400  354514400      1   900         ...          354514400     0     0 2009-04-27 00:00:00
...         ...        ...    ...   ...         ...                ...   ...   ...                 ...
2783  354514400  354525200      1   900         ...          354525200     0     0 2009-04-27 12:00:00
2784  354514400  354525200      1   900         ...          354525200     0     0 2009-04-27 12:00:00
2785  354514400  354525200      1   900         ...          354525200     0     0 2009-04-27 12:00:00
2786  354514400  354525200      1   900         ...          354525200     0     0 2009-04-27 12:00:00
2787  354514400  354525200      5   900         ...          354525200     0     0 2009-04-27 12:00:00

[2788 rows x 29 columns]
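Since hundreds of records typically share only a handful of distinct date stamps, a scalar decoder (such as rpnpy's RPNDate) can be vectorized cheaply by decoding each unique stamp once and broadcasting the results back. A general sketch of that pattern, with a toy decoder standing in for the real stamp conversion:

```python
import numpy as np

def vectorize_unique(values, scalar_decode):
    # Decode each distinct value once, then scatter the results back to
    # the original positions via the inverse index from np.unique.
    uniq, inv = np.unique(np.asarray(values), return_inverse=True)
    decoded = np.array([scalar_decode(v) for v in uniq])
    return decoded[inv]

# Toy decoder in place of a real stamp-to-datetime conversion:
stamps = [354514400, 354514400, 354525200, 354514400]
out = vectorize_unique(stamps, lambda s: s * 2)
```

With 2788 records but only two distinct datev values, the scalar decoder runs twice instead of 2788 times.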

decode_ip1

The decode_ip1 function quickly decodes the levels from an array of ip1 values. For example:

In [18]: import numpy as np

In [19]: from fstd2nc.mixins.vcoords import decode_ip1

In [20]: print decode_ip1(p['ip1'])
[array([(2, 0.)], dtype=[('kind', '<i4'), ('level', '<f4')])
 array([(5, 0.000125)], dtype=[('kind', '<i4'), ('level', '<f4')])
 array([(5, 0.000221)], dtype=[('kind', '<i4'), ('level', '<f4')]) ...
 array([(5, 1.)], dtype=[('kind', '<i4'), ('level', '<f4')])
 array([(5, 1.)], dtype=[('kind', '<i4'), ('level', '<f4')])
 array([(2, 0.1)], dtype=[('kind', '<i4'), ('level', '<f4')])]

In [21]: levels = np.concatenate(decode_ip1(p['ip1']))

In [22]: print levels
[(2, 0.00e+00) (5, 1.25e-04) (5, 2.21e-04) ... (5, 1.00e+00) (5, 1.00e+00)
 (2, 1.00e-01)]

In [23]: print levels.dtype
[('kind', '<i4'), ('level', '<f4')]

In [24]: p['level'] = levels['level']

In [25]: p['kind'] = levels['kind']

In [26]: print p
          dateo      datev  datyp  deet  dltf  ...  xtra2 xtra3                date     level  kind
0     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000000     2
1     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000125     5
2     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000221     5
3     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000382     5
4     354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000635     5
...         ...        ...    ...   ...   ...  ...    ...   ...                 ...       ...   ...
2783  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.995000     5
2784  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.995000     5
2785  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  1.000000     5
2786  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  1.000000     5
2787  354514400  354525200      5   900     0  ...      0     0 2009-04-27 12:00:00  0.100000     2

[2788 rows x 31 columns]
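For reference, new-style ip1 values can be unpacked with plain bit arithmetic: a 4-bit kind in bits 24-27, a 4-bit exponent in bits 20-23, and a 20-bit mantissa. The sketch below reproduces the sample decodings above; the 10**(4 - exp) scaling is inferred from those samples rather than taken from librmn, and old-style (small) ip1 values such as 0 would need a separate code path:

```python
import numpy as np

def decode_new_ip1(ip1):
    # Unpack new-style ip1: kind in bits 24-27, exponent in bits 20-23,
    # mantissa in bits 0-19.  The exponent offset is inferred from the
    # sample values in this thread, not from the convip source.
    ip1 = np.asarray(ip1, dtype=np.int64)
    kind = (ip1 >> 24) & 0xF
    exp = (ip1 >> 20) & 0xF
    mantissa = ip1 & 0xFFFFF
    level = mantissa * 10.0 ** (4 - exp)
    return kind, level
```

For production use, rmn.convertIp (or decode_ip1 itself) is the safe choice; the point of the bit-level version is that it vectorizes over the whole ip1 column at once.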

Example

For completeness, here's an example using the information from all the above steps to get multidimensional structures for a field. First, get the list of variables:

In [32]: print pd.unique(p.nomvar)
['P0  ' 'TT  ' 'ES  ' 'HU  ' 'MX  ' 'LA  ' 'LO  ' 'ME  ' 'WE  ' 'GZ  '
 'HR  ' 'WW  ' 'PN  ' 'TD  ' 'QC  ' 'FC  ' 'FQ  ' 'IO  ' 'OL  ' 'UE  '
 'EN  ' 'GL  ' 'LH  ' 'MF  ' 'MG  ' 'TM  ' 'VG  ' 'RC  ' 'AL  ' 'I0  '
 'I1  ' 'SD  ' 'I6  ' 'I7  ' 'I8  ' 'I9  ' 'AB  ' 'AG  ' 'AH  ' 'AI  '
 'AP  ' 'AR  ' 'AS  ' 'AU  ' 'AV  ' 'AW  ' 'EV  ' 'FS  ' 'FV  ' 'IC  '
 'IE  ' 'IH  ' 'IV  ' 'NF  ' 'SI  ' 'I2  ' 'I3  ' 'I4  ' 'I5  ' 'DN  '
 'NR  ' 'HF  ' 'FL  ' 'N0  ' 'O1  ' 'RT  ' 'RY  ' 'PR  ' 'PE  ' 'FR  '
 'RN  ' 'SN  ' 'PC  ' 'FB  ' 'N4  ' 'CX  ' 'FI  ' 'AD  ' 'FN  ' 'EI  '
 'H   ' 'J9  ' 'TG  ' 'Z0  ' 'NT  ' 'K6  ' 'K4  ' 'U9  ' 'AE  ' 'RZ  '
 'RR  ' 'U4  ' 'L7  ' 'L8  ' 'IY  ' 'UU  ' 'VV  ' 'HY  ' '>>  ' '^^  '
 'VF  ' 'GA  ' 'J1  ' 'J2  ' 'Y7  ' 'Y8  ' 'Y9  ' 'ZP  ' 'META' 'TS  '
 'TP  ' 'HS  ' 'LG  ' 'ICEL' 'EMIB']

Pick UU:

In [46]: p = p.loc[p.dltf==0]   # Ignore deleted records

In [47]: uu = p.loc[p.nomvar=='UU  ']  # Select UU records

In [48]: print uu
          dateo      datev  datyp  deet  dltf  ...  xtra2 xtra3                date     level  kind
922   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000125     5
924   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000221     5
926   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000382     5
928   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.000635     5
930   354514400  354514400      1   900     0  ...      0     0 2009-04-27 00:00:00  0.001010     5
...         ...        ...    ...   ...   ...  ...    ...   ...                 ...       ...   ...
2777  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.961000     5
2779  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.974000     5
2781  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.985000     5
2783  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  0.995000     5
2785  354514400  354525200      1   900     0  ...      0     0 2009-04-27 12:00:00  1.000000     5

[160 rows x 31 columns]

Organize the records by date/level:

In [49]: uu = uu.pivot(index='date',columns='level')  # Organize by date/level

In [51]: print uu['ip1']
level                0.000125  0.000221  0.000382    ...     0.985000  0.995000  1.000000
date                                                 ...                                 
2009-04-27 00:00:00  97642568  97738568  97899568    ...     95356840  95366840  93423264
2009-04-27 12:00:00  97642568  97738568  97899568    ...     95356840  95366840  93423264

[2 rows x 80 columns]

In [52]: print uu['ip2']
level                0.000125  0.000221  0.000382    ...     0.985000  0.995000  1.000000
date                                                 ...                                 
2009-04-27 00:00:00         0         0         0    ...            0         0         0
2009-04-27 12:00:00        12        12        12    ...           12        12        12

[2 rows x 80 columns]

In [53]: print uu['key']  # The keys/handles for these records
level                0.000125  0.000221  0.000382    ...     0.985000  0.995000  1.000000
date                                                 ...                                 
2009-04-27 00:00:00   1730561   1732609   1734657    ...      2150401   2152449   2154497
2009-04-27 12:00:00   1721352   1723400   1725448    ...      2141192   2143240   2145288

[2 rows x 80 columns]
meteokid commented 5 years ago

Would you mind making a pull request for this? Thanks!

meteokid commented 5 years ago

Pulled into rpnpy_2.1-fstfast-branch, will need some more time to review this one before merging into main dev branch.

I would consider splitting this one into 3 files...

- vectorize: could be used elsewhere and would probably be best by itself
- "hack-ish" functions, relying on librmn internal structure knowledge... these I would not import in "utils.all"
- other functions

In any case, thanks for these optimizations.

neishm commented 5 years ago

I agree some of these should not go into "utils.all". Splitting them sounds like a good idea.