pypest / pyemu

python modules for model-independent uncertainty analyses, data-worth analyses, and interfacing with PEST(++)
BSD 3-Clause "New" or "Revised" License

Miss-alignment in tpl indices and original file #507

Open wkitlasten opened 1 month ago

wkitlasten commented 1 month ago

Hey,

Can you point me to a simple example that uses par_type="constant" and use_rows=[list-o-lines] within pf.add_parameters() so I can break it?! A bit below, but I will spare anyone else the pain of looking at the rest of my mess!

I am using this to parameterize all lines in some files and it seems to work mostly as expected:

pf.add_parameters(filenames=fname, par_type="grid",
              par_name_base=pargp, pargp=pargp,
              use_cols=[2],
              upper_bound=par_df.loc[t,'upper_bound'],
              lower_bound=par_df.loc[t,'lower_bound'],
              ult_ubound=par_df.loc[t,'ult_ubound'],
              ult_lbound=par_df.loc[t,'ult_lbound'],
              transform = par_df.loc[t, 'partrans'],
              par_style=par_style, index_cols=0,
              initial_value=initial_value, 
              mfile_skip=1)

The file formerly known as fname (.csv):

ifno,kind,value,i,j
1,inflow,4000000.0,44,0
1109,inflow,34843.7,45,0
1110,inflow,8283.5355,42,0
1135,inflow,22508.228,47,6
1161,inflow,7076.489500000001,46,17
...

If I leave mfile_skip=1 out it tries to parameterize the header (e.g., parnme = pname:sfr.inflow_inst:0_ptype:gr_usecol:2_pstyle:m_idx0:ifno), which is obviously wrong!

If I have mfile_skip=1 in the call ifno is shifted by -1 from the original model input file (e.g. parnme = pname:sfr.inflow_inst:1_ptype:gr_usecol:2_pstyle:m_idx0:0). If I leave mfile_skip out of the call and use index labels rather than locations for use_cols and index_cols I get a similar result (ifno shifted -1). I don't understand this, but maybe that is okay, it ain't the first time. That produces the following .tpl file.

,sidx,idx_strs,pargp2,parval1_2,2
0,"(0,)",idx0:0,sfr.inflow,1.0,~       p0       ~
1,"(1108,)",idx0:1108,sfr.inflow,1.0,~       p1       ~
2,"(1109,)",idx0:1109,sfr.inflow,1.0,~       p2       ~
...

I do something similar for other files, but with a subset of lines. It seems the use_rows arg requires positional indexing, hence I use positional for use_cols and index_cols too and require the mfile_skip=1 to avoid parameterizing the header:

use_rows = df[df['cp_layer'] == lay].index.tolist() # [0,38,40,42,44...]
pf.add_parameters(filenames=fname, par_type="constant",
                  par_name_base=f"{pargp}_{lay}", pargp=pargp,
                  use_cols=4,
                  upper_bound=par_df.loc[t,'upper_bound'],
                  lower_bound=par_df.loc[t,'lower_bound'],
                  ult_ubound=par_df.loc[t,'ult_ubound'],
                  ult_lbound=par_df.loc[t,'ult_lbound'],
                  transform=par_df.loc[t, 'partrans'],
                  par_style=par_style, index_cols=[0],
                  initial_value=initial_value, use_rows=use_rows,
                  mfile_skip=1)

That builds the following .tpl file from the original file, which is as expected.

ptf ~
,sidx,idx_strs,pargp4,parval1_4,4
0,"('123_0',)",idx0:123_0,cp_kh,1.0,~  pname:cp_kh_1_inst:0_ptype:cn_usecol:4_pstyle:m  ~
38,"('191_0',)",idx0:191_0,cp_kh,1.0,~  pname:cp_kh_1_inst:0_ptype:cn_usecol:4_pstyle:m  ~
40,"('191_2',)",idx0:191_2,cp_kh,1.0,~  pname:cp_kh_1_inst:0_ptype:cn_usecol:4_pstyle:m  ~
...

original file:
Name          x                                   y ...
123_0 1679938.8593249405 5413977.31979702 ...
123_1 1679938.8593249405 5413977.31979702 ...
180_0 1680281.4905325829 5401861.768442976...
180_1 1680281.4905325829 5401861.768442976 ...
...

Pest builds fine from pst_from with the above bits. But when I try to pyemu.helpers.apply_list_and_array_pars(arr_par_file='mult2model_info.csv') I get the following error related to the 2nd add_parameters call:

AssertionError: Probable miss-alignment in tpl indices and original file:
mult idx[:10] : [1230.0, 1910.0, 1912.0, 1914.0, 1916.0, 1918.0, 1920.0, 1922.0, 1924.0, 1926.0]
org file idx[:10]: ['123_0', '123_1', '180_0', '180_1', '180_10', '180_11', '180_12', '180_13', '180_14', '180_15']
n common: 0, n cols: 1, expected: 46.0.

From line 2213 in helpers.py

mlts
Out[4]: 
         Unnamed: 0         sidx     idx_strs pargp4  parval1_4    4
0                                                                   
1230.0            0   ('123_0',)   idx0:123_0  cp_kh        1.0  1.0
1910.0           38   ('191_0',)   idx0:191_0  cp_kh        1.0  1.0
1912.0           40   ('191_2',)   idx0:191_2  cp_kh        1.0  1.0
...

new_df
Out[5]: 
        oidx                   1                  2
0                                                  
123_0      0  1679938.8593249405   5413977.31979702 ...
123_1      1  1679938.8593249405   5413977.31979702 ...
180_0      2  1680281.4905325829  5401861.768442976 ...
180_1      3  1680281.4905325829  5401861.768442976 ...
 ...

So clearly the issue is in common_idx = (new_df.index.intersection(mlts.index).drop_duplicates()) but I can't track how mlts gets its index or how I can adjust my pf builds to ensure it lines up with "new_df".

Any suggestions would be helpful.

wkitlasten commented 1 month ago

Update, I'm even more confused, shocking I know. Suspecting the _ was somehow being replaced by a decimal in the mlts.index I changed the name, but I get the same error. But in this case they seem to be aligned?

AssertionError: Probable miss-alignment in tpl indices and original file:
mult idx[:10] : ['123n0', '123n1', '180n0', '180n1', '180n10', '180n11', '180n12', '180n13', '180n14', '180n15']
org file idx[:10]: ['123n0', '123n1', '180n0', '180n1', '180n10', '180n11', '180n12', '180n13', '180n14', '180n15']
n common: 576, n cols: 1, expected: 0.0.

Next step, made the index an int, same issue. I doubt I could misunderstand what is going on more! They look aligned to me. I think the only thing left is the sep?

AssertionError: Probable miss-alignment in tpl indices and original file:
mult idx[:10] : [1230000, 1230001, 1800000, 1800001, 1800002, 1800003, 1800004, 1800004, 1800006, 1800007]
org file idx[:10]: [1230000, 1230001, 1800000, 1800001, 1800002, 1800003, 1800004, 1800004, 1800006, 1800007]
n common: 518, n cols: 1, expected: 0.0.

wkitlasten commented 1 month ago

Got it to work without any use_rows arg and grid scale, not constant as intended.

briochh commented 1 month ago

Seems there are a few things here @wkitlasten.

-1 shifting: This is a consequence of the zero_based arg passed to the parent pstfrom object; it is the same as when you index by k,i,j. It should be safe for pyemu alignment at runtime -- you just need to remember to add the 1 back in any post processing. There could be an argument for supporting a modification of this arg per add_parameters() and add_observations() call -- but this is also not without its drawbacks, not least keeping track of which pars and obs are zero-based.

mfile_skip: This just tells pyemu how many initial lines of the model input file to skip when reading. If your headers are sane (same delimiter and same number of cols as the rest of the file) you shouldn't need it and you can pass index_cols='ifno' directly (you will still get the -1 adjustment to the integer index values, as above).

use_rows: There are some hints in the docs(!) (unexpected, I know!). If the object passed to use_rows has a single dimension (e.g. [0, 4, 7]) it is interpreted as positional (kinda like pandas iloc). If you want to match on index col values (as they are in the model input file) you can pass an object with 2 dimensions. This feels a bit weird for cases like yours where you only have one index_col, but for you it might be something like [(1,), (1109,), (1110,), ...] -- this should behave a little more like pandas .loc. Note: when passing this way the values are expected to be in the model base (so actual index col values in the model input file, prior to any -1 adjustment).

misalignment: This one looks like it might need a bit more digging -- if it is still not working after checking the other points above, feel free to send through that file and I can take a look.
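
For the "add the 1 back in any post processing" step, a minimal sketch might look like the following. It assumes the parameter names end in an integer idx0 entry, as in the sfr.inflow example earlier in the thread. The helper name model_index_from_parnme is hypothetical and is not part of pyemu.

```python
import re

# Hypothetical helper, not part of pyemu: recover the model-file index value
# from a parameter name that was built with zero-based (shifted by -1) indices.
_IDX0 = re.compile(r"idx0:(-?\d+)$")


def model_index_from_parnme(parnme, offset=1):
    """Return the integer after 'idx0:' shifted back by `offset`."""
    match = _IDX0.search(parnme)
    if match is None:
        raise ValueError(f"no integer idx0 entry in {parnme!r}")
    return int(match.group(1)) + offset


# Parameter name quoted earlier in this thread (ifno 1 was written as idx0:0).
parnme = "pname:sfr.inflow_inst:1_ptype:gr_usecol:2_pstyle:m_idx0:0"
print(model_index_from_parnme(parnme))
```

Names with non-integer index entries (e.g. idx0:123_0) raise a ValueError here rather than being shifted, since the -1 adjustment described above applies to integer index values.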

wkitlasten commented 1 month ago

Thanks for that. The main issue was the use_rows = [(1,)...] bit and inadvertently mixing positional with label indexes (which confounded the first two).

I read the hints in the docs but it just wouldn't penetrate my thick skull. Not that it makes any more sense than what is written, but perhaps a more explicit explanation could be added to the docstring? Something like:

"For use_rows with a single 'index_cols' use [('a',),('b',),('c',)] to set parameters for rows with model file index entries of a,b,c."
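
To make the two use_rows forms concrete, here is a small sketch that builds both from the inflow file shown at the top of the thread, using only the standard library. It does not call pyemu. How each form is interpreted is as described in the comment above, and counting positions from the first data row (header excluded) is an assumption based on the sidx values in the .tpl file earlier in the thread.

```python
import csv
import io

# First few lines of the model input file quoted earlier in this thread.
MODEL_FILE = """ifno,kind,value,i,j
1,inflow,4000000.0,44,0
1109,inflow,34843.7,45,0
1110,inflow,8283.5355,42,0
1135,inflow,22508.228,47,6
"""

rows = list(csv.DictReader(io.StringIO(MODEL_FILE)))

# Illustrative selection: rows whose j column is 0.
selected = [(pos, r) for pos, r in enumerate(rows) if int(r["j"]) == 0]

# 1-D form: positions of the data rows (iloc-like).
positional_use_rows = [pos for pos, _ in selected]

# 2-D form: one tuple per row, one entry per index column (loc-like),
# holding the values exactly as they appear in the model input file.
label_use_rows = [(int(r["ifno"]),) for _, r in selected]

print(positional_use_rows)
print(label_use_rows)
```

The point is to pick one form and stick with it: the 1-D list carries row positions, while the list of 1-tuples carries the ifno values themselves, so mixing the two selects different rows.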

wkitlasten commented 1 month ago

This issue arises from line 2099 of utils.helpers() when index names have _ in them. Apparently 123_1 is a "convenient" way to write 1231.0 (e.g., https://stackoverflow.com/questions/54009778/what-do-underscores-in-a-number-mean).

Probably a more explicit check for non-numeric parts of the index (that are not .) is in order?
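
A quick way to see the behaviour, and one possible shape for a stricter check, is sketched below. Python's float() accepts single underscores between digits (PEP 515), which is how '123_0' becomes 1230.0. The helpers looks_numeric and coerce_index are hypothetical and are not what helpers.py currently does.

```python
import re

# float() ignores single underscores between digits (PEP 515).
assert float("123_0") == 1230.0
assert float("191_2") == 1912.0

# Plain decimal numbers only: optional sign, digits, optional '.', optional exponent.
_PLAIN_NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def looks_numeric(label):
    """True only if the label is a plain number with no underscores or other text."""
    return _PLAIN_NUMBER.fullmatch(label.strip()) is not None


def coerce_index(labels):
    """Convert to float only when every label is plainly numeric; else keep strings."""
    if all(looks_numeric(s) for s in labels):
        return [float(s) for s in labels]
    return list(labels)


print(coerce_index(["123_0", "123_1", "180_0"]))  # stays as strings
print(coerce_index(["1", "1109", "1110"]))        # converted to floats
```

With a guard like this, labels such as 123_0 stay as strings on both sides of the comparison, so the intersection with the original file index would not come up empty for that reason.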