simpeg / simpeg

Simulation and Parameter Estimation in Geophysics - A python package for simulation and gradient based parameter estimation in the context of geophysical applications.
http://simpeg.xyz
MIT License

BUG: Ordering clusters in Gaussian mixtures leads to unexpected behaviours #1222

Open santisoler opened 1 year ago

santisoler commented 1 year ago

Describe the issue:

The Gaussian mixture classes (WeightedGaussianMixture, GaussianMixtureWithPrior) have methods that reorder the clusters, such as order_clusters_GM_weights, which orders them based on their weights, or order_clusters, which orders them as in the gmmref. These methods are called by constructors or by other methods, such as update_gmm_with_priors.

This reordering of the clusters introduces unwanted behaviours, especially in other objects that rely on the order in which the original clusters were specified. One example is the fixed_membership argument of the PGI_UpdateParameters class (see #1221). Another is when we let the GMM parameters be updated by the PGI_UpdateParameters directive. This class allows us to specify confidence values for the means, covariances and proportions (kappa, nu and zeta, respectively), which should also be arrays. The shape and order of these arrays should be consistent with the original arrays for the means, covariances and proportions.

If we have chosen to update the GMM parameters after each iteration, a new GMM is initialized by PGI_UpdateParameters, whose clusters will be reordered, making it inconsistent with the order of the original GMM and therefore with the order of these confidences.

Is there a special need for reordering the clusters? If not, would it be better to remove these methods?
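
To make that concrete, here is a minimal sketch (my own illustration, not SimPEG's implementation) of what a weight-based reorder like order_clusters_GM_weights amounts to, using the values from the example below:

import numpy as np

# Illustrative sketch (not SimPEG's code): reorder clusters by decreasing
# weight, permuting all per-cluster arrays in lockstep.
weights = np.r_[0.025, 0.9, 0.075]  # HK, BCKGRD, PK
means = np.array([[0.0, 0.1], [0.0, 0.0], [-1.0, 0.0]])
order = np.argsort(weights)[::-1]  # -> [1, 2, 0]
weights, means = weights[order], means[order]  # now BCKGRD, PK, HK
# Any external array kept in the original cluster order (e.g. the kappa
# confidences) is now misaligned with the reordered clusters.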

Reproducible code example:

from discretize import TreeMesh
from discretize.utils import active_from_xyz
import numpy as np
import SimPEG.potential_fields as pf
from SimPEG import (
    data_misfit,
    directives,
    inverse_problem,
    inversion,
    maps,
    optimization,
    regularization,
    utils,
)
from SimPEG.utils import io_utils

# Reproducible science
np.random.seed(518936)

# Load Mesh
mesh_file = io_utils.download(
    "https://storage.googleapis.com/simpeg/pgi_tutorial_assets/mesh_tutorial.ubc"
)
mesh = TreeMesh.read_UBC(mesh_file)

# Load True geological model for comparison with inversion result
true_geology_file = io_utils.download(
    "https://storage.googleapis.com/simpeg/pgi_tutorial_assets/geology_true.mod"
)
true_geology = mesh.read_model_UBC(true_geology_file)

# Load geophysical data
data_grav_file = io_utils.download(
    "https://storage.googleapis.com/simpeg/pgi_tutorial_assets/gravity_data.obs"
)
data_grav = io_utils.read_grav3d_ubc(data_grav_file)
data_mag_file = io_utils.download(
    "https://storage.googleapis.com/simpeg/pgi_tutorial_assets/magnetic_data.obs"
)
data_mag = io_utils.read_mag3d_ubc(data_mag_file)

# Load Topo
topo_file = io_utils.download(
    "https://storage.googleapis.com/simpeg/pgi_tutorial_assets/CDED_Lake_warp.xyz"
)
topo = np.genfromtxt(topo_file, skip_header=1)
# find the active cells
actv = active_from_xyz(mesh, topo, "CC")
# Create active map to go from reduced set to full
ndv = np.nan
actvMap = maps.InjectActiveCells(mesh, actv, ndv)
nactv = int(actv.sum())

# Create simulations and data misfits
# Wires mapping
wires = maps.Wires(("den", actvMap.nP), ("sus", actvMap.nP))
gravmap = actvMap * wires.den
magmap = actvMap * wires.sus
idenMap = maps.IdentityMap(nP=nactv)

# Grav problem
simulation_grav = pf.gravity.simulation.Simulation3DIntegral(
    survey=data_grav.survey,
    mesh=mesh,
    rhoMap=wires.den,
    ind_active=actv,
)
dmis_grav = data_misfit.L2DataMisfit(data=data_grav, simulation=simulation_grav)

# Mag problem
simulation_mag = pf.magnetics.simulation.Simulation3DIntegral(
    survey=data_mag.survey,
    mesh=mesh,
    chiMap=wires.sus,
    ind_active=actv,
)
dmis_mag = data_misfit.L2DataMisfit(data=data_mag, simulation=simulation_mag)

#########################################################################
# Create a joint Data Misfit
#

# Joint data misfit
dmis = 0.5 * dmis_grav + 0.5 * dmis_mag

# initial model
m0 = np.r_[-1e-4 * np.ones(actvMap.nP), 1e-4 * np.ones(actvMap.nP)]

gmmref = utils.WeightedGaussianMixture(
    n_components=3,  # number of rock units: bckgrd, PK, HK
    mesh=mesh,  # inversion mesh
    actv=actv,  # actv cells
    covariance_type="diag",  # diagonal covariances
)
gmmref.fit(np.random.randn(nactv, 2))
gmmref.means_ = np.c_[
    [0, 0.1],  # HK
    [0.0, 0.0],  # BCKGRD density contrast and mag. susc
    [-1, 0.0],  # PK
].T
gmmref.covariances_ = np.array(
    [
        [2.4e-03, 1.5e-06],
        [6e-04, 3.175e-07],
        [2.4e-03, 1.5e-06],
    ]
)
# important after setting cov. manually: compute precision matrices and cholesky
gmmref.compute_clusters_precisions()
# set global proportions; low-impact as long as not 0 or 1 (total=1)
gmmref.weights_ = np.r_[0.025, 0.9, 0.075]

# Create PGI regularization
# Sensitivity weighting
wr_grav = np.sum(simulation_grav.G**2.0, axis=0) ** 0.5 / (mesh.cell_volumes[actv])
wr_grav = wr_grav / np.max(wr_grav)

wr_mag = np.sum(simulation_mag.G**2.0, axis=0) ** 0.5 / (mesh.cell_volumes[actv])
wr_mag = wr_mag / np.max(wr_mag)

# create joint PGI regularization with smoothness
reg = regularization.PGI(
    gmmref=gmmref,
    mesh=mesh,
    wiresmap=wires,
    maplist=[idenMap, idenMap],
    active_cells=actv,
    alpha_pgi=1.0,
    alpha_x=1.0,
    alpha_y=1.0,
    alpha_z=1.0,
    reference_model=utils.mkvc(
        gmmref.means_[gmmref.predict(m0.reshape(actvMap.nP, -1))]
    ),
    weights_list=[wr_grav, wr_mag],  # weights each phys. prop. by correct sensW
)

# Directives
# Add directives to the inversion
# ratio to use for each phys prop. smoothness in each direction:
# roughly the ratio of range of each phys. prop.
alpha0_ratio = np.r_[
    1e-2 * np.ones(len(reg.objfcts[1].objfcts[1:])),
    1e-2 * 100.0 * np.ones(len(reg.objfcts[2].objfcts[1:])),
]
Alphas = directives.AlphasSmoothEstimate_ByEig(alpha0_ratio=alpha0_ratio, verbose=True)
# initialize beta and beta/alpha_s schedule
beta = directives.BetaEstimate_ByEig(beta0_ratio=1e-4)
betaIt = directives.PGI_BetaAlphaSchedule(
    verbose=True,
    coolingFactor=2.0,
    tolerance=0.2,
    progress=0.2,
)
# geophy. and petro. target misfits
targets = directives.MultiTargetMisfits(
    verbose=True,
    chiSmall=0.5,  # ask for twice as much clustering (target value is /2)
)
# add learned mref in smooth once stable
MrefInSmooth = directives.PGI_AddMrefInSmooth(
    wait_till_stable=True,
    verbose=True,
)
# update the parameters in smallness (L2-approx of PGI)
update_smallness = directives.PGI_UpdateParameters(
    update_gmm=True,  # update the GMM each iteration
    kappa=np.array(
        [  # confidences in each mean phys. prop. of each cluster
            [1e10, 1e10],  # fixed HK
            [1e10, 1e10],  # fixed BCKGRD
            [1, 1],  # allow PK to move
        ]
    ),
)
# pre-conditioner
update_Jacobi = directives.UpdatePreconditioner()
# iteratively balance the scaling of the data misfits
scaling_init = directives.ScalingMultipleDataMisfits_ByEig(chi0_ratio=[1.0, 100.0])
scale_schedule = directives.JointScalingSchedule(verbose=True)

# Create inverse problem
# Optimization
# set lower and upper bounds
lowerbound = np.r_[-2.0 * np.ones(actvMap.nP), 0.0 * np.ones(actvMap.nP)]
upperbound = np.r_[0.0 * np.ones(actvMap.nP), 1e-1 * np.ones(actvMap.nP)]
opt = optimization.ProjectedGNCG(
    maxIter=30,
    lower=lowerbound,
    upper=upperbound,
    maxIterLS=20,
    maxIterCG=100,
    tolCG=1e-4,
)
# create inverse problem
invProb = inverse_problem.BaseInvProblem(dmis, reg, opt)
inv = inversion.BaseInversion(
    invProb,
    # directives: evaluate alphas (and data misfits scales) before beta
    directiveList=[
        Alphas,
        scaling_init,
        beta,
        update_smallness,
        targets,
        scale_schedule,
        betaIt,
        MrefInSmooth,
        update_Jacobi,
    ],
)
# Invert
pgi_model_no_info = inv.run(m0)

# Checkout location of the means in the updated gmm
print(reg.gmm.means_)

# Checkout confidences values for the means
print(update_smallness.kappa)

Error message:

The updated means are:

[[-7.86662499e-14  2.56498856e-14]
 [-1.00000000e+00  2.20776927e-13]
 [ 0.00000000e+00  1.00000000e-01]]

The confidences for the means are:

[[1.e+10 1.e+10]
 [1.e+10 1.e+10]
 [1.e+00 1.e+00]]

The learned means have been rearranged, but the confidence values kept the original order, creating an inconsistency between them.

Runtime information:

--------------------------------------------------------------------------------
  Date: Mon May 15 16:11:11 2023 PDT

                OS : Linux
            CPU(s) : 12
           Machine : x86_64
      Architecture : 64bit
               RAM : 15.6 GiB
       Environment : IPython
       File system : ext4

  Python 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC
  11.3.0]

            SimPEG : 0.19.0
        discretize : 0.8.3
       pymatsolver : 0.2.0
             numpy : 1.23.5
             scipy : 1.10.1
           sklearn : 1.2.2
        matplotlib : 3.7.1
           empymod : 2.2.1
            geoana : 0.4.0
            pandas : 1.5.3
            pydiso : 0.0.3
             numba : 0.56.4
              dask : 2023.3.1
             sympy : 1.11.1
           IPython : 8.11.0
        ipywidgets : 8.0.4
            plotly : 5.13.1
               vtk : 9.2.6
               utm : 0.7.0
   memory_profiler : 0.61.0
--------------------------------------------------------------------------------


thibaut-kobold commented 1 year ago

Just for the record (I haven't looked at it in detail): ordering is necessary when each cluster is updated based on a different prior set of parameters.
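
Schematically, a per-cluster prior update of this kind might look like the snippet below (a rough sketch only, not SimPEG's actual formula); the point is that row k of the prior parameters and confidences must describe the same cluster as row k of the fitted parameters.

import numpy as np

# Rough sketch only (not SimPEG's actual update): each cluster's mean is a
# confidence-weighted blend of its prior mean and its data-fitted mean.
# Reordering the fitted parameters but not the priors/confidences breaks
# this row-wise pairing.
def blend_means(means_fit, means_prior, kappa):
    kappa = np.asarray(kappa, dtype=float)  # same shape as the means arrays
    return (kappa * means_prior + means_fit) / (kappa + 1.0)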

thibaut-kobold commented 1 year ago

Just taking this occasion to voice a concern: I have lacked the time to review those PRs lately. However, I want to emphasize that the PGI code, while admittedly not the most beautiful, and with its own errors, has been tested on many case studies on my end, with many edge cases built into the code (though maybe without tests added for all those edge cases).

My concern is that those edge-case fixes and behaviours are not taken into account in those PRs. I understand it will be hard to do so, but take it as a warning to take extra precaution when a behaviour is not obvious.

santisoler commented 1 year ago

Thanks @thibaut-kobold for your replies!

Just for the record (I haven't looked at it in detail): ordering is necessary when each cluster is updated based on a different prior set of parameters.

I'd appreciate it if you could elaborate on this. I suspect you mean that when fitting the GaussianMixtureWithPrior we need to guarantee that we keep the same membership indices for each unit. How are we ensuring that?

Just taking this occasion to voice a concern: I have lacked the time to review those PRs lately. However, I want to emphasize that the PGI code, while admittedly not the most beautiful, and with its own errors, has been tested on many case studies on my end, with many edge cases built into the code (though maybe without tests added for all those edge cases).

My concern is that those edge-case fixes and behaviours are not taken into account in those PRs. I understand it will be hard to do so, but take it as a warning to take extra precaution when a behaviour is not obvious.

I understand your point. I just want to clarify that in every PR I title "Refactor ...", I'm actually performing refactors of the code: modifying the code to be more readable, eliminating some repeated lines, improving docs, moving class attributes to arguments of the constructor, removing unused variables, etc., but not changing its behaviour. This is why I just highlight potential issues (see my comment on #1221 regarding fix_membership) without jumping to apply a fix in that same PR.

The same reasoning goes into this bug report. I just found an unwanted behaviour related to an inconsistency between the confidences defined in the directive and the original GMM parameters (means, covariances, weights). Before jumping to apply a change to quickly fix it, I opened an issue so we can discuss it, I can learn more about the underlying processes of PGI, and we can ultimately think of a solution for it.

I'll go into more technical details related to this issue in another comment.

thibaut-kobold commented 1 year ago

PS: this is not a call to stop what you are doing, on the contrary :) It is more a highlight of my own shortcoming in addressing issues arising from case studies without adding specific tests for them.

I'd appreciate it if you could elaborate on this. I suspect you mean that when fitting the GaussianMixtureWithPrior we need to guarantee that we keep the same membership indices for each unit. How are we ensuring that?

GaussianMixtureWithPrior: yes, that is what is referred to here. This tutorial is likely the best publicly available example, in which each cluster has a specific role to play: https://docs.simpeg.xyz/content/tutorials/14-pgi/plot_inv_2_joint_pf_pgi_no_info_tutorial.html#sphx-glr-content-tutorials-14-pgi-plot-inv-2-joint-pf-pgi-no-info-tutorial-py

About fixed_membership: it's a bit fuzzy in my head, but I think it can likely be removed (just check it is not what is used in the AddMrefInSmooth directive). This was an earlier attempt at keeping only some cells at a fixed membership, but now the better way is through the weights_ of the GMM, as it can take an array of shape [n_active_cells, k_clusters], as sketched below.
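
Schematically, pinning memberships through the weights might look like this (a sketch under the assumption stated above, with hypothetical sizes and cell indices):

import numpy as np

# Sketch (assuming weights_ accepts an [n_active_cells, n_clusters] array):
# give pinned cells a one-hot weight row, and the global proportions to
# the rest.
n_active_cells, n_clusters = 1000, 3  # hypothetical sizes
weights = np.tile(np.r_[0.025, 0.9, 0.075], (n_active_cells, 1))
fixed_cells = np.r_[10, 42]  # hypothetical cell indices to pin
weights[fixed_cells, :] = 0.0
weights[fixed_cells, 2] = 1.0  # pin these cells to the cluster at index 2
# ...then assign the array to the mixture in use, e.g. gmmref.weights_ = weights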

santisoler commented 1 year ago

Since the reproducible example is quite long, it's easy to lose track of the details. I'll write down a summary here.

When defining the original GMM, I'm manually setting the means, covariances and weights of the three rock units. The rock units are in the following order: HK, BCKGRD and PK.

gmmref.means_ = np.c_[
    [0, 0.1],  # HK
    [0.0, 0.0],  # BCKGRD density contrast and mag. susc
    [-1, 0.0],  # PK
].T
gmmref.covariances_ = np.array(
    [
        [2.4e-03, 1.5e-06],
        [6e-04, 3.175e-07],
        [2.4e-03, 1.5e-06],
    ]
)
...
gmmref.weights_ = np.r_[0.025, 0.9, 0.075]

A few lines later, I define the PGI regularization, to which I pass the gmm I created:

# create joint PGI regularization with smoothness
reg = regularization.PGI(
    gmmref=gmmref,
    mesh=mesh,
    wiresmap=wires,
    maplist=[idenMap, idenMap],
    active_cells=actv,
    alpha_pgi=1.0,
    alpha_x=1.0,
    alpha_y=1.0,
    alpha_z=1.0,
    reference_model=utils.mkvc(
        gmmref.means_[gmmref.predict(m0.reshape(actvMap.nP, -1))]
    ),
    weights_list=[wr_grav, wr_mag],  # weights each phys. prop. by correct sensW
)

Later on, I need to define the PGI_UpdateParameters directive. I choose to update the means of the GMM, so I need to pass an array for the kappa confidences. I do so in the same order I defined the rock units in the original GMM:

# update the parameters in smallness (L2-approx of PGI)
update_smallness = directives.PGI_UpdateParameters(
    update_gmm=True,  # update the GMM each iteration
    kappa=np.array(
        [  # confidences in each mean phys. prop. of each cluster
            [1e10, 1e10],  # fixed HK
            [1e10, 1e10],  # fixed BCKGRD
            [1, 1],  # allow PK to move
        ]
    ),
)

I'm going to allow only the PK mean location to change, while fixing the means of the other two rock units.

After running the inversion, the updated GMM that lives inside the PGI regularization has the following means:

[[-7.86662499e-14  2.56498856e-14]
 [-1.00000000e+00  2.20776927e-13]
 [ 0.00000000e+00  1.00000000e-01]]

These are not ordered as I originally defined the rock units; the order is now BCKGRD, PK and HK. The issue is that the kappa values in the directive haven't changed:

[[1.e+10 1.e+10]
 [1.e+10 1.e+10]
 [1.e+00 1.e+00]]

So we end up with an unwanted behaviour: the rock unit whose mean is allowed to change is HK and not PK.


I narrowed down the issue a little bit. The ordering is done when defining the PGI regularization: the PGI.gmmref attribute is a copy of the original GMM that gets ordered by PGI's constructor (see https://github.com/simpeg/simpeg/blob/e309f2fbe53bc01d7f5b822c4afd04548b5c7d66/SimPEG/regularization/pgi.py#L710C28-L711).

So, one way to overcome this issue would be to get the actual order of the rock units after we define the PGI regularization, and define the confidence values according to it (see the sketch after the printed means below).

For this particular example, the PGI.gmm.means_ after defining the PGI object look like this:

[[ 0.   0. ]
 [-1.   0. ]
 [ 0.   0.1]]
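
Schematically, that workaround could look like this (a sketch that reuses gmmref and reg from the example above, and assumes the original gmmref keeps its order while reg.gmmref.means_ is an exact row permutation of gmmref.means_):

import numpy as np

# Sketch of the workaround: recover the permutation applied by PGI's
# constructor and permute kappa to match before building the
# PGI_UpdateParameters directive.
order = [
    int(np.flatnonzero((gmmref.means_ == row).all(axis=1))[0])
    for row in reg.gmmref.means_
]
kappa_original = np.array([[1e10, 1e10], [1e10, 1e10], [1.0, 1.0]])
kappa_reordered = kappa_original[order]  # rows now follow reg.gmmref's order
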
santisoler commented 1 year ago

GaussianMixtureWithPrior: yes, that is what is referred to here. This tutorial is likely the best publicly available example, in which each cluster has a specific role to play: https://docs.simpeg.xyz/content/tutorials/14-pgi/plot_inv_2_joint_pf_pgi_no_info_tutorial.html#sphx-glr-content-tutorials-14-pgi-plot-inv-2-joint-pf-pgi-no-info-tutorial-py

I actually used that example to build the reproducible code for this issue. This bug doesn't show up in it because the weights are defined in decreasing order, so the ordering doesn't make any change that would create an inconsistency between the GMM parameters and the confidences. When the weights aren't ordered (or some units have the same weights), the inconsistency appears.
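
To illustrate with hypothetical values (I haven't copied the tutorial's exact weights): already-decreasing weights survive a sort by decreasing weight unchanged, while the weights from this issue's example get permuted:

import numpy as np

# Decreasing weights are a no-op under a sort-by-decreasing-weight reorder;
# the weights used in this issue's example are not.
tutorial_like = np.r_[0.9, 0.075, 0.025]  # already decreasing
this_issue = np.r_[0.025, 0.9, 0.075]
print(np.argsort(tutorial_like)[::-1])  # [0 1 2] -> order preserved
print(np.argsort(this_issue)[::-1])  # [1 2 0] -> clusters reordered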

About fixed_membership: it's a bit fuzzy in my head, but I think it can likely be removed (just check it is not what is used in the AddMrefInSmooth directive). This was an earlier attempt at keeping only some cells at a fixed membership, but now the better way is through the weights_ of the GMM, as it can take an array of shape [n_active_cells, k_clusters].

Good to know. I think we could open a new issue/PR to remove it.