isrish / EM_WD

Weighted Data EM

Is the definition of "weight" in EM_WDF.m correct? #2

Open d-kitamura opened 2 weeks ago

d-kitamura commented 2 weeks ago

Hi, I'm using your code, and I have a question about the definition of the weight in EM_WDF.m.

I generated 100 samples each from clusters A and B, and 50 samples from cluster C. All three clusters are 2D normal distributions. I then set the weight to 1 for the samples from A and B, and to 2 for the samples from C. Since C has only 50 samples, half as many as A or B (100 each), a weight of 2 should make all three clusters count equally, because the weight can be interpreted as the number of times a sample is observed, as described in the paper.

However, the result was different from what I expected: the corresponding fitted cluster ends up with a very small contribution ratio. On the other hand, when I set the weight for the samples from C to 0.5 (the inverse of 2), I got the result I expected. Could you tell me why this happens?

Here is my checking code.

```matlab
clear; close all; clc;
addpath("./util/");
rng(1); % For reproducibility

%% Produce weighted observed data using 2D Gaussian distributions
% Set mean vectors and covariance matrices
mu1 = [-1; 4];
sigma1 = [2 0; 0 0.5];
mu2 = [-3; -3];
sigma2 = [1 0; 0 1];
mu3 = [2; -2];
sigma3 = [1 0.8; 0.8 2];

% Random sampling
nSample1 = 100; % number of data samples for the first distribution
nSample2 = 100; % number of data samples for the second distribution
nSample3 = 50; % number of data samples for the third distribution
nSampleAll = nSample1 + nSample2 + nSample3;
data1 = mvnrnd(mu1, sigma1, nSample1);
data2 = mvnrnd(mu2, sigma2, nSample2);
data3 = mvnrnd(mu3, sigma3, nSample3);
obsData = [ % nSampleAll x 2
    data1;
    data2;
    data3
];

% Set weights of data
weight1 = 1;
weight2 = 1;
weight3 = 2;
obsDataWeight = [ % fixed weight
    weight1 * ones(nSample1, 1);
    weight2 * ones(nSample2, 1);
    weight3 * ones(nSample3, 1)
];

% Show scatter graph
figure;
grpInd = [ones(nSample1, 1); 2 * ones(nSample2, 1); 3 * ones(nSample3, 1)]; % group indicator
gscatter(obsData(:, 1), obsData(:, 2), grpInd); % group-wise scatter graph
xlim([-8, 8]); ylim([-8, 8]);
title("Observed data");
xlabel("First dimension"); ylabel("Second dimension");
grid on;

%% Gaussian mixture model with fixed weighted data
% Fit model
nDist = 3; % number of distributions assumed
modelWdf = EM_WDF(obsData, obsDataWeight, nDist);
model = gmdistribution( ... % convert modelWdf to "gmdistribution" instance
    modelWdf.mu.', ... % apply transpose to mu because its property definition in the gmdistribution class is different from the output object of EM_WDF function
    modelWdf.Sigma, ...
    modelWdf.PComponents ...
);

% Show results
figure;
modelPdf = @(x, y) arrayfun(@(x0, y0) pdf(model, [x0, y0]), x, y);
fsurf(modelPdf); % 2D surface plot
xlim([-8, 8]); ylim([-8, 8]);
title("Gaussian mixture model with fixed data weight");
xlabel("First dimension"); ylabel("Second dimension");
grid on;
view(2); % top view
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% EOF %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
```
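One way to check the "weight as number of observations" reading is to build a reference fit in which the samples from C are literally duplicated and every sample has unit weight. The sketch below does this with `fitgmdist` from the Statistics and Machine Learning Toolbox rather than `EM_WDF`. The names `obsDataRep`, `nEff`, and `gmRep` are made up for this illustration, and the `'Replicates'` and `'RegularizationValue'` settings are arbitrary choices, not taken from the repository.

```matlab
rng(1); % For reproducibility

% Same three 2D Gaussians as in the script above
mu1 = [-1, 4];  sigma1 = [2 0; 0 0.5];
mu2 = [-3, -3]; sigma2 = [1 0; 0 1];
mu3 = [2, -2];  sigma3 = [1 0.8; 0.8 2];

data1 = mvnrnd(mu1, sigma1, 100);
data2 = mvnrnd(mu2, sigma2, 100);
data3 = mvnrnd(mu3, sigma3, 50);

% Read "weight 2" literally as "observed twice": duplicate the samples from C
obsDataRep = [data1; data2; repmat(data3, 2, 1)];

% Effective number of observations per cluster under this reading
nEff = [size(data1, 1), size(data2, 1), 2 * size(data3, 1)];
disp(nEff); % 100 100 100

% Reference fit with unit weights (assumption: plain GMM, not EM_WDF)
gmRep = fitgmdist(obsDataRep, 3, 'Replicates', 5, 'RegularizationValue', 1e-6);
disp(gmRep.ComponentProportion); % mixing proportions of the reference fit
```

If the mixing proportions from `EM_WDF` with `weight3 = 2` differ clearly from `gmRep.ComponentProportion`, that suggests the function does not treat the weight as a replication count, at least for the mixing proportions.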

isrish commented 1 week ago

I think your interpretation of the weighted data is right, but the way you generate the sample data doesn't fit the "independent but not identically distributed" assumption, since your sampling scheme is still IID.
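To make the distinction concrete, here is a minimal sketch of sampling that is independent but not identically distributed. It assumes that a weight acts as a per-sample precision, so that sample i is drawn from a Gaussian with covariance `sigma3 / w3(i)`. That is one reading of the weighted-data model and should be checked against the paper. The weight range and the names `w3` and `data3Niid` are hypothetical.

```matlab
rng(1); % For reproducibility

mu3 = [2, -2];
sigma3 = [1 0.8; 0.8 2];
nSample3 = 50;

% Hypothetical per-sample weights, drawn uniformly from [0.5, 2]
w3 = 0.5 + 1.5 * rand(nSample3, 1);

% Each sample has its own covariance sigma3 / w3(i) (assumed model):
% a larger weight gives a tighter distribution around mu3
data3Niid = zeros(nSample3, 2);
for i = 1:nSample3
    data3Niid(i, :) = mvnrnd(mu3, sigma3 / w3(i), 1);
end
```

In the original script, every sample from C is drawn from the same `mvnrnd(mu3, sigma3, ...)` call, so the samples are identically distributed whatever weights are attached afterwards.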