Segmentation of Multivariate Mixed Data via Lossy Data Coding and Compression
Ma et al., TPAMI 2007
In this paper, based on ideas from lossy data coding and compression, we present a simple but effective technique for segmenting multivariate mixed data that are drawn from a mixture of Gaussian distributions, which are allowed to be almost degenerate. The goal is to find the optimal segmentation that minimizes the overall coding length of the segmented data, subject to a given distortion. By analyzing the coding length/rate of mixed data, we formally establish some strong connections of data segmentation to many fundamental concepts in lossy data compression and rate distortion theory. We show that a deterministic segmentation is approximately the (asymptotically) optimal solution for compressing mixed data. We propose a very simple and effective algorithm which depends on a single parameter, the allowable distortion. At any given distortion, the algorithm automatically determines the corresponding number and dimension of the groups and does not involve any parameter estimation. Simulation results reveal intriguing phase-transition-like behaviors of the number of segments when changing the level of distortion or the amount of outliers. Finally, we demonstrate how this technique can be readily applied to segment real imagery and bioinformatic data.
🔑 Key idea:
A seminal application of rate-distortion theory to data segmentation; the relevant background is covered in the classic book, "Elements of Information Theory."
✏️ Summary:
The lossy coding length function $L(V) = \frac{m+n}{2} \log_2 \det \big( I + \frac{n}{\epsilon^2 m} V V^\intercal \big)$ in Eqn. 6 provides a tight upper bound on the number of bits needed to encode the data when the underlying distribution is a degenerate or nondegenerate Gaussian.
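As a concrete sketch (function and variable names are mine, not from the paper), the coding length of Eqn. 6 can be evaluated directly with NumPy; `slogdet` is used instead of `det` for numerical stability:

```python
import numpy as np

def coding_length(V, eps=0.1):
    """Bits needed to encode the columns of the n x m data matrix V
    up to squared distortion eps**2 (Eqn. 6 of the paper)."""
    n, m = V.shape
    # log det via slogdet to avoid overflow on larger matrices
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (eps**2 * m)) * (V @ V.T))
    return (m + n) / 2.0 * logdet / np.log(2.0)
```

Data concentrated near a low-dimensional subspace costs fewer bits than full-rank data of comparable scale, which is what makes this length a useful segmentation objective.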
Geometrically, minimizing the coding length is equivalent to reducing the dimension of the data. Recall the geometric interpretation of the determinant as a volume (Appendix A).
Geometric interpretation (Sec. 3). When each vector is distorted by Gaussian noise, the volume of the region spanned by the vectors is proportional to the square root of the determinant of their covariance matrix, $\det\big(\frac{\epsilon^2}{n} I + \frac{1}{m} V V^\intercal\big)^{1/2}$; similarly, the volume of each noise sphere is proportional to $\det\big(\frac{\epsilon^2}{n} I\big)^{1/2}$. The number of nonoverlapping noise spheres needed to pack this region determines the number of bits required to encode the data, which yields the same expression for $L(V)$ (Eqn. 11).
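A quick numerical check of this sphere-packing view (a sketch with my own variable names): the log-ratio of the two volumes reproduces the log-det term of $L(V)$ exactly, since $\det(\frac{\epsilon^2}{n} I + \frac{1}{m} V V^\intercal) / \det(\frac{\epsilon^2}{n} I) = \det(I + \frac{n}{\epsilon^2 m} V V^\intercal)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, eps = 4, 30, 0.2
V = rng.standard_normal((n, m))

# Volume of the region spanned by the noise-distorted vectors
# is proportional to sqrt(det(eps^2/n I + V V^T / m)).
_, ld_region = np.linalg.slogdet(eps**2 / n * np.eye(n) + V @ V.T / m)
# Volume of a single noise sphere: sqrt(det(eps^2/n I)).
_, ld_sphere = np.linalg.slogdet(eps**2 / n * np.eye(n))

bits_from_ratio = 0.5 * (ld_region - ld_sphere) / np.log(2.0)

# The same quantity written as the log-det term of Eqn. 6/11:
_, ld = np.linalg.slogdet(np.eye(n) + n / (eps**2 * m) * V @ V.T)
assert np.isclose(bits_from_ratio, 0.5 * ld / np.log(2.0))
```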
🤔 Confidence:
✏️ Memo: