TSNE Emits NaN values

ccjernigan commented 2 years ago

Describe the bug

Call TSNE with certain datasets using the default parameter values.
Check the output of TSNE

Expected behavior: A valid matrix with real numbers is emitted from TSNE.

Actual behavior: The entire output consists of NaN.

Code snippet

val tsne = tsne(X = x, d = 2, perplexity = 20.0, eta = 200.0, iterations = 1000)
checkMatrix(tsne.coordinates, "tsne") // Throws because entire output is NaN values

private fun checkMatrix(matrix: Array<DoubleArray>, name: String) {
    for (x in matrix.indices) {
        for (y in matrix[x].indices) {
            checkDouble(matrix[x][y], "$name[$x][$y]")
        }
    }
}

private fun checkDouble(value: Double, name: String) {
    require(!value.isNaN()) { "$name must be a real number" }
    require(value.isFinite()) { "$name must be a finite value" }
}

I've started trying to narrow the issue down further, by modifying the TSNE class and checking representation invariants at the end of the update method's for-loop with this helper:

    private void checkrep() {
        checkMatrix(gains, "gains");
        checkMatrix(P, "P");
        checkMatrix(Q, "Q");

        checkDouble(Qsum, "QSum");
        checkDouble(cost, "cost");

        checkMatrix(coordinates, "coordinates");
    }

    private void checkMatrix(double[][] matrix, String name) {
        for (int x = 0; x < matrix.length; x++) {
            for (int y = 0; y < matrix[x].length; y++) {
                checkDouble(matrix[x][y], String.format("%1$s[%2$d][%3$d]", name, x, y));
            }
        }
    }

    private void checkDouble(double value, String name) {
        if (Double.isNaN(value)) {
            throw new AssertionError(String.format(Locale.US, "%1$s is NaN", name));
        }

        if (Double.isInfinite(value)) {
            throw new AssertionError(String.format(Locale.US, "%1$s is infinite", name));
        }
    }

This check fails at iteration 1, where P[0][14] is infinite.

My current best guess is either an overflow or a divide by zero is happening inside the TSNE implementation.
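To make that guess concrete, here is a minimal Java sketch of how a divide by zero could arise. It is not Smile's actual code. It assumes only the textbook t-SNE step in which squared distances are turned into Gaussian affinities exp(-beta * d) and then divided by their sum. The class name, method names, and input values are hypothetical; the values are merely chosen to have magnitudes similar to unscaled columns.

```java
final class AffinityUnderflowDemo {
    /** Squared Euclidean distance between two rows of equal length. */
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Gaussian affinities exp(-beta * d), divided by their sum. */
    public static double[] normalizedAffinities(double[] squaredDistances, double beta) {
        double[] p = new double[squaredDistances.length];
        double sum = 0.0;
        for (int j = 0; j < p.length; j++) {
            // For large beta * d, Math.exp underflows to exactly 0.0.
            p[j] = Math.exp(-beta * squaredDistances[j]);
            sum += p[j];
        }
        for (int j = 0; j < p.length; j++) {
            // If every term underflowed, this is 0.0 / 0.0, which is NaN.
            p[j] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical unscaled rows whose columns differ by thousands.
        double[] a = {546.0, 2617.0, 4406.0};
        double[] b = {554.0, 3875.0, 9414.0};
        double d = squaredDistance(a, b);
        double[] p = normalizedAffinities(new double[] {d, 2.0 * d}, 1.0);
        System.out.println("squared distance = " + d);
        System.out.println("p[0] is NaN: " + Double.isNaN(p[0]));
    }
}
```

Once one NaN or infinite value enters P, every later gradient step that reads it is contaminated, which would be consistent with an output made up entirely of NaN.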

Input data: The input data is from a proprietary dataset with 38 columns and about 2,000 rows. I'm working on getting a small data file that reproduces the issue and will follow up, but I didn't want to delay reporting this any further.

If I do PCA first on the data, the bug still reproduces. (Output from PCA works as expected, but then passing that to TSNE will fail).

Additional context

OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
Smile 2.6.0
macOS 12.0.1
Data is able to be processed with sklearn

haifengl commented 2 years ago

Can you share your data row 0 and 14?

ccjernigan commented 2 years ago

Sure thing, here's the two rows.

Row 0 546.0,699.0,614.5,36.73192130222964,19.23344825856491,50.13487128884382,42.5531914893617,6.382978723404255,1.0876422321777017,554.0,697.0,610.9083333333333,32.91496449876838,3.353788776840883,46.48834075690951,0.0,0.0,60.11130395539003,5.189328896622921,16.156840534773153,0.25,156.1500519000612,12.344921925244897,83.84315946522686,0.0625,810.3139765341565,64.06186007322721,0.0,0.010026041666666667,0.109375,0.02516286133852306,2617.0,3386.0,4406.0,512.2377377741707,0.0,0.5333333333333333,0.0

Row 14 554.0,820.0,656.6521739130435,67.87839367945013,26.078513931757858,94.20673124867663,37.77777777777778,15.555555555555555,2.2479107751292027,569.0,814.0,671.5666666666667,81.18737879861918,3.6656479358929115,114.78703090935839,0.0,0.0,275.5071544828891,2.7563723809213436,26.6214288306189,0.15625,804.6452417233597,17.128663336240933,73.37857116938112,0.0625,2217.901920726047,47.212974542114544,0.0,0.0013020833333333333,0.0234375,0.0050603066591062874,3875.0,4916.166666666667,9414.0,2209.53655019931,0.16666666666666666,0.13333333333333333,0.0

Note that these rows may not be exactly the same as my input to TSNE. My program queries a database and constructs the rows on-the-fly. The rows I've included in this comment were printed to CSV right before being passed into TSNE. So they should be very close, but possibly not exact (e.g. a decimal here or there might be different if there's any floating point imprecision).

On my end, I would need to try running the TSNE algorithm by loading the printed CSV to confirm they still reproduce the issue. I can do that tomorrow.

ccjernigan commented 2 years ago

A few additional updates: If I normalize the data beforehand, then it'll run. Is normalization required?

harryyu1018 commented 2 years ago

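This is an automatic vacation reply from QQ Mail. I have received your email and will read and reply to it as soon as possible. Wishing you success at work and a pleasant day!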

haifengl commented 2 years ago

The distance between these two samples is quite large, which may cause an overflow when searching for the kernel width. Since the columns have very different magnitudes, it helps to normalize your data first.
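As a rough illustration of the normalization suggested here, the following Java sketch standardizes each column to zero mean and unit standard deviation before the data goes to TSNE. It is a hand-written helper, not part of Smile. The class and method names are hypothetical, and the choice of z-scores with the population standard deviation is an assumption; other scalings may work equally well. Constant columns (several columns are 0.0 in both posted rows) are mapped to 0.0 so that the scaling itself cannot introduce NaN.

```java
import java.util.Arrays;

final class ColumnStandardizer {
    /**
     * Returns a copy of x in which every column has mean 0 and
     * (population) standard deviation 1. Constant columns become 0.0.
     * Assumes x is non-empty and rectangular.
     */
    public static double[][] standardize(double[][] x) {
        int n = x.length;
        int d = x[0].length;
        double[][] z = new double[n][d];
        for (int j = 0; j < d; j++) {
            double mean = 0.0;
            for (double[] row : x) {
                mean += row[j];
            }
            mean /= n;

            double variance = 0.0;
            for (double[] row : x) {
                double diff = row[j] - mean;
                variance += diff * diff;
            }
            double sd = Math.sqrt(variance / n);

            for (int i = 0; i < n; i++) {
                // Guard against division by zero for constant columns.
                z[i][j] = sd > 0.0 ? (x[i][j] - mean) / sd : 0.0;
            }
        }
        return z;
    }

    public static void main(String[] args) {
        // Hypothetical data: columns on very different scales, one constant.
        double[][] x = {
            {1.0, 2617.0, 0.0},
            {3.0, 3875.0, 0.0},
        };
        for (double[] row : standardize(x)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The standardized matrix can then be passed as X in place of the raw data in the call from the original report.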

ccjernigan commented 2 years ago

Before this is considered resolved, can I offer two constructive suggestions:

Consider adding API docs that let callers know normalization is expected
Can the TSNE implementation be made to fail-fast if an overflow occurs?

CaelumF commented 2 years ago

Something that confuses me is that when I add a Double array of NaNs to the end of the input, the output of tsne at least appears to be working again

haifengl / smile

TSNE Emits NaN values #702