Closed ccjernigan closed 2 years ago
Can you share rows 0 and 14 of your data?
Sure thing, here's the two rows.
Row 0
546.0,699.0,614.5,36.73192130222964,19.23344825856491,50.13487128884382,42.5531914893617,6.382978723404255,1.0876422321777017,554.0,697.0,610.9083333333333,32.91496449876838,3.353788776840883,46.48834075690951,0.0,0.0,60.11130395539003,5.189328896622921,16.156840534773153,0.25,156.1500519000612,12.344921925244897,83.84315946522686,0.0625,810.3139765341565,64.06186007322721,0.0,0.010026041666666667,0.109375,0.02516286133852306,2617.0,3386.0,4406.0,512.2377377741707,0.0,0.5333333333333333,0.0
Row 14
554.0,820.0,656.6521739130435,67.87839367945013,26.078513931757858,94.20673124867663,37.77777777777778,15.555555555555555,2.2479107751292027,569.0,814.0,671.5666666666667,81.18737879861918,3.6656479358929115,114.78703090935839,0.0,0.0,275.5071544828891,2.7563723809213436,26.6214288306189,0.15625,804.6452417233597,17.128663336240933,73.37857116938112,0.0625,2217.901920726047,47.212974542114544,0.0,0.0013020833333333333,0.0234375,0.0050603066591062874,3875.0,4916.166666666667,9414.0,2209.53655019931,0.16666666666666666,0.13333333333333333,0.0
Note that these rows may not be exactly the same as my input to TSNE. My program queries a database and constructs the rows on-the-fly. The rows I've included in this comment were printed to CSV right before being passed into TSNE. So they should be very close, but possibly not exact (e.g. a decimal here or there might be different if there's any floating point imprecision).
On my end, I would need to try running the TSNE algorithm by loading the printed CSV to confirm they still reproduce the issue. I can do that tomorrow.
A few additional updates: If I normalize the data beforehand, then it'll run. Is normalization required?
The distance between these two samples is quite large, which may cause an overflow when searching for the kernel width. Since the columns have very different magnitudes, it helps to normalize your data first.
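To make the normalization suggestion concrete, here is a minimal sketch of column-wise z-score standardization. It is written in plain Python as an illustration only; it is not Smile's API, the function names are hypothetical, and the toy rows are made-up values that merely mimic columns of very different magnitude like the ones in rows 0 and 14.

```python
import math

def standardize_columns(rows):
    """Z-score each column: subtract the column mean, divide by the column std."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / n)
        for c, m in zip(cols, means)
    ]
    # A constant column has std 0; map it to 0.0 instead of dividing by zero.
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

def squared_distance(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy data: one small-valued column, one large-valued column,
# and one constant column.
toy = [
    [1.0, 4000.0, 0.0],
    [2.0, 9000.0, 0.0],
    [3.0, 5000.0, 0.0],
]
scaled = standardize_columns(toy)

print(squared_distance(toy[0], toy[1]))        # dominated by the large column
print(squared_distance(scaled[0], scaled[1]))  # every column now contributes comparably
```

After standardization no single column dominates the pairwise distances, which keeps the values fed into the kernel-width search in a moderate range.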
Before this is considered resolved, can I offer two constructive suggestions:
Something that confuses me is that when I add a Double array of NaNs to the end of the input, the output of TSNE at least appears to be working again.
Describe the bug
Expected behavior
A valid matrix with real numbers is emitted from TSNE.
Actual behavior
The entire output consists of NaN.
Code snippet
I've started trying to narrow the issue down further by modifying the TSNE class and checking representation invariants at the end of the update method's for-loop with this helper:
The result of this shows that
Iteration 1, P[0][14] is infinite
My current best guess is that either an overflow or a divide by zero is happening inside the TSNE implementation.
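As an illustration of how that kind of failure can arise in general, the sketch below computes Gaussian affinities exp(-beta * d) for very large squared distances. This is not Smile's code: the function names, the beta value, and the distances are all assumptions chosen for the demonstration. Python raises an exception on float division by zero, so the hypothetical helper ieee_divide emulates the IEEE 754 behavior that Java doubles follow.

```python
import math

def gaussian_weights(sq_dists, beta):
    """Unnormalized Gaussian affinities exp(-beta * d) and their sum."""
    weights = [math.exp(-beta * d) for d in sq_dists]
    return weights, sum(weights)

def ieee_divide(a, b):
    """Float division with IEEE 754 semantics (as for Java doubles),
    instead of Python's ZeroDivisionError."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan  # 0/0 is NaN
    # nonzero / zero is a signed infinity
    return math.copysign(math.inf, a) * math.copysign(1.0, b)

# Hypothetical squared distances of the order produced by unscaled columns.
big = [2.5e7, 3.4e7, 4.1e7]
weights, total = gaussian_weights(big, beta=1.0)
# Every weight underflows to 0.0, so the normalizer is 0.0 as well.
probs_big = [ieee_divide(w, total) for w in weights]
print(total, probs_big)

# The same computation on moderate distances (e.g. after standardization).
small = [6.9, 1.5, 3.2]
weights, total = gaussian_weights(small, beta=1.0)
probs_small = [ieee_divide(w, total) for w in weights]
print(sum(probs_small))
```

With the large distances every weight underflows to zero and normalizing divides zero by zero, so a single non-finite entry is enough to spread NaN through later iterations. Whether this is the exact path that makes P[0][14] infinite in the TSNE class is not confirmed by the thread.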
Input data
The input data is from a proprietary dataset with 38 columns and about 2,000 rows. I'm working on seeing if I can get a small data file that will reproduce the issue and will follow up; however, I didn't want to further delay reporting this issue.
If I do PCA first on the data, the bug still reproduces. (Output from PCA works as expected, but then passing that to TSNE will fail).
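One possible explanation for why PCA alone does not help, assuming the PCA step works on the covariance matrix without rescaling the columns (the thread does not say which variant was used): such a PCA amounts to centering plus a rotation, optionally dropping low-variance directions, so the large pairwise distances survive. The sketch below shows only the rotation part in two dimensions, with made-up points and hypothetical function names.

```python
import math

def rotate(point, angle):
    """Rotate a 2D point about the origin by the given angle in radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hypothetical points with one small-valued and one large-valued coordinate.
a, b = (1.0, 4000.0), (2.0, 9000.0)
before = distance(a, b)
after = distance(rotate(a, 0.7), rotate(b, 0.7))
print(before, after)  # equal up to floating-point rounding
```

A rotation changes the coordinates but not the distances, so the values seen by the kernel-width search stay just as large as before.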
Additional context