Closed: bmschmidt closed this issue 8 years ago
Same here: there's an outlier in the first row of the result, and even if I remove the first row of the dataset, the outlier still shows up in the first row of the newly generated 2D data.
Sounds like a line-ending issue.
@lferry007 used a Linux box for development, so he may not be accounting for Mac OS line-ending differences in the code. It is a simple fix.
Confirmed that something OS-related seems likely. I just ran the same file on my Linux machine, and it does work there. What would the fix be?
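In the meantime I'll try normalizing the line endings on my side. A minimal sketch of that preprocessing step (assuming stray \r characters really are the culprit; this is a workaround, not the in-code fix, and the script name is just a placeholder):

# normalize_eol.py -- hypothetical helper, not part of LargeVis.
# Rewrites an input file with Unix (\n) line endings, stripping any
# CR characters left over from Mac/Windows editors.
import sys

def normalize(in_path, out_path):
    with open(in_path, "rb") as f:
        data = f.read()
    # Convert CRLF and bare CR to LF.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(out_path, "wb") as f:
        f.write(data)

if __name__ == "__main__":
    normalize(sys.argv[1], sys.argv[2])

Something like python normalize_eol.py failing_largevis.txt fixed.txt, then feeding fixed.txt to LargeVis, should at least rule line endings in or out.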
@bmschmidt Try to reduce the sample size. Works for me in some cases.
@asxzy: Yeah, eliminating most of the points does make it run; 100K and 25K points fail, but with just 5K points the program is able to run with normal-looking results on OS X. (The real data here is 1 to 5 million points--any smaller and T-SNE is fine for my purposes.)
Hard to see where the point of failure is, though. A single run takes too long for it to be practical to find whether it's an individual row or a threshold size where it breaks.
EDIT: Ah, I didn't understand that you meant the 'samples' argument. Entering low values there does not eliminate the problem for me. Test knn accuracy, though, is reported as 99.81%.
Seeing similar behavior on a Windows OS.
@asxzy is this the -samples parameter? Just confirming how it is supposed to be set up- if I have 1 million records, -samples is set to 1e6/1e8 by default, correct? I've been trying this without luck so I may have misunderstood something.
Also I have noticed that test knn accuracy has been reporting 0% when I get this result.
@bmschmidt The link to your data set does not work now. Could you provide a new link? BTW, we've fixed some of the bugs, so you may rerun the code on your data set. If there's still a problem, please let us know.
Thanks, Jian https://sites.google.com/site/pkujiantang/home
I just pulled and retested and still (unfortunately) experiencing this issue. I've restored a (smaller) failing sample file under OS X to that link.
We checked the data; it is a data-format problem. Your first line indicates there are 1000 lines of data. We changed it to 99, and the visualization looks OK now. To generate meaningful visualizations, you may need to upload the entire data set.
Jian
Oh no, in my rush to replace the online file I uploaded a truncated version. This was not the original problem. I'm terribly sorry to have wasted your time on that file, and I wouldn't blame you for giving up on me now.
If not: I replaced that link with the full 10,000-row set, and it still fails. Pasted below is console output showing the different behavior under Ubuntu 14.04 and Mac OS X. As you can see, the same input file (same SHA1 hash) produces good results on Ubuntu but bad ones on OS X.
bschmidt@sibelius:~/LargeVis/Linux$ openssl sha1 failing_largevis.txt
SHA1(failing_largevis.txt)= a4b3b7893a5b8f4ef29d82901ae3cb5f654a9abd
bschmidt@sibelius:~/LargeVis/Linux$ wc -l failing_largevis.txt
10001 failing_largevis.txt
bschmidt@sibelius:~/LargeVis/Linux$ head -1 failing_largevis.txt
10000 640
bschmidt@sibelius:~/LargeVis/Linux$ ./LargeVis -input failing_largevis.txt -output t.txt -outdim 2 -samples 100
Reading input file failing_largevis.txt ...... Done.
Total vertices : 10000 Dimension : 640
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 99.81%
Computing similarities ...... Done.
Fitting model Alpha: 0.000700 Progress: 99.930%
bschmidt@sibelius:~/LargeVis/Linux$ head t.txt
10000 2
-14.740220 -5.766308
-1.294758 3.144548
-4.635156 9.079961
-9.019074 -10.023488
16.997780 25.454086
-25.735485 -14.697220
-3.328367 -14.550834
-4.309145 -9.262841
-37.779449 6.267919
-bash-3.2$ openssl sha1 failing_largevis.txt
SHA1(failing_largevis.txt)= a4b3b7893a5b8f4ef29d82901ae3cb5f654a9abd
-bash-3.2$ wc -l failing_largevis.txt
10001 failing_largevis.txt
-bash-3.2$ ./LargeVis -input failing_largevis.txt -output t.txt -outdim 2 -samples 100
Reading input file failing_largevis.txt ...... Done.
Total vertices : 10000 Dimension : 640
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 99.81%
Computing similarities ...... Done.
Fitting model Alpha: 0.000700 Progress: 99.930%
bash-3.2$ head t.txt
10000 2
-76.238503 -7.682394
0.147972 0.035222
0.147989 0.035224
0.147941 0.035219
0.148011 0.035226
0.147880 0.035213
0.147879 0.035213
0.147913 0.035216
0.147693 0.035194
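For reference, here's roughly how I'm flagging the bad output without plotting it. A minimal sketch (the script name and thresholds are arbitrary; it just checks for the pattern above, one far-away first point with everything else collapsed onto nearly identical coordinates):

# check_embedding.py -- hypothetical sanity check for a LargeVis 2D output file.
import sys
import numpy as np

def looks_degenerate(path, spread_tol=1e-3):
    coords = np.loadtxt(path, skiprows=1)        # skip the "N 2" header line
    rest = coords[1:]                            # every point after the first
    spread = rest.std(axis=0).max()              # largest per-axis standard deviation
    outlier = np.abs(coords[0] - rest.mean(axis=0)).max()
    return spread < spread_tol and outlier > 10 * spread_tol

if __name__ == "__main__":
    print("degenerate" if looks_degenerate(sys.argv[1]) else "looks ok")

On the two t.txt files above, this reports "looks ok" for the Ubuntu run and "degenerate" for the OS X run.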
I have the same issue using Ubuntu 15.10. Has anyone solved this yet?
After running Largevis on my dataset, the first point is orders of magnitude larger than the remaining points after dimensionality reduction, resulting in a meaningless plot.
root@blah-VirtualBox:/home/blah/Desktop/LargeVis/20161012# ./LargeVis -input 1k_points.txt -output 1k_2d.txt
Reading input file 1k_points.txt ...... Done.
Total vertices : 1000 Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 95.98%
Computing similarities ...... Done.
Fitting model Alpha: 0.000100 Progress: 99.993%
root@blah-VirtualBox:/home/blah/Desktop/LargeVis/20161012#
root@blah-VirtualBox:/home/blah/Desktop/LargeVis/20161012# head 1k_2d.txt
1000 2
-31.457289 -0.287726
12.466423 -0.287530
12.466411 -0.287530
12.466626 -0.287530
12.466501 -0.287530
12.466530 -0.287530
12.466509 -0.287530
12.466496 -0.287530
12.466705 -0.287530
Here is a link to the input data: https://www.dropbox.com/s/bvup56przujg52d/1k_points.txt?dl=0
And a link to the Largevis output: https://www.dropbox.com/s/jk2p0qof2sn7hr9/1k_2d.txt?dl=0
Any guidance would be greatly appreciated. Thank you!
Guys - @spamcatcher345 posted his data on my git, so I took a look and tried it out with my implementation. In the first place, the data that was uploaded doesn't have 64 dimensions, it has 58. I don't know whether that would affect the output with @lferry007's implementation. Also, the data has 18 duplicates.
When I run my implementation of largeVis on it, what I get looks gaussian.
How was this data generated?
Using this data on Ubuntu 14.04 and @lferry007's implementation, I find that it produces a good embedding once the header is corrected to 58 dimensions.
On OS X, it also works when properly labeled.
What do you mean by a "good" embedding?
By "good," I mean it does not have the first point as an extreme outlier with no structure in all the later points; instead, it looks like what T-SNE produces when run on data that has some internal structure. So I'm validating by looking at it: here's the plot.
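Concretely, the validation is just a scatter plot of the LargeVis output; a minimal sketch (t.txt being the output file from the run above):

# plot_embedding.py -- quick visual check of a LargeVis 2D output file.
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # write to a file; no display needed
import matplotlib.pyplot as plt

coords = np.loadtxt("t.txt", skiprows=1)   # skip the "N 2" header line
plt.scatter(coords[:, 0], coords[:, 1], s=2, alpha=0.3)
plt.title("LargeVis 2D embedding")
plt.savefig("embedding.png", dpi=150)

A good embedding fills the plane with visible clusters; the broken one is a single dot plus a smear near the origin.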
A mighty "slap" rang out across the land. ./facepalm
Somewhere along the way, while moving files around VMs/hosts/networks, I somehow managed to drop the last 6 columns when creating the 1k sample data... Testing again with 58 dimensions worked in my Ubuntu 15.10 VM, compiled with the current version of GSL. Once it worked with the 1k sample data, I went back to the original data set (which DOES have 64 dimensions) and uncovered a few other self-induced issues caused by the formatting/etl I was doing. Happy to report success with 20k rows of input data. Next step is scaling up to 1-10M+ rows.
A few notes I've made along the way, in case it may help someone else in the future:
--Individual vectors must be written with decimal digits of precision. E.g., zeros must be represented as 0.000000, not 0.
--Execution will fail (core dump) if the first line of the input file declares MORE dimensions than the data actually has.
--Execution will not fail if the first line declares FEWER dimensions than the data actually has, up to a point: when I declare 5 dimensions on a set with 64, it fails; when I declare 25 dimensions on a set with 64, it succeeds. Not sure what this does to the end result, but it can't be "right".
^^^Suggest a bug fix/enhancement here: either better error checking and messages or, better yet, parse the file for the correct number of rows and dimensions as part of execution.
--Generally speaking, there is a need for descriptive error handling. Any number of issues with the data, config, or otherwise result in the same screen output: Segmentation fault (core dumped). Advice for others: be absolutely meticulous.
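Until something like that is built in, here's the rough pre-flight check I've started running on input files, sketched in Python (not exhaustive, and the script name is just a placeholder):

# validate_largevis_input.py -- hypothetical pre-flight check for a LargeVis input file.
# Verifies that the header matches the data: row count, dimension count,
# and that every value parses as a float.
import sys

def validate(path):
    with open(path) as f:
        n, d = map(int, f.readline().split())          # header: "<rows> <dims>"
        rows = 0
        for lineno, line in enumerate(f, start=2):
            values = line.split()
            if len(values) != d:
                return f"line {lineno}: expected {d} values, got {len(values)}"
            try:
                [float(v) for v in values]
            except ValueError:
                return f"line {lineno}: non-numeric value"
            rows += 1
    if rows != n:
        return f"header says {n} rows, file has {rows}"
    return "ok"

if __name__ == "__main__":
    print(validate(sys.argv[1]))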
Thanks again @elbamos , @bmschmidt and @lferry007 ! This has gotten me past the bottleneck of t-SNE for this project. Oh, @elbamos - this is being generated from spam email data, labeled by k-means.
Hello again @bmschmidt, @lferry007, @elbamos & all,
Could someone please test this sample data with their own installation of Largevis and let me know if it works for you?
https://www.dropbox.com/s/fky8qec7mf3elfe/50k_500c_1000i_points.zip?dl=0
I've been going rounds with this, and no luck so far. At some indeterminate point, dependent on the data set, the dimensionality reduction fails and reverts to the symptoms above.
TL;DR below, so that you know I've done my homework before asking...
Thanks again in advance.
..................................................
--make sure the number of rows specified in the infile header row exactly matches the number of rows in the file
wc -l 50k_500c_1000i_points.txt
50000 50k_500c_1000i_points.txt
--make sure all rows have the same number of features
while read i; do echo $i | tr ' ' '\n' | wc -l; done < 50k_500c_1000i_points.txt | sort | uniq -c | sort -nr
50000 64
--make sure all values are between 0-1, no funny chars, text, etc.
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# cat 50k_500c_1000i_points.txt | tr ' ' '\n' | sort | head
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# cat 50k_500c_1000i_points.txt | tr ' ' '\n' | sort | tail
1.00000000
1.00000000
1.00000000
1.00000000
1.00000000
1.00000000
1.00000000
1.00000000
--Run Largevis on all 50k rows (results in garbage out)
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 50k_500c_1000i_points.txt -output 50k_2d -samples 10
Reading input file 50k_500c_1000i_points.txt ...... Done.
Total vertices : 50000 Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 94.87%
Computing similarities ...... Done.
Fitting model Alpha: 0.007901 Progress: 99.210%
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 50k_2d
50000 2
-93.259697 92.923935
0.004132 -0.007910
0.005585 -0.003991
0.005309 -0.004095
0.005638 -0.003910
0.003409 -0.001608
0.005987 -0.003725
0.001725 -0.003395
0.002095 -0.003146
--So, let's try a sample set of the first 10k rows from the same file
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head -10000 50k_500c_1000i_points.txt > 10k_500c_1000i_points.txt
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ll
-rwxrwx--- 1 root vboxsf  7050000 Oct 26 16:36 10k_500c_1000i_points.txt
-rwxrwx--- 1 root vboxsf 35250000 Oct 19 12:49 50k_500c_1000i_points.txt
--Run Largevis on the first 10k rows (results in success)
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 10k_500c_1000i_points.txt -output 10k_2d -samples 10
Reading input file 10k_500c_1000i_points.txt ...... Done.
Total vertices : 10000 Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 97.99%
Computing similarities ...... Done.
Fitting model Alpha: 0.007901 Progress: 99.210%
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 10k_2d
10000 2
11.317126 -16.769339
1.494954 -14.870223
5.892348 -8.776334
-15.464440 1.080700
-4.194473 13.608239
-4.832858 -10.284916
18.223331 0.966998
13.740998 -1.130792
-10.511693 12.002342
--Okay, let's try 25k rows now (results in garbage out)
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 25k_500c_100i_points.txt -output 25k_2d -samples 10
Reading input file 25k_500c_100i_points.txt ...... Done.
Total vertices : 25000 Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 97.83%
Computing similarities ...... Done.
Fitting model Alpha: 0.007901 Progress: 99.210%
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 25k_2d
25000 2
-97.675056 87.540543
-0.002500 -0.015151
0.006426 -0.008289
0.006369 -0.008438
0.006459 -0.008278
0.010856 -0.011586
0.006840 -0.008010
0.007054 -0.009687
0.006959 -0.009642
Try with 12k, 15k, 17k, 18k, 19k, and so forth...
It doesn't appear to be caused by the number of rows or the number of dimensions. I've run the MNIST data successfully, as well as randomly generated data with 100 dimensions x 50,000 rows, both integers and floats. I've tried changing the digits of precision from 2 to 8 with no effect.
What I'm struggling with is that the cutoff point at which it starts producing garbage out is dependent on the data set. I've had successful runs of 20k rows from a different dataset, but this one stops working after 17998 rows:
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head -17999 50k_500c_1000i_points.txt > 17998_500c_1000i_points.txt
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# vi 17998_500c_1000i_points.txt
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 17998_500c_1000i_points.txt -output 17998_2d -samples 10
Reading input file 17998_500c_1000i_points.txt ...... Done.
Total vertices : 17998 Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 95.96%
Computing similarities ...... Done.
Fitting model Alpha: 0.007901 Progress: 99.210%
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 17998_2d
17998 2
-4.273000 -5.816042
-9.945149 12.071325
6.342212 6.314149
5.711284 -5.599425
-14.465376 -2.408209
-1.881844 -0.892462
-7.527262 12.491118
2.002883 -5.556159
13.538328 6.924161
It core dumps on 17999 rows:
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 17999_500c_1000i_points.txt -output 17999_2d -samples 10
Reading input file 17999_500c_1000i_points.txt ...... Done.
Total vertices : 17999 Dimension : 64
Normalizing ...... Done.
Running ANNOY ......Segmentation fault (core dumped)
And at 18000 rows, it's back to garbage out:
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 18k_2d
18000 2
-123.057167 -62.150906
0.013629 0.007195
0.010898 0.006385
0.010933 0.006196
0.010904 0.006396
0.004929 0.011840
0.010916 0.006575
0.011646 0.003596
0.011509 0.003789
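Rather than hand-picking cutoffs (12k, 15k, 17k, ...), I may automate the search for the smallest failing prefix with something like the sketch below. It assumes ./LargeVis is in the working directory, that the 50k file has the usual "<rows> <dims>" header line, that the degenerate-output check sketched earlier in the thread is importable (the module name is hypothetical), that a crash counts as a failure, and that the failure is roughly monotonic in prefix size, which, given the 17998/17999/18000 behaviour above, may not strictly hold:

# bisect_failure.py -- hypothetical helper to find the smallest failing prefix.
import subprocess
from check_embedding import looks_degenerate   # hypothetical module from the earlier sketch

SRC, DIMS = "50k_500c_1000i_points.txt", 64

def write_prefix(rows, dst):
    with open(SRC) as fin, open(dst, "w") as fout:
        fin.readline()                         # drop the original header
        fout.write(f"{rows} {DIMS}\n")         # write a header matching the prefix
        for _ in range(rows):
            fout.write(fin.readline())

def run_ok(rows):
    infile, outfile = f"prefix_{rows}.txt", f"prefix_{rows}_2d.txt"
    write_prefix(rows, infile)
    result = subprocess.run(["./LargeVis", "-input", infile,
                             "-output", outfile, "-samples", "10"])
    if result.returncode != 0:                 # segfault / crash counts as a failure
        return False
    return not looks_degenerate(outfile)

def smallest_failure(lo=10000, hi=50000):
    # invariant: lo succeeds, hi fails (per the runs above)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_ok(mid):
            lo = mid
        else:
            hi = mid
    return hi

if __name__ == "__main__":
    print("first failing prefix size:", smallest_failure())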
If it doesn't visualize with the reference implementation, it's exceedingly unlikely that it would work with my implementation. If you'd like to give it a try, it should take about 10 minutes to write the code to read in this data and another 15 to get results.
Hi all, we've updated the code, so please give it a try. If there is still a problem, feel free to contact us. Thanks!
@lferry007 I just went through the diff - am I correct that there are no changes to the implementation of the algorithm after the neighbor search step? Thanks
I can confirm from my side that the same file produces meaningful output after updating.
Awesome!
Thanks very much @lferry007 !!
=====================UPDATE 20161102 ==============================
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# md5sum 50k_500c_1000i_points.txt
32f71633ef2a83764258c0378e9906b5  50k_500c_1000i_points.txt
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# ./LargeVis -input 50k_500c_1000i_points.txt -output 50k_2d -samples 10
Reading input file 50k_500c_1000i_points.txt ...... Done.
Total vertices : 50000 Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 94.87%
Computing similarities ...... Done.
Fitting model Alpha: 0.007901 Progress: 99.210%
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# head 50k_2d
50000 2
-126.974144 31.734627
0.011018 0.004690
0.004896 0.003125
0.004526 0.002717
0.005311 0.003130
0.003361 0.000967
0.006694 0.003260
0.003402 -0.003218
0.005849 -0.003446
================Updated to new version=================
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# md5sum 50k_500c_1000i_points.txt
32f71633ef2a83764258c0378e9906b5  50k_500c_1000i_points.txt
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# ./Largevis -input 50k_500c_1000i_points.txt -output 50k_2d -samples 10
Reading input file 50k_500c_1000i_points.txt ...... Done.
Total vertices : 50000 Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 99.91%
Computing similarities ...... Done.
Fitting model Alpha: 0.007901 Progress: 99.210%
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# head 50k_2d
50000 2
1.729691 -6.422237
-10.387295 1.105057
-10.262017 4.892396
-8.165269 7.217187
-1.564149 -9.366737
7.682181 -4.121569
-4.760568 13.656388
-4.644851 -8.798579
2.579137 10.852392
@spamcatcher345 Glad to know it works! Thanks!
Jian
This fixes the outlier problem on my previously-broken test data as well. Thanks!
I'm able to run the MNIST data reduction as in the description, but the output of the program for my own data is nonsense. Any leads appreciated. Running on OS X.
The output of LargeVis is one outlier point, with all other points in a very tightly-clumped diagonal line: first 10 lines of the file are here. All the remaining points are right next to the last 8 points here.
Perhaps I'm not understanding some particularity of the input data format? Here's what a sample of that looks like: the format appears to me to be the same as the MNIST example, except that I have negative numbers. The full test set (20MB) is here.
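For completeness, here's the input format as I understand it from the examples: a header line with the number of rows and dimensions, then one space-separated row of floats per point. A minimal sketch of a writer (random data, purely illustrative; the file name is a placeholder):

# write_largevis_input.py -- hypothetical example of producing a LargeVis input file.
import numpy as np

data = np.random.randn(1000, 64)                        # 1000 points, 64 dims, negatives allowed
with open("example_points.txt", "w") as f:
    f.write(f"{data.shape[0]} {data.shape[1]}\n")       # header: "<rows> <dims>"
    for row in data:
        f.write(" ".join(f"{v:.6f}" for v in row) + "\n")   # fixed decimal precision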