lferry007 / LargeVis

Apache License 2.0

Dimensionality reduction clumps all but one point together #8

Closed bmschmidt closed 8 years ago

bmschmidt commented 8 years ago

I'm able to reduce the MNIST data as in the description, but for my own data the program produces nonsense. Any leads appreciated. Running on OS X.

The output of LargeVis is one outlier point, with all other points in a very tightly clumped diagonal line. The first 10 lines of the output file are below; all the remaining points sit right next to the last 8 points shown.

10000 2
-57.236931 0.250471
1.739228 0.140431
1.739322 0.140431
1.739219 0.140431
1.739558 0.140430
1.739269 0.140430
1.739119 0.140431
1.739207 0.140431
1.739546 0.140430

Perhaps I'm not understanding some particularity of the input data format? Here's what a sample of it looks like: the format appears to me to be the same as the MNIST input, except that I have negative numbers. The full test set (20MB) is here.

➜  Linux git:(master) ✗ head ../as_text.txt| cut -c 1-140                            
10000 640
0.068507 0.088455 0.004352 0.062336 -0.008105 -0.065166 0.005332 -0.004465 0.009418 0.053710 0.021793 0.002761 -0.045826 0.047004 -0.021048 
0.030815 0.061551 0.055325 0.014904 0.009537 0.003453 -0.041773 0.070575 0.004215 0.034589 0.026759 0.009715 -0.037361 0.003642 -0.062977 -0
0.044672 0.028437 0.024890 -0.025580 -0.002071 -0.013081 -0.038324 0.007230 0.024878 -0.006843 -0.022699 -0.018267 -0.048828 0.053914 -0.038
-0.003441 0.067980 0.047075 -0.006172 -0.017513 0.022899 0.013291 0.032307 -0.071118 -0.007152 0.019992 -0.019428 -0.069072 0.058524 0.01285
-0.010718 -0.002089 -0.008822 -0.035114 -0.066692 0.038011 -0.019087 0.011121 -0.029621 -0.024403 -0.052654 0.047402 0.006711 -0.064290 -0.0
0.052364 -0.007353 0.006950 0.039280 0.018387 -0.083283 -0.038789 0.022860 -0.029142 0.029422 0.011834 0.073171 -0.025516 0.064107 -0.001747
-0.029875 0.070031 -0.011460 -0.003957 0.025676 0.002881 0.041085 0.009806 0.015105 -0.051295 -0.029721 -0.003456 -0.072049 0.012853 0.05745
0.007060 0.103973 0.024584 0.031729 -0.031754 -0.024805 0.051161 0.042864 -0.021417 0.027601 0.017241 -0.017261 -0.043754 0.008115 -0.017126
-0.055455 -0.063698 0.063268 0.012776 0.005479 -0.033595 -0.063750 0.038983 -0.025671 -0.002447 0.044772 -0.005042 -0.047169 0.030342 0.0006
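
For reference, here is a minimal sketch of writing such a file, assuming the dense format shown above (a header line "n_rows n_dims", then one row of space-separated floats per line); the filename and values are purely illustrative.

import numpy as np

def write_largevis_input(path, data):
    """Write a 2-D array in the dense text format shown above:
    header line "n_rows n_dims", then one space-separated row per line."""
    data = np.asarray(data, dtype=np.float64)
    n, d = data.shape
    with open(path, "w", newline="\n") as f:  # force Unix line endings
        f.write(f"{n} {d}\n")
        for row in data:
            f.write(" ".join(f"{x:.6f}" for x in row) + "\n")

# illustrative: 1,000 rows of 64-dimensional vectors
write_largevis_input("example_input.txt", np.random.randn(1000, 64) * 0.05)
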
sparktsao commented 8 years ago

Same here: an outlier shows up in the first row of the result. Even if I remove the first row of the dataset, the outlier still appears in the first row of the newly generated 2D data...

FuriouslyCurious commented 8 years ago

Sounds like a line-encoding issue.

@lferry007 used a Linux box for development, so the code may not be accounting for Mac OS line-ending differences. It should be a simple fix.
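
If line endings are the culprit, a quick test is to normalize the file to Unix newlines before running LargeVis; dos2unix or tr -d '\r' does the same thing as this small sketch (the script name is hypothetical):

# normalize_newlines.py (hypothetical helper): rewrite a file with Unix (\n)
# line endings only, converting \r\n and bare \r.
import sys

src, dst = sys.argv[1], sys.argv[2]
with open(src, "rb") as f:
    data = f.read()
with open(dst, "wb") as f:
    f.write(data.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))
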

bmschmidt commented 8 years ago

Confirmed that something OS-related seems likely. I just ran the same file on my Linux machine, and it works there. What would the fix be?

asxzy commented 8 years ago

@bmschmidt Try to reduce the sample size. Works for me in some cases.

bmschmidt commented 8 years ago

@asxzy: Yeah, eliminating most of the points does make it run; 100K and 25K points fail, but with just 5K points the program runs with normal-looking results on OS X. (The real data here is 1 to 5 million points--any smaller and t-SNE is fine for my purposes.)

Hard to see where the point of failure is, though. A single run takes too long for it to be practical to find out whether it's an individual row or a threshold size that breaks it.


EDIT: Ah, I didn't understand that you meant the 'samples' argument. Entering low values there does not eliminate the problem for me. The test knn accuracy, though, is reported as 99.81%.
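
One way to narrow a threshold like this down is a helper (hypothetical, not part of LargeVis) that writes truncated copies of the input at several row counts while keeping the header consistent; the sizes below are arbitrary:

# Write truncated copies of a LargeVis-style input file at several row counts,
# rewriting the "n_rows n_dims" header so it stays consistent with the data.
def truncate_largevis_input(path, sizes):
    with open(path) as f:
        dims = f.readline().split()[1]
        rows = f.readlines()
    for n in sizes:
        n = min(n, len(rows))  # never claim more rows than the file has
        with open(f"{path}.{n}rows", "w", newline="\n") as out:
            out.write(f"{n} {dims}\n")
            out.writelines(rows[:n])

truncate_largevis_input("as_text.txt", [2500, 5000, 7500, 10000])
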

tjrileywisc commented 8 years ago

Seeing similar behavior on a Windows OS.

@asxzy is this the -samples parameter? Just confirming how it is supposed to be set up: if I have 1 million records, is -samples set to 1e6/1e8 by default? I've been trying this without luck, so I may have misunderstood something.

I've also noticed that the test knn accuracy reports 0% when I get this result.

tangjianpku commented 8 years ago

@bmschmidt The link to your data set no longer works. Could you provide a new one? BTW, we've fixed some bugs, so you may want to rerun the code on your data set. If there's still a problem, please let us know.

Thanks, Jian https://sites.google.com/site/pkujiantang/home

bmschmidt commented 8 years ago

I just pulled and retested, and I'm still (unfortunately) experiencing this issue. I've restored a (smaller) failing sample file under OS X to that link.

tangjianpku commented 8 years ago

We checked the data: it's a problem with the data format. The first line indicates there are 1000 lines of data; we changed it to 99, and the visualization looks OK now. To generate meaningful visualizations, you may need to upload the entire data set.
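
A quick sketch (not part of the LargeVis tooling) for catching this kind of header mismatch before running; the filename is illustrative:

# Check that the "n_rows n_dims" header of an input file matches the data.
def check_header(path):
    with open(path) as f:
        n_declared, d_declared = map(int, f.readline().split())
        n_actual = bad_rows = 0
        for line in f:
            n_actual += 1
            if len(line.split()) != d_declared:
                bad_rows += 1
    print(f"header says {n_declared} x {d_declared}; "
          f"file has {n_actual} rows, {bad_rows} with a different column count")

check_header("failing_largevis.txt")
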

Jian

bmschmidt commented 8 years ago

Oh no, in my rush to replace the online file I uploaded a truncated version. This was not the original problem. I'm terribly sorry to have wasted your time on that file, and wouldn't blame you for giving up on me now.

If not: I've replaced that link with the full 10,000-row set, and it still fails. Pasted below is the console output showing the different behavior under Ubuntu 14.04 and Mac OS X. As you can see, an input file with the same SHA1 hash produces good results on Ubuntu but bad ones on OS X.

Succeeding, Ubuntu 14.04

bschmidt@sibelius:~/LargeVis/Linux$ openssl sha1 failing_largevis.txt           
SHA1(failing_largevis.txt)= a4b3b7893a5b8f4ef29d82901ae3cb5f654a9abd
bschmidt@sibelius:~/LargeVis/Linux$ wc -l failing_largevis.txt 
10001 failing_largevis.txt
bschmidt@sibelius:~/LargeVis/Linux$ head -1 failing_largevis.txt 
10000 640
bschmidt@sibelius:~/LargeVis/Linux$ ./LargeVis -input failing_largevis.txt -output t.txt -outdim 2 -samples 100
Reading input file failing_largevis.txt ...... Done.
Total vertices : 10000  Dimension : 640
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 99.81%
Computing similarities ...... Done.
Fitting model   Alpha: 0.000700 Progress: 99.930%

bschmidt@sibelius:~/LargeVis/Linux$ head t.txt 
10000 2
-14.740220 -5.766308
-1.294758 3.144548
-4.635156 9.079961
-9.019074 -10.023488
16.997780 25.454086
-25.735485 -14.697220
-3.328367 -14.550834
-4.309145 -9.262841
-37.779449 6.267919

Failing, OS X

-bash-3.2$ openssl sha1 failing_largevis.txt
SHA1(failing_largevis.txt)= a4b3b7893a5b8f4ef29d82901ae3cb5f654a9abd
-bash-3.2$ wc -l failing_largevis.txt
   10001 failing_largevis.txt
-bash-3.2$ ./LargeVis -input failing_largevis.txt -output t.txt -outdim 2 -samples 100
Reading input file failing_largevis.txt ...... Done.
Total vertices : 10000  Dimension : 640
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 99.81%
Computing similarities ...... Done.
Fitting model   Alpha: 0.000700 Progress: 99.930%

bash-3.2$ head t.txt 
10000 2
-76.238503 -7.682394
0.147972 0.035222
0.147989 0.035224
0.147941 0.035219
0.148011 0.035226
0.147880 0.035213
0.147879 0.035213
0.147913 0.035216
0.147693 0.035194
spamcatcher345 commented 8 years ago

I have the same issue using Ubuntu 15.10. Has anyone solved this yet?

After running LargeVis on my dataset, the first point is orders of magnitude larger than the remaining points after dimensionality reduction, resulting in a meaningless plot.

root@blah-VirtualBox:/home/blah/Desktop/LargeVis/20161012# ./LargeVis -input 1k_points.txt -output 1k_2d.txt
Reading input file 1k_points.txt ...... Done.
Total vertices : 1000  Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 95.98%
Computing similarities ...... Done.
Fitting model   Alpha: 0.000100 Progress: 99.993%
root@blah-VirtualBox:/home/blah/Desktop/LargeVis/20161012#

root@blah-VirtualBox:/home/blah/Desktop/LargeVis/20161012# head 1k_2d.txt
1000 2
-31.457289 -0.287726
12.466423 -0.287530
12.466411 -0.287530
12.466626 -0.287530
12.466501 -0.287530
12.466530 -0.287530
12.466509 -0.287530
12.466496 -0.287530
12.466705 -0.287530

Here is a link to the input data: https://www.dropbox.com/s/bvup56przujg52d/1k_points.txt?dl=0

And a link to the Largevis output: https://www.dropbox.com/s/jk2p0qof2sn7hr9/1k_2d.txt?dl=0

Any guidance would be greatly appreciated. Thank you!
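
A heuristic sketch for flagging this "one outlier plus a tight clump" pattern in a 2-D output file; the radius and fraction thresholds here are arbitrary assumptions, and the filename is illustrative:

import numpy as np

# Flag the degenerate "one outlier + tight clump" pattern in 2-D output.
def looks_degenerate(path, clump_radius=1.0):
    coords = np.loadtxt(path, skiprows=1)          # skip the "n 2" header line
    dist = np.linalg.norm(coords - np.median(coords, axis=0), axis=1)
    clumped_fraction = np.mean(dist < clump_radius)
    print(f"{clumped_fraction:.1%} of points within {clump_radius} of the median; "
          f"farthest point at {dist.max():.1f}")
    return clumped_fraction > 0.99 and dist.max() > 10 * clump_radius

print(looks_degenerate("1k_2d.txt"))
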

elbamos commented 8 years ago

Guys - @spamcatcher345 posted his data on my git, so I took a look and tried it out with my implementation. In the first place, the data that was uploaded doesn't have 64 dimensions; it has 58. I don't know whether that would affect the output with @lferry007's implementation. Also, the data has 18 duplicates.

When I run my implementation of largeVis on it, what I get looks Gaussian.

How was this data generated?
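
A small sketch of the kind of check that catches both issues (per-row column counts and exact duplicate rows); this is a generic snippet, not @elbamos's code, and the filename is illustrative:

from collections import Counter

# Report actual columns per row and exact duplicate rows in an input file.
with open("1k_points.txt") as f:
    header = f.readline().strip()
    rows = [line.strip() for line in f if line.strip()]

column_counts = Counter(len(r.split()) for r in rows)
duplicates = sum(c - 1 for c in Counter(rows).values() if c > 1)
print("header:", header)
print("columns per row:", dict(column_counts))
print("duplicate rows:", duplicates)
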

bmschmidt commented 8 years ago

Using this data on Ubuntu 14.04 and @lferry007's implementation, I find that

  1. I do get the problem when using the data as currently displayed
  2. I get a good embedding (no outlier) when I change the header row to reflect it having 58 dimensions.

On OS X, it also works when properly labeled.

elbamos commented 8 years ago

What do you mean by a "good" embedding?

bmschmidt commented 8 years ago

By "good," I mean it does not have the first point as an extreme outlier with no structure in all later points; instead, it looks like T-SNE does when run on data that has some internal structure. So validating by looking at it: here's thje plot. image

spamcatcher345 commented 8 years ago

A mighty "slap" rang out across the land. ./facepalm

Somewhere along the way, while moving files around VMs/hosts/networks, I somehow managed to drop the last 6 columns when creating the 1k sample data... Testing again with 58 dimensions worked in my Ubuntu 15.10 VM, compiled with the current version of GSL. Once it worked with the 1k sample data, I went back to the original data set (which DOES have 64 dimensions) and uncovered a few other self-induced issues caused by the formatting/ETL I was doing. Happy to report success with 20k rows of input data. Next step is scaling up to 1-10M+ rows.

A few notes I've made along the way, in case it may help someone else in the future:

--Individual vectors must have digits of precision, e.g. zeros (0) must be represented as 0.000000.
--Execution will fail (core dump) if you have MORE than the actual number of dimensions defined in the first line of the input file.
--Execution will not fail if you have FEWER than the actual number of dimensions defined in the first line of the input file (up to a certain point: when I define 5 dimensions on a set with 64, it fails; when I define 25 dimensions on a set with 64, it succeeds. Not sure what this does to the end result, but it can't be "right").
^^^Suggested bug fix/enhancement here: either better error checking and messages or, better yet, parse the file for the correct number of rows and dimensions as part of execution.
--Generally speaking, there is a need for descriptive error handling. Any number of issues with the data, the config, or otherwise result in the same screen output: Segmentation fault (core dumped). Advice for others: be absolutely meticulous.

Thanks again @elbamos , @bmschmidt and @lferry007 ! This has gotten me past the bottleneck of t-SNE for this project. Oh, @elbamos - this is being generated from spam email data, labeled by k-means.

spamcatcher345 commented 8 years ago

Hello again @bmschmidt, @lferry007, @elbamos & all,

Could someone please test this sample data with their own installation of LargeVis and let me know if it works for you?

https://www.dropbox.com/s/fky8qec7mf3elfe/50k_500c_1000i_points.zip?dl=0

I've been going rounds with this, and no luck so far. At some indeterminate point, dependent on the data set, the dimensionality reduction fails and reverts to the symptoms above.

TL;DR below, so you know I've done my homework before asking...

Thanks again in advance.

..................................................

--make sure the number of rows specified in the infile header row exactly matches the number of rows in the file
wc -l 50k_500c_1000i_points.txt
50000 50k_500c_1000i_points.txt

--make sure all rows have the same number of features
while read i; do echo $i | tr ' ' '\n' | wc -l; done < 50k_500c_1000i_points.txt | sort | uniq -c | sort -nr
50000 64

--make sure all values are between 0-1, no funny chars, text, etc.
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# cat 50k_500c_1000i_points.txt | tr ' ' '\n' | sort | head
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
0.00000000
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# cat 50k_500c_1000i_points.txt | tr ' ' '\n' | sort | tail
1.00000000
1.00000000
1.00000000
1.00000000
1.00000000
1.00000000
1.00000000
1.00000000

--Run Largevis on all 50k rows (results in garbage out)

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 50k_500c_1000i_points.txt -output 50k_2d -samples 10
Reading input file 50k_500c_1000i_points.txt ...... Done.
Total vertices : 50000  Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 94.87%
Computing similarities ...... Done.
Fitting model   Alpha: 0.007901 Progress: 99.210%

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 50k_2d
50000 2
-93.259697 92.923935
0.004132 -0.007910
0.005585 -0.003991
0.005309 -0.004095
0.005638 -0.003910
0.003409 -0.001608
0.005987 -0.003725
0.001725 -0.003395
0.002095 -0.003146

--So, let's try a sample set of the first 10k rows from the same file

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head -10000 50k_500c_1000i_points.txt > 10k_500c_1000i_points.txt
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ll

-rwxrwx--- 1 root vboxsf  7050000 Oct 26 16:36 10k_500c_1000i_points.txt
-rwxrwx--- 1 root vboxsf 35250000 Oct 19 12:49 50k_500c_1000i_points.txt

--Run Largevis on the first 10k rows (results in success)

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 10k_500c_1000i_points.txt -output 10k_2d -samples 10
Reading input file 10k_500c_1000i_points.txt ...... Done.
Total vertices : 10000  Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 97.99%
Computing similarities ...... Done.
Fitting model   Alpha: 0.007901 Progress: 99.210%

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 10k_2d
10000 2
11.317126 -16.769339
1.494954 -14.870223
5.892348 -8.776334
-15.464440 1.080700
-4.194473 13.608239
-4.832858 -10.284916
18.223331 0.966998
13.740998 -1.130792
-10.511693 12.002342

--Okay, let's try 25k rows now (results in garbage out)

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 25k_500c_100i_points.txt -output 25k_2d -samples 10
Reading input file 25k_500c_100i_points.txt ...... Done.
Total vertices : 25000  Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 97.83%
Computing similarities ...... Done.
Fitting model   Alpha: 0.007901 Progress: 99.210%

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 25k_2d
25000 2
-97.675056 87.540543
-0.002500 -0.015151
0.006426 -0.008289
0.006369 -0.008438
0.006459 -0.008278
0.010856 -0.011586
0.006840 -0.008010
0.007054 -0.009687
0.006959 -0.009642

Try with 12k, 15k, 17k, 18k, 19k, and so forth...

It doesn't appear to be caused by the number of rows or the number of dimensions. I've run the MNIST data successfully, as well as randomly generated data (100 dimensions x 50,000 rows) with both integers and floats. I've tried changing the digits of precision from 2 to 8 with no effect.

What I'm struggling with is that the cutoff point at which it starts producing garbage depends on the data set. I've had successful runs of 20k rows from a different dataset, but this one stops working after 17998 rows:

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head -17999 50k_500c_1000i_points.txt > 17998_500c_1000i_points.txt
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# vi 17998_500c_1000i_points.txt
root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 17998_500c_1000i_points.txt -output 17998_2d -samples 10
Reading input file 17998_500c_1000i_points.txt ...... Done.
Total vertices : 17998  Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 95.96%
Computing similarities ...... Done.
Fitting model   Alpha: 0.007901 Progress: 99.210%

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 17998_2d
17998 2
-4.273000 -5.816042
-9.945149 12.071325
6.342212 6.314149
5.711284 -5.599425
-14.465376 -2.408209
-1.881844 -0.892462
-7.527262 12.491118
2.002883 -5.556159
13.538328 6.924161

It core dumps on 17999 rows:

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# ./LargeVis -input 17999_500c_1000i_points.txt -output 17999_2d -samples 10
Reading input file 17999_500c_1000i_points.txt ...... Done.
Total vertices : 17999  Dimension : 64
Normalizing ...... Done.
Running ANNOY ......Segmentation fault (core dumped)

And at 18000 rows, it's back to garbage out:

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/test# head 18k_2d
18000 2
-123.057167 -62.150906
0.013629 0.007195
0.010898 0.006385
0.010933 0.006196
0.010904 0.006396
0.004929 0.011840
0.010916 0.006575
0.011646 0.003596
0.011509 0.003789

elbamos commented 8 years ago

If it doesn't visualize with the reference implementation, it's exceedingly unlikely to work with my implementation. If you'd like to give it a try, it should take about 10 minutes to write the code to read in this data and another 15 to get results.

lferry007 commented 8 years ago

Hi all, we've updated the code, so you can give it a try. If there is still a problem, feel free to contact us. Thanks!

elbamos commented 8 years ago

@lferry007 I just went through the diff - am I correct that there are no changes to the implementation of the algorithm after the neighbor search step? Thanks

spamcatcher345 commented 8 years ago

I can confirm from my side that the same file produces meaningful output after updating.

Awesome!

Thanks very much @lferry007 !!

=====================UPDATE 20161102 ==============================

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# md5sum 50k_500c_1000i_points.txt
32f71633ef2a83764258c0378e9906b5  50k_500c_1000i_points.txt

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# ./LargeVis -input 50k_500c_1000i_points.txt -output 50k_2d -samples 10
Reading input file 50k_500c_1000i_points.txt ...... Done.
Total vertices : 50000  Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 94.87%
Computing similarities ...... Done.
Fitting model   Alpha: 0.007901 Progress: 99.210%

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# head 50k_2d
50000 2
-126.974144 31.734627
0.011018 0.004690
0.004896 0.003125
0.004526 0.002717
0.005311 0.003130
0.003361 0.000967
0.006694 0.003260
0.003402 -0.003218
0.005849 -0.003446

================Updated to new version=================

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# md5sum 50k_500c_1000i_points.txt
32f71633ef2a83764258c0378e9906b5  50k_500c_1000i_points.txt

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# ./Largevis -input 50k_500c_1000i_points.txt -output 50k_2d -samples 10
Reading input file 50k_500c_1000i_points.txt ...... Done.
Total vertices : 50000  Dimension : 64
Normalizing ...... Done.
Running ANNOY ...... Done.
Running propagation 3/3
Test knn accuracy : 99.91%
Computing similarities ...... Done.
Fitting model   Alpha: 0.007901 Progress: 99.210%

root@blah-VirtualBox:/media/sf_Share/50k_points_master_data/For_help# head 50k_2d
50000 2
1.729691 -6.422237
-10.387295 1.105057
-10.262017 4.892396
-8.165269 7.217187
-1.564149 -9.366737
7.682181 -4.121569
-4.760568 13.656388
-4.644851 -8.798579
2.579137 10.852392

lferry007 commented 8 years ago

@spamcatcher345 Glad to know it works! Thanks!

Jian

bmschmidt commented 8 years ago

This fixes the outlier problem on my previously-broken test data as well. Thanks!