haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

ManiFoldLearning-ISOMap; java.lang.ArrayIndexOutOfBoundsException #175

Closed aminaaslam closed 7 years ago

aminaaslam commented 7 years ago

Hi Hai, I am running into this issue while running ISOMap:

[main] INFO smile.manifold.IsoMap - IsoMap: 2 connected components, largest one has 986 samples.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10
    at smile.math.matrix.EigenValueDecomposition.tql2(EigenValueDecomposition.java:1404)
    at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:629)
    at smile.math.matrix.EigenValueDecomposition.decompose(EigenValueDecomposition.java:422)
    at smile.math.Math.eigen(Math.java:4316)
    at smile.manifold.IsoMap.<init>(IsoMap.java:179)
    at com.smile.dimensionality.reduction.IsoMapLearner.learn(IsoMapLearner.java:80)
    at com.smile.dimensionality.reduction.ManifoldLearningFunction.execute(ManifoldLearningFunction.java:85)
    at com.common.ModelingEngine.main(ModelingEngine.java:81)

haifengl commented 7 years ago

Can you share your data? I will debug it. BTW, did you test this with 1.3.0? Have you tried 1.2.3? I made big changes to matrix computation and want to check whether those changes caused this. Thanks!

aminaaslam commented 7 years ago

Hi Hai, I am using Smile version 1.2.3. Do you think I should try 1.3.0?

haifengl commented 7 years ago

I will test it with 1.3.0 (latest version) anyway.

aminaaslam commented 7 years ago

OK, I will test it with the latest version and let you know. Until then I will keep the issue open.

Thanks

haifengl commented 7 years ago

I guess that this is the same data as in ticket 174. Can you first check if you have duplicated samples in your data? Thanks!

aminaaslam commented 7 years ago

Hi Hai, you guessed right. There may be duplicate samples in the data. Are duplicate data instances causing the problem? Does this mean I need to remove them from the data, or does version 1.3.0 work around this problem?

haifengl commented 7 years ago

Duplicated samples will make the distance matrix singular, which causes the issue in ticket 174 for sure. You should remove duplicated samples. There is no workaround for a singular matrix.

This ticket might be caused by the duplicated samples, but I am not sure. Thanks!
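
Removing exact duplicates before calling IsoMap or LLE could look roughly like the sketch below. `DuplicateFilter` and `removeDuplicates` are hypothetical names used for illustration and are not part of Smile. The sketch only catches rows that are identical value for value, so rows that are merely very close are not removed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Smile API.
class DuplicateFilter {
    /**
     * Returns the rows of data with exact duplicates removed.
     * The first occurrence of each row is kept and row order is preserved.
     * Rows are compared with Double.equals semantics, so 0.0 and -0.0
     * count as different values and NaN equals NaN.
     */
    static double[][] removeDuplicates(double[][] data) {
        Set<List<Double>> seen = new HashSet<>();
        List<double[]> unique = new ArrayList<>();
        for (double[] row : data) {
            // Box the row into a List so it can serve as a hash key.
            List<Double> key = new ArrayList<>(row.length);
            for (double v : row) {
                key.add(v);
            }
            if (seen.add(key)) {
                unique.add(row);
            }
        }
        return unique.toArray(new double[0][]);
    }
}
```

The filtered array would then be passed to the learner in place of the raw data.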

aminaaslam commented 7 years ago

Let me remove the duplicated samples and see what happens. I will get back to you. Thanks

aminaaslam commented 7 years ago

Hi Hai, I ran some experiments on data that didn't have duplicate samples, and here are the results. I believe this ticket and ticket 174 are not linked.

ManiFold Learning- LLE

Cards data. none: works fine

Standardize: gives this error

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 402
    at smile.manifold.LLE.<init>(LLE.java:209)
    at com.smile.dimensionality.reduction.LLELearner.learn(LLELearner.java:67)
    at com.smile.dimensionality.reduction.ManifoldLearningFunction.execute(ManifoldLearningFunction.java:85)
    at com.common.ModelingEngine.main(ModelingEngine.java:81)

Normalize: gives this error

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 402
    at smile.manifold.LLE.<init>(LLE.java:209)
    at com.smile.dimensionality.reduction.LLELearner.learn(LLELearner.java:67)
    at com.smile.dimensionality.reduction.ManifoldLearningFunction.execute(ManifoldLearningFunction.java:85)
    at com.common.ModelingEngine.main(ModelingEngine.java:81)

ManiFold Learning- ISOMap

"method" : "isomap.learner", "parameters" : { "d" : 2, "k" : 5, "normMethod" : "normalize", }

"method" : "isomap.learner", "parameters" : { "d" : 2, "k" : 5, "normMethod" : "standardize", }

"method" : "isomap.learner", "parameters" : { "d" : 2, "k" : 5, "normMethod" : "none", }

haifengl commented 7 years ago

Does "d" mean the dimensionality of the input data? If so, k = 5 is probably too big.

haifengl commented 7 years ago

Also, normalization and standardization may not be good ideas for manifold learning. They are mostly for classification.

aminaaslam commented 7 years ago

Yes, d means the dimensions of the data. It can have only two values, 2 or 3. So what would be a good range for k for these dimensions?

haifengl commented 7 years ago

In general k should be less than d. The purpose of manifold learning is to find the intrinsic dimensions, which should be smaller.

aminaaslam commented 7 years ago

Hi Hai, referring to your earlier comment that k should be less than d: then how do I run manifold learning on the MNIST dataset and get these results? http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html#sphx-glr-auto-examples-manifold-plot-lle-digits-py In that experiment, k = 30 and the number of dimensions = 2?

haifengl commented 7 years ago

The dimension of MNIST is 28 X 28 = 784. You are confused with the t-SNE plot.

aminaaslam commented 7 years ago

This is one of the examples in the link:

n_neighbors = 30

# Isomap projection of the digits dataset
print("Computing Isomap embedding")
t0 = time()
X_iso = manifold.Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(X)
print("Done.")
plot_embedding(X_iso,
               "Isomap projection of the digits (time %.2fs)" %
               (time() - t0))

haifengl commented 7 years ago

Sorry, there were miscommunications. I was asking if d is the input dimension in your settings. You said yes. In our API, d is the output dimension.

aminaaslam commented 7 years ago

I am sorry for the miscommunication. This means I can set k greater than the output dimensions of the data. But when I do that, it gives me the error below. Is it because of duplicate data samples? When I set k < d (output dimensions), this error disappears. Can you please explain what is going on?

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 402
    at smile.manifold.LLE.<init>(LLE.java:209)
    at com.smile.dimensionality.reduction.LLELearner.learn(LLELearner.java:67)
    at com.smile.dimensionality.reduction.ManifoldLearningFunction.execute(ManifoldLearningFunction.java:85)

haifengl commented 7 years ago

Duplicates are more likely the issue.

aminaaslam commented 7 years ago

Hi Hai, I made sure there are no duplicates in my data, but with these parameters it gives me this exception: d = 2 (dimensions of output data), k = 3

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See  for further details.
Exception in thread "main" java.lang.RuntimeException: Matrix is singular.
    at smile.math.matrix.LUDecomposition.solve(LUDecomposition.java:254)
    at smile.manifold.LLE.<init>(LLE.java:178)

I have raised a similar ticket: https://github.com/haifengl/smile/issues/174
What do you think is the cause of this issue?

haifengl commented 7 years ago

Can you share the data? I will debug it. Sometimes if two samples are too close, the distance matrix might be singular or near singular, which will cause the problem. Thanks!
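
One way to check for samples that are too close is to compute the smallest pairwise Euclidean distance in the data. The sketch below is a brute-force illustration with hypothetical names (`NearDuplicateCheck`, `minPairwiseDistance`) that are not part of Smile. It costs O(n^2) distance evaluations, so it is only practical for a few thousand rows, and what counts as "too close" is left to the reader.

```java
// Hypothetical helper for illustration; not part of the Smile API.
class NearDuplicateCheck {
    /**
     * Returns the smallest Euclidean distance between any two distinct rows.
     * Returns positive infinity if there are fewer than two rows.
     * All rows are assumed to have the same length.
     */
    static double minPairwiseDistance(double[][] data) {
        double min = Double.POSITIVE_INFINITY;
        for (int i = 0; i < data.length; i++) {
            for (int j = i + 1; j < data.length; j++) {
                double sum = 0.0;
                for (int c = 0; c < data[i].length; c++) {
                    double diff = data[i][c] - data[j][c];
                    sum += diff * diff;
                }
                min = Math.min(min, Math.sqrt(sum));
            }
        }
        return min;
    }
}
```

A result of exactly 0.0 means at least one exact duplicate remains; a value many orders of magnitude below the typical feature scale suggests near-duplicates.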

aminaaslam commented 7 years ago

Here is the data that I am using and the data description file.

aminaaslam commented 7 years ago

iso-card.json.txt

CardOperations-Training.csv.txt.gz

aminaaslam commented 7 years ago

The .json file is the data descriptor; the other file is the data.

Thanks,

haifengl commented 7 years ago

Thanks! Do you have the code snippet too?

aminaaslam commented 7 years ago

For parsing the data?

haifengl commented 7 years ago

For parsing and also the call to LLE. Thanks!

aminaaslam commented 7 years ago

Hai, I am using the univocity parser for parsing the data:

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

I don't know how to share the code with you because it is so interdependent that I would have to share the entire project, and that would be a total waste of your time. This is the best I could do:

double[][] data;
IsoMap isomap = new IsoMap(data, d, k, true);
this.coordinates = isomap.getCoordinates();
this.graph = isomap.getNearestNeighborGraph();

haifengl commented 7 years ago

Thanks! What's your k and d? You already filtered the duplicates in the attached files, right?

aminaaslam commented 7 years ago

k = number of nearest neighbors, d = dimensions of output data. Yes, there are no duplicates in the data. Hai, one more thing: I actually ran IsoMap on a data set with duplicate samples and it worked fine. It's just that there were only 100 records in it. Is the size of the data combined with a higher value of k the cause of this exception?

haifengl commented 7 years ago

Large k is not recommended in general. I know the meaning of k and d :) I was asking their values in your settings that cause the problem.

aminaaslam commented 7 years ago

Sorry, these are the values that I used: "d" : 2, "k" : 3

haifengl commented 7 years ago

Can you serialize the parsed data (the data matrix) into a plain CSV file or Java object file? I am afraid that I will load your data incorrectly, since it is pretty complicated. BTW, IsoMap/LLE uses Euclidean distance, which does not seem appropriate for your data, which is a mix of numeric, nominal and string data.
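
Dumping the matrix to plain CSV needs nothing beyond the JDK. The sketch below uses a hypothetical helper (`MatrixCsv.toCsv`, not part of Smile) and assumes the data is already a purely numeric `double[][]`, so no quoting or escaping is needed.

```java
// Hypothetical helper for illustration; not part of the Smile API.
class MatrixCsv {
    /**
     * Renders a numeric matrix as CSV text: one row per line,
     * values separated by commas, each line terminated by '\n'.
     * Values are written with Double.toString, which round-trips exactly.
     */
    static String toCsv(double[][] data) {
        StringBuilder sb = new StringBuilder();
        for (double[] row : data) {
            for (int j = 0; j < row.length; j++) {
                if (j > 0) {
                    sb.append(',');
                }
                sb.append(row[j]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

The resulting string can be written out with `java.nio.file.Files.write(path, text.getBytes(StandardCharsets.UTF_8))`.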

haifengl commented 7 years ago

You had better first convert nominal values to one-hot encoding.
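
As a minimal sketch of what one-hot encoding means here: each nominal column becomes one 0/1 column per distinct category. The class and method names (`OneHot`, `encode`) and the category labels in the usage note are hypothetical and not taken from the attached data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Smile API.
class OneHot {
    /**
     * Encodes a nominal column as a 0/1 matrix with one column per
     * distinct category, in order of first appearance.
     */
    static double[][] encode(String[] values) {
        // Collect the distinct categories, preserving first-seen order.
        List<String> categories = new ArrayList<>();
        for (String v : values) {
            if (!categories.contains(v)) {
                categories.add(v);
            }
        }
        // Java zero-initializes the array, so only the "hot" cell is set.
        double[][] encoded = new double[values.length][categories.size()];
        for (int i = 0; i < values.length; i++) {
            encoded[i][categories.indexOf(values[i])] = 1.0;
        }
        return encoded;
    }
}
```

For example, a column with values "debit", "credit", "debit" becomes the rows (1, 0), (0, 1), (1, 0). The encoded columns are then concatenated with the numeric features before computing Euclidean distances.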

haifengl commented 7 years ago

Also, don't include the operation id and timestamp in the features. If you have to use the timestamp, better convert it to things like day of month, day of week, etc. Feature engineering is very important in machine learning.
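
A timestamp conversion along those lines might look like the sketch below. It assumes the timestamp is available as epoch milliseconds and is interpreted in UTC; both are assumptions, since the actual timestamp format of the data is not stated here. `TimestampFeatures` is a hypothetical name, not part of Smile.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper for illustration; not part of the Smile API.
class TimestampFeatures {
    /**
     * Converts epoch milliseconds (interpreted in UTC) into
     * { day of month (1-31), day of week (1 = Monday .. 7 = Sunday), hour of day (0-23) }.
     */
    static double[] fromEpochMillis(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return new double[] {
            t.getDayOfMonth(),
            t.getDayOfWeek().getValue(),
            t.getHour()
        };
    }
}
```

Cyclic features such as day of week are often one-hot encoded afterwards rather than used as raw numbers, because Euclidean distance would otherwise treat Sunday (7) and Monday (1) as far apart.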

aminaaslam commented 7 years ago

Here is the parsed data with no string or nominal features (I have done one-hot encoding). Let me know if this is what you wanted. Please find attached the .csv file; this data goes into the IsoMap learner. outdata.csv.gz.zip

aminaaslam commented 7 years ago

I am not using any string features, and I am converting nominal features. Please let me know if this makes sense.

haifengl commented 7 years ago

Thanks! Sounds good. I will try it tonight.

aminaaslam commented 7 years ago

outdata.csv.gz

aminaaslam commented 7 years ago

This is a .gz file. You should be able to open it. Thanks!

aminaaslam commented 7 years ago

Hi Hai, did you get a chance to look at the file that I sent you? Thanks!

haifengl commented 7 years ago

Got OOM error on a small machine last night. Will try it on a bigger machine.
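
A rough back-of-the-envelope for the memory use, assuming the computation keeps a dense n x n matrix of doubles (as a full pairwise or geodesic distance matrix would need). This is an assumption about the internals rather than a statement of what the code does, and the row counts in the usage note are illustrative only.

```java
// Hypothetical helper for illustration; not part of the Smile API.
class DistanceMatrixMemory {
    /** Bytes needed for one dense n x n matrix of doubles (8 bytes each), excluding JVM overhead. */
    static long bytesForDenseMatrix(long n) {
        return n * n * Double.BYTES;
    }
}
```

With n = 3,000 this gives 72,000,000 bytes (about 72 MB), while n = 100,000 gives 80,000,000,000 bytes (about 80 GB). Memory grows quadratically in the number of samples regardless of k.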

aminaaslam commented 7 years ago

Hi Hai, I have good news: I have a smaller data set with 3000 records, and ISOMap works on it with k as large as 50. I am not able to test the data set that I attached here because I run into OOM even on my biggest machine and with k=3. However, LLE does not work even with k=3 on my smaller data set and throws the exception that I mentioned in ticket 174:

Exception in thread "main" java.lang.RuntimeException: Matrix is singular.
    at smile.math.matrix.LUDecomposition.solve(LUDecomposition.java:254)
    at smile.manifold.LLE.<init>(LLE.java:178)

I hope I am not confusing you any further. Thanks!

haifengl commented 7 years ago

I see the exception in IsoMap, which happens during the eigen value decomposition. I will try to figure out what's wrong. It may take some time.

aminaaslam commented 7 years ago

Thanks, that will be very helpful. Amina

aminaaslam commented 7 years ago

Hi Hai, did you get a chance to look at what is going on with the eigenvalue decomposition? Please let me know when you get a chance to look at this issue. Thanks, Amina

haifengl commented 7 years ago

Likely it is a numerical stability issue. That is also why you have fewer problems with smaller data. In the long term, we should use something like BLAS and LAPACK, which are more numerically stable and also faster. I am looking into how to do it, but it will take a lot of time.