joandre / MCL_spark

An implementation of Markov Clustering algorithm for Spark in Scala
MIT License
34 stars 13 forks source link

Write an element to an index ontside vector #2

Closed cperezmig closed 8 years ago

cperezmig commented 8 years ago

I found this error while using MCL_spark over a directed graph these features: scala> nodes.count res: Long = 6021 scala> nodes.take(2) res: Array[(Long, String)] = Array((0, A), (1,B))

scala> relationships.count res: Long = 37024 scala> relationships.take(2) res: Array[org.apache.spark.graphx.Edge[Double]] = Array(Edge(0,3582,1.0), Edge(0,713,1.0))

scala> val graph = Graph(nodes, relationships).cache() graph: org.apache.spark.graphx.Graph[String,Double] = org.apache.spark.graphx.impl.GraphImpl@

scala> val clusters: RDD[Assignment] = new MCL().setExpansionRate(2).run(graph).assignments Error: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 6351 because the declared size of your vector is 6021

I though that maybe the problem was in my data, but using the original implementation of Stijn van Dongen I have obtained 509 clusters. Any idea of the origin the error?

joandre commented 8 years ago

Hi! First, thanks for your message. I've just committed to correct some typos. Let me know if it works now.

Otherwise, can you upload your dataset as .txt file in a new comment, so I can check specifically what's going wrong?

joandre commented 8 years ago

After a few tests, I reproduced the same error than yours. Give me some time to investigate the problem.

cperezmig commented 8 years ago

Hi Joandre, thanks for answering so fast. I made more test with other data and still obtaining the same error. If you need more datasets, I have attached one graph with this comment. Is a graph with the internal links from a website.

edges.txt nodes.txt

joandre commented 8 years ago

Hi cperezmig,

I found the error. It was due to adjacency matrix creation function. I used nodes id as column indices. Most of the time, those ids are not related to the number of nodes in the graph.

In your example, while there are 592 nodes, last node id is 1841, so when you try to allocate a value to the 1841th column, it gives you "you may not write an element to index 1841 because the declared size of your vector is 592". My test graph were biased because there ids matched perfectly with the number of nodes.

I have created a lookup table, so each node id is associated with a unique correct column index.

Next commit will correct that issue, but for the moment, results of my algorithm on your dataset does not match the one of classical MCL implementation of Stijn van Dongen.

Let me know if it's ok to close this issue. It will probably be great to open a new one for the last problem.

cperezmig commented 8 years ago

Hi Joandre,

Sorry for the late of my answer. I confirm you that the problem seems to be fixed but that the results are different from the classical MCL implementation. For my part, you can close the issue. Let me know if you advance with the latter problem.

Thanks

Carlos Pérez Miguel

2015-12-30 18:28 GMT+01:00 joandre notifications@github.com:

Hi cperezmig,

I found the error. It was due to adjacency matrix creation function. I used nodes id as column indices. Most of the time, those ids are not related to the number of nodes in the graph.

In your example, while there are 592 nodes, last node id is 1841, so when you try to allocate a value to the 1841th column, it gives you "you may not write an element to index 1841 because the declared size of your vector is 592". My test graph were biased because there ids matched perfectly with the number of nodes.

I have created a lookup table, so each node id is associated with a unique correct column index.

Next commit will correct that issue, but for the moment, results of my algorithm on your dataset does not match the one of classical MCL implementation of Stijn van Dongen.

Let me know if it's ok to close this issue. It will probably be great to open a new one for the last problem.

— Reply to this email directly or view it on GitHub https://github.com/joandre/MCL_spark/issues/2#issuecomment-168040140.

joandre commented 8 years ago

Hi Carlos,

The problem is solved. It remains differences for your dataset, but from integration tests I have made, my implementation is closed to the original one. Main differences are in sparseness management of stochastic matrices. Stijn van Dongen uses much complex approaches than I do but I keep a simple method for now.

Joan

cperezmig commented 8 years ago

Hi Joan,

Perfect, for me the problem is solved. Thanks

Carlos