Closed dongjoon-hyun closed 9 years ago
Sure, I was planning to do a run on all of the Dolphin PRs today. Sorry for the delay.
2015년 6월 17일 수요일, Dongjoon Hyunnotifications@github.com님이 작성한 메시지:
@jsjason https://github.com/jsjason Could you review this PR? :)
— Reply to this email directly or view it on GitHub https://github.com/cmssnu/dolphin/pull/47#issuecomment-112492414.
I ran into a bug. I used this dataset, which was computed from the figure on wikipedia:
1
2 3
3 2
4 1 2
5 2 4 6
6 2 5
7 2 5
8 2 5
9 2 5
10 5
11 5
The nodes 7, 8, 9, 10, 11 are not included in any outList, and thus are not included in the broadcasted data Map<Integer,Double> rank
. This leads to a NullPointerException
in PageRankCmpTask:83, where the CmpTask tries to lookup nodes 7~11 in rank
. Could you kindly apply a fix?
I've completed my pass. I'll do another as soon as you address my comments.
Thank you. I'm going to fix all today in SNU. :)
Hi, @jsjason . I changed the code according to the first review. Could you check this again?
I get a ParseException when there are blank lines in the input file, e.g.
1
2 3
3 2
4 1 2
5 2 4 6
6 2 5
7 2 5
8 2 5
9 2 5
10 5
11 5
I think this is because Java's String.split()
treats "".split(\\s+)
as a non-empty array.
I did another pass. The whole mechanism seems good, although I think it's worth having a discussion on following either this algorithm's single-stage style, or the clustering algorithm's double-stage style. @dongjoon-hyun, could you kindly review my new comments?
If Dolphin's policy is a double-stage style, I have no objection.
By the way, IMHO, current Dolphin needs too many files. For this simple PageRank example, I made 13 files. For double-stage style, it will be 15~20? I guess Dolphin needs another wrapper classes or DSL in the future. Otherwise, it looks definitely more complicated than MR style.
Anyway, I changed the codes and LICENSE like the following.
- * Copyright (C) 2015 SK Telecom
+ * Copyright (C) 2015 Seoul National University
Thank you for your kind reviews, @jsjason .
I agree with the fact that a simple example requires more than 10 files is pretty inconvenient. Although we have a DSL issue open right now, it isn't something we can do in just a few days; we need to somehow reduce the number of new files needed for a algorithm without the help of a DSL, quickly. I'll add a separate issue.
I don't think a single-stage style is inappropriate. I was just concerned that the existing algorithm style is different from the style this PR follows. I don't have any objections on merging this PR, though; we can discuss the stage issue on a separate issue or PR, then refactor one style into the other later.
This looks good. I'll merge it.
Before I merge it, @kijungshin also volunteered to do a review.
@jsjason @dongjoon-hyun I reviewed this PR. It seems good but please review the issue regarding receiveReduceData
method of PageRankCtrlTask
class before merging it.
I found that the name of package is pagerank
. However its level is different from that of classification
, clustering
and regression
. Personally, ranking
or graph
sounds good.
This code assumes the 'Linked List' as an input format, which is rarely used to store large-scale graphs though. It would be great to consider 'Edge List' format too. To handle the format, it seems that outgoing degree of each node should be computed (by reduce) and broadcast to compute tasks.
What I mean by outgoing degree is the number of outgoing neighbors (not a list of them)
I agree with you that Edge List is better, but, IMHO, in PageRank domain, Adjacency List isn't rarely used format. (not Linked List) I actually assumed a real application extracting outgoing href links from a url or a filename (one html file), so I used Adjacency List. In addition, the point you mentioned is exactly the weak point in `Edge List'. It needs more code. In other words, it's a little bit more complex to be used as a simple case. Why don't you keep this as a intern job? :-)
I changed the code, @kijungshin . Thank you for your kind advice. One more thing, do you think Neural Network related class into graph, too?
Don't worry about the package name; I'll create a separate issue for managing package names.
Oh, @dongjoon-hyun already made the change. Nice.
Thank you for the information!! I didn't know the fact. Supporting Edge List seems a good practice for new interns.
On Thu, Jun 25, 2015 at 1:26 AM, Dongjoon Hyun notifications@github.com wrote:
I agree with you that Edge List is better, but, IMHO, in PageRank domain, Adjacency List isn't rarely used format. (not Linked List) I actually assumed a real application extracting outgoing href links from a url or a filename (one html file), so I used Adjacency List. In addition, the point you mentioned is exactly the weak point in `Edge List'. It needs more code. In other words, it's a little bit more complex to be used as a simple case. Why don't you keep this as a intern job? :-)
— Reply to this email directly or view it on GitHub https://github.com/cmssnu/dolphin/pull/47#issuecomment-114932059.
@dongjoon-hyun. Thank you for listening to my advice. I thought Neural Network related classes are classified into 'classification'. Other graph mining algorithms (e.g., connected component and shortest path) can be classified into 'graph'.
@kijungshin , I see. I will use classification for Neural Network. By the way, any more comments for this PR? :)
That's all. Thank you for taking the my comments into consideration!
Hi, @jsjason . Could you merge this PR? I think it's ready to finish review process.
I tested this on both local and YARN runtime, and it works with the provided sample dataset. I'll merge it.
Thank you!
@jsjason Could you review this PR? :)