snuspl / dolphin

14 stars 2 forks source link

[DOLPHIN-31] Add PageRank example #47

Closed dongjoon-hyun closed 9 years ago

dongjoon-hyun commented 9 years ago

@jsjason Could you review this PR? :)

jsjason commented 9 years ago

Sure, I was planning to do a run on all of the Dolphin PRs today. Sorry for the delay.

2015년 6월 17일 수요일, Dongjoon Hyunnotifications@github.com님이 작성한 메시지:

@jsjason https://github.com/jsjason Could you review this PR? :)

— Reply to this email directly or view it on GitHub https://github.com/cmssnu/dolphin/pull/47#issuecomment-112492414.

jsjason commented 9 years ago

I ran into a bug. I used this dataset, which was computed from the figure on wikipedia:

1
2       3
3       2
4       1 2
5       2 4 6
6       2 5
7       2 5
8       2 5
9       2 5
10      5
11      5

The nodes 7, 8, 9, 10, 11 are not included in any outList, and thus are not included in the broadcasted data Map<Integer,Double> rank. This leads to a NullPointerException in PageRankCmpTask:83, where the CmpTask tries to lookup nodes 7~11 in rank. Could you kindly apply a fix?

jsjason commented 9 years ago

I've completed my pass. I'll do another as soon as you address my comments.

dongjoon-hyun commented 9 years ago

Thank you. I'm going to fix all today in SNU. :)

dongjoon-hyun commented 9 years ago

Hi, @jsjason . I changed the code according to the first review. Could you check this again?

jsjason commented 9 years ago

I get a ParseException when there are blank lines in the input file, e.g.

1
2       3
3       2
4       1 2
5       2 4 6
6       2 5
7       2 5
8       2 5
9       2 5

10      5
11      5

I think this is because Java's String.split() treats "".split(\\s+) as a non-empty array.

jsjason commented 9 years ago

I did another pass. The whole mechanism seems good, although I think it's worth having a discussion on following either this algorithm's single-stage style, or the clustering algorithm's double-stage style. @dongjoon-hyun, could you kindly review my new comments?

dongjoon-hyun commented 9 years ago

If Dolphin's policy is a double-stage style, I have no objection.

By the way, IMHO, current Dolphin needs too many files. For this simple PageRank example, I made 13 files. For double-stage style, it will be 15~20? I guess Dolphin needs another wrapper classes or DSL in the future. Otherwise, it looks definitely more complicated than MR style.

dongjoon-hyun commented 9 years ago

Anyway, I changed the codes and LICENSE like the following.

- * Copyright (C) 2015 SK Telecom
+ * Copyright (C) 2015 Seoul National University

Thank you for your kind reviews, @jsjason .

jsjason commented 9 years ago

I agree with the fact that a simple example requires more than 10 files is pretty inconvenient. Although we have a DSL issue open right now, it isn't something we can do in just a few days; we need to somehow reduce the number of new files needed for a algorithm without the help of a DSL, quickly. I'll add a separate issue.

jsjason commented 9 years ago

I don't think a single-stage style is inappropriate. I was just concerned that the existing algorithm style is different from the style this PR follows. I don't have any objections on merging this PR, though; we can discuss the stage issue on a separate issue or PR, then refactor one style into the other later.

jsjason commented 9 years ago

This looks good. I'll merge it.

jsjason commented 9 years ago

Before I merge it, @kijungshin also volunteered to do a review.

kijungs commented 9 years ago

@jsjason @dongjoon-hyun I reviewed this PR. It seems good but please review the issue regarding receiveReduceData method of PageRankCtrlTask class before merging it.

kijungs commented 9 years ago

I found that the name of package is pagerank. However its level is different from that of classification, clustering and regression. Personally, ranking or graph sounds good.

kijungs commented 9 years ago

This code assumes the 'Linked List' as an input format, which is rarely used to store large-scale graphs though. It would be great to consider 'Edge List' format too. To handle the format, it seems that outgoing degree of each node should be computed (by reduce) and broadcast to compute tasks.

kijungs commented 9 years ago

What I mean by outgoing degree is the number of outgoing neighbors (not a list of them)

dongjoon-hyun commented 9 years ago

I agree with you that Edge List is better, but, IMHO, in PageRank domain, Adjacency List isn't rarely used format. (not Linked List) I actually assumed a real application extracting outgoing href links from a url or a filename (one html file), so I used Adjacency List. In addition, the point you mentioned is exactly the weak point in `Edge List'. It needs more code. In other words, it's a little bit more complex to be used as a simple case. Why don't you keep this as a intern job? :-)

dongjoon-hyun commented 9 years ago

I changed the code, @kijungshin . Thank you for your kind advice. One more thing, do you think Neural Network related class into graph, too?

jsjason commented 9 years ago

Don't worry about the package name; I'll create a separate issue for managing package names.

jsjason commented 9 years ago

Oh, @dongjoon-hyun already made the change. Nice.

kijungs commented 9 years ago

Thank you for the information!! I didn't know the fact. Supporting Edge List seems a good practice for new interns.

On Thu, Jun 25, 2015 at 1:26 AM, Dongjoon Hyun notifications@github.com wrote:

I agree with you that Edge List is better, but, IMHO, in PageRank domain, Adjacency List isn't rarely used format. (not Linked List) I actually assumed a real application extracting outgoing href links from a url or a filename (one html file), so I used Adjacency List. In addition, the point you mentioned is exactly the weak point in `Edge List'. It needs more code. In other words, it's a little bit more complex to be used as a simple case. Why don't you keep this as a intern job? :-)

— Reply to this email directly or view it on GitHub https://github.com/cmssnu/dolphin/pull/47#issuecomment-114932059.

kijungs commented 9 years ago

@dongjoon-hyun. Thank you for listening to my advice. I thought Neural Network related classes are classified into 'classification'. Other graph mining algorithms (e.g., connected component and shortest path) can be classified into 'graph'.

dongjoon-hyun commented 9 years ago

@kijungshin , I see. I will use classification for Neural Network. By the way, any more comments for this PR? :)

kijungs commented 9 years ago

That's all. Thank you for taking the my comments into consideration!

dongjoon-hyun commented 9 years ago

Hi, @jsjason . Could you merge this PR? I think it's ready to finish review process.

jsjason commented 9 years ago

I tested this on both local and YARN runtime, and it works with the provided sample dataset. I'll merge it.

dongjoon-hyun commented 9 years ago

Thank you!