husky-team / husky

A more expressive and most importantly, more efficient system for distributed data analytics.
http://www.husky-project.com/

[PageRank Example] Second token getting skipped when parsing input lines #305

Closed aminmkhan closed 5 years ago

aminmkhan commented 5 years ago

After reading the vertex ID, avoid incrementing the Tokenizer iterator twice.

lmatz commented 5 years ago

Maybe it is due to a different file format? "src : dst1 dst2" versus "src dst1 dst2".

kygx-legend commented 5 years ago

Right. Coping with a different format just means writing a different parsing function.

kygx-legend commented 5 years ago

Could you add a comment there instead of making this modification? Thanks a lot.

aminmkhan commented 5 years ago

> Maybe it is due to different file format? "src : dst1 dst2" "src dst1 dst2"

In my opinion, a better approach to handling such different formats would be to update the separators:

https://github.com/husky-team/husky/blob/eda5e3aaf8cf795dfbd90ee5ddece44907ccb664/examples/pagerank.cpp#L55

For example:

        boost::char_separator<char> sep(" \t,:;");
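
To illustrate why a wider separator set would cover both formats, here is a minimal standard-library sketch. It is not the code in examples/pagerank.cpp: `split_on_any` is a hypothetical helper that only imitates a separator which drops its delimiter characters and discards empty tokens, so that the example does not depend on Boost.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Husky or Boost): splits `line` on any
// character found in `seps`, dropping the separators and any empty tokens.
std::vector<std::string> split_on_any(const std::string& line, const std::string& seps) {
    std::vector<std::string> tokens;
    // Skip leading separators to find the start of the first token.
    std::string::size_type start = line.find_first_not_of(seps);
    while (start != std::string::npos) {
        // The token ends at the next separator, or at the end of the line.
        std::string::size_type end = line.find_first_of(seps, start);
        if (end == std::string::npos) {
            tokens.push_back(line.substr(start));
            break;
        }
        tokens.push_back(line.substr(start, end - start));
        // Skip the run of separators that follows the token.
        start = line.find_first_not_of(seps, end);
    }
    return tokens;
}
```

With the separator set " \t,:;", both split_on_any("1 : 2 3", " \t,:;") and split_on_any("1 2 3", " \t,:;") yield the tokens 1, 2 and 3, so the colon no longer shows up as a token of its own.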

kygx-legend commented 5 years ago

This approach is okay with me, but what if there is some other separator?

zzxx-husky commented 5 years ago

Hmm... That it++ may be there to skip the number of neighbors. I think the program assumes each line has the format: source num_neighbors neighbor_1 neighbor_2 ... neighbor_n. This format is quite common, by the way.
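
A minimal sketch of parsing that assumed line format, using only the standard library rather than the Boost tokenizer from the example. `ParsedVertex` and `parse_adjacency_line` are hypothetical names for illustration, not Husky APIs, and integer vertex IDs are an assumption.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for one parsed line (not a Husky type).
struct ParsedVertex {
    int id = 0;
    std::vector<int> neighbors;
};

// Assumes the format: source num_neighbors neighbor_1 ... neighbor_n,
// with whitespace-separated integer tokens.
ParsedVertex parse_adjacency_line(const std::string& line) {
    std::istringstream in(line);
    ParsedVertex v;
    int num_neighbors = 0;
    in >> v.id;           // first token: the source vertex ID
    in >> num_neighbors;  // second token: the neighbor count, read and discarded
    int dst = 0;
    while (in >> dst) {   // remaining tokens: the neighbor IDs
        v.neighbors.push_back(dst);
    }
    return v;
}
```

For the line "1 2 5 7" this gives vertex 1 with neighbors 5 and 7. For a line in the "src dst1 dst2" format, such as "1 5 7", the same code would treat 5 as the count and keep only 7, which matches the skipped second token described in this PR.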

aminmkhan commented 5 years ago

I agree with all your points. Without any comments this could be confusing, especially when working with different file formats. I have updated the PR with more comments.

https://github.com/husky-team/husky/blob/711db5b487dc3ca2c0e1695f5d223381e648f17a/examples/pagerank.cpp#L55-L61

kygx-legend commented 5 years ago

Thanks for the fix!