Deduplicating PGA with apollo post

r0mainK commented 6 years ago

@vmarkovtsev sorry for the delay, here is a first version of the post

Still missing:

waiting for the community detection on the 4 largest graphs, obtained for the 80% threshold, to end -> all 4 have over 100k nodes (see in blog post)
add one last graph so Ruby is represented, and maybe a sixth
add links to the Models (waiting to push CCs and CMDs, bags are already up)
~add an additional doc for apollo to describe how to use the models~ : PRed, and it might be worth doing the same for the BOW model in sourced-ml

r0mainK commented 6 years ago

Sure, no problem if you think it will be less time consuming. I'm not going to be able to do much today (dentist and moving into my flat), so unless you get around to doing it today, I'll push the final graph and its description tonight.

vmarkovtsev commented 6 years ago

Staging: https://blog-staging.srcd.run/post/deduplicating_pga_with_apollo/

r0mainK commented 6 years ago

just added the last graph, put in third place given its size

also reviewed the first paragraph, looks good to me minus a typo I corrected

vmarkovtsev commented 6 years ago

@r0mainK Is it possible to replace all JPG plots with PNG?

r0mainK commented 6 years ago

Sure np, will do it this evening, I'll just screenshot the images. EDIT: done

vmarkovtsev commented 6 years ago

@r0mainK Can you please add labels to images, e.g.

{{% caption src="/post/difftree/names.png" %}}
An example of a Git tree with some names in their nodes. The names of the nodes are shown between double quotes.
{{% /caption %}}

r0mainK commented 6 years ago

@vmarkovtsev sure, the graphs only or also the plots and pie charts ?

vmarkovtsev commented 6 years ago

All the graphics - some people look only at the images and the captions.

vmarkovtsev commented 6 years ago

@r0mainK what is "Cliques count"? A clique is a fully connected part of the graph and we definitely could not find them because it is an NP-hard problem. Did you mean edges?

vmarkovtsev commented 6 years ago

Sorry, question closed, I have read the next paragraph :) Changing to buckets.

vmarkovtsev commented 6 years ago

Actually, they are indeed cliques...
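To make the clique/bucket back-and-forth concrete, here is a minimal sketch. It assumes that every file in a hash bucket is linked to every other file in that bucket, so each bucket is a clique by construction and no NP-hard clique search is involved. The `buckets` input layout and both function names are hypothetical and are not apollo's API.

```python
def connected_components(buckets):
    """Group files into connected components, treating each bucket as a clique."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for bucket in buckets:
        files = list(bucket)
        for f in files:
            find(f)  # register every file, including single-file buckets
        # Linking each file to the first one is enough to merge the whole clique.
        for other in files[1:]:
            parent[find(other)] = find(files[0])

    components = {}
    for f in list(parent):
        components.setdefault(find(f), set()).add(f)
    # Largest components first, ties broken by file names for determinism.
    return sorted(components.values(), key=lambda c: (-len(c), sorted(c)))


def clique_edge_count(bucket):
    """Number of edges a bucket of n distinct files contributes: n * (n - 1) / 2."""
    n = len(set(bucket))
    return n * (n - 1) // 2
```

For example, `connected_components([["a.py", "b.py"], ["b.py", "c.py"], ["d.py"]])` merges the first two buckets through the shared `b.py` and leaves `d.py` on its own.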

vmarkovtsev commented 6 years ago

@r0mainK Did you weight the features? Cannot find anywhere that you mention it.

vmarkovtsev commented 6 years ago

Staging updated. Is it possible to increase the fonts in the bar charts, histograms and the ratio of distinct filenames plot?

r0mainK commented 6 years ago

@vmarkovtsev no I didn't, but I thought I could do it the same way I did for the average feature count (it might be better to replace it rather than just add it). It will take a day or two though. For the fonts in the pie chart and histograms I'll do it now, but for the ratio of distinct filenames it will take some time, since it scales quadratically with the number of filenames per CC.
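A rough illustration of where the quadratic cost comes from. The exact filename metric is defined in the post, not in this thread, so the sketch below is only a stand-in: it assumes every pair of filenames inside one connected component gets compared, which means n * (n - 1) / 2 comparisons for n filenames. The function name is hypothetical.

```python
from itertools import combinations


def distinct_pair_ratio(filenames):
    """Fraction of filename pairs whose names differ, over all n * (n - 1) / 2 pairs."""
    pairs = list(combinations(filenames, 2))
    if not pairs:
        # Zero or one filename: there is nothing to compare.
        return 0.0
    differing = sum(1 for a, b in pairs if a != b)
    return differing / len(pairs)
```

With 1,000 filenames in a component this already means 499,500 comparisons, which is why the largest components dominate the running time.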

vmarkovtsev commented 6 years ago

According to @EgorBu, weighting gives a big accuracy boost. That also calls for the section about how we labeled our pairs and used them to optimize, which @EgorBu promised to write in https://github.com/src-d/backlog/issues/1253

r0mainK commented 6 years ago

@vmarkovtsev oh okay, I didn't understand the question. I thought you were asking if I'd counted the average weights for each feature and per language, which I just did. So if your question is whether I used weights during the hashing for each feature kind: no I didn't, I used weights equal to 1. If you're asking whether the bags of features were weighted: yes, naturally.

However, although @EgorBu's results could be mentioned when we talk about the lack of metrics, they couldn't really be used, as he tuned for only one language, whereas we're using 5 here, and on a restricted number of files. And as shown below, not only do the average counts vary, the average weighted counts should too, given the average weights:

| | identifiers | literals | graphlets | children | uast2seq | node2vec |
|---|---|---|---|---|---|---|
| Average weight cross-language | 6.42 | 6.94 | 4.98 | 3.93 | 4.78 | 7.67 |
| Average weight for Python files | 6.78 | 7.28 | 5.44 | 4.63 | 5.26 | 8.91 |
| Average weight for Java files | 6.35 | 7.12 | 4.48 | 3.46 | 4.23 | 6.64 |
| Average weight for JavaScript files | 6.34 | 6.71 | 4.83 | 3.36 | 4.67 | 7.25 |
| Average weight for Ruby files | 5.14 | 7.01 | 6.20 | 5.08 | 6.27 | 10.81 |
| Average weight for Go files | 6.68 | 7.35 | 5.36 | 4.64 | 4.88 | 8.29 |

Anyway, should I replace the counts with the weighted counts? It might take some rewriting of the paragraph. Or I could just append the results.

vmarkovtsev commented 6 years ago

I see. Is it hard to rehash the files with

identifiers weighted to 1
literals weighted to 1.25
graphlets weighted to 2.5
the rest weighted to 0?

In theory, it will take us less than 4 days without much human participation. Unfortunately, this means the numbers will change.

If that's too much, no worries.
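A minimal sketch of what the proposed per-kind weighting could look like when applied to a bag of features before hashing. The bag layout (a dict keyed by `(kind, feature)` pairs) and the function name are assumptions made for illustration; this is not how apollo or sourced-ml actually store or reweight bags.

```python
# Per-feature-kind weights proposed above; kinds not listed are weighted to 0.
KIND_WEIGHTS = {"identifiers": 1.0, "literals": 1.25, "graphlets": 2.5}


def reweight_bag(bag, kind_weights=KIND_WEIGHTS):
    """Scale each feature by the weight of its kind and drop zeroed features.

    `bag` maps (kind, feature) pairs to already-weighted values; this layout
    is hypothetical and only serves the example.
    """
    reweighted = {}
    for (kind, feature), value in bag.items():
        scaled = value * kind_weights.get(kind, 0.0)
        if scaled > 0:
            reweighted[(kind, feature)] = scaled
    return reweighted
```

With these weights, features of the children, uast2seq and node2vec kinds disappear from every bag, so the files have to be rehashed and the downstream numbers change.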

vmarkovtsev commented 6 years ago

@r0mainK I dumped the current state, staging updated.

There is only one thing which bothers me: shouldn't we measure the similar file names ratio (very smart idea btw!) in the detected communities instead of the connected components?

vmarkovtsev commented 6 years ago

Also what was the timeout for detecting communities in the 4 largest CCs?

r0mainK commented 6 years ago

@vmarkovtsev

hash and CC shouldn't take much time; CMD might take a bit more, depending on the CCs, but it's feasible. However, are you sure we replace and not just append? It means we will not be using node2vec, children, uast2seq, which is too bad. Also, appending might be a bit much, but we'll be able to compare naive vs pseudo-optimized. Anyway, launching it now, the cluster is not being used atm.
thanks ^^ yeah, I was kinda thinking about that when doing the Ruby graph. The thing is, it might start to be a lot of different plots if we have 4 types of CCs / CMDs. It gives nearly the same result for 95% btw, and probably for 80% too (gonna go eat now):

timeout was a day or two for walktrap, less for infomap and more for fastgreedy

vmarkovtsev commented 6 years ago

Awesome. No need to show file names in CCs, just communities is enough.

Regarding the weights, indeed we are not using all the feature types we coded - that's what hyperoptimization found. There are still small weights assigned, we ignore them because they do not carry new information.

r0mainK commented 6 years ago

@vmarkovtsev okay, then do you want to take out the naive results altogether or not? If not, it might be worth at least saying we ran on the non-optimized feature partition and talking about the difference we saw. I'll post results when hashing is ready, and reupload the log-log hist, drop for communities, and pie chart with increased text, as well as update all stats.

vmarkovtsev commented 6 years ago

Sounds good.

vmarkovtsev commented 6 years ago

@r0mainK I have finished the first pass over the text until DRY gophers (staging updated). Please read the post and report any conceptual errors which I probably introduced. It will also help to see which plots or tables need to be updated.

r0mainK commented 6 years ago

@vmarkovtsev so with the update using Egor's hyperparams we will have to change all stats and plots from the maneuvering section, as well as add the paragraph on hyperparams at the beginning of it. Apart from that, a few things:

multi-language CCs were for the looser threshold
for the number of duplicates I think it's best to use the number of communities + single CCs, as the number of CCs is too loose, in my opinion. Also, saying we view CCs as equivalent to single files is a bit contradictory with doing community detection
we might be able to remove the end of maneuvering if we don't have very large CCs with hyperparams (and anyway the graphs won't be as big, since we'll be removing buckets)

I haven't seen anything else that seems erroneous. For the new hyperparams, I almost finished hashing for the 95% threshold, I expect to have everything by tomorrow if CCs are not too large.

vmarkovtsev commented 6 years ago

Hyperparams should not change the sizes much.

vmarkovtsev commented 6 years ago

@r0mainK I finished editing (staging updated). I didn't understand how to resolve your points, so please fix them in your own commit.

The post is great and should be interesting to read for everybody from regular engineers (2nd part) to ML people (1st part).

vmarkovtsev commented 6 years ago

@m09 I believe your feedback will be valuable here.

vmarkovtsev commented 6 years ago

@m09 This is the biggest difference between a paper and a blog post: the bitter majority of the people who read the latter could not care less about the theory, and they want to be entertained, not taught. The point about each formula in a book reducing the audience by half is true. So our assumption about the "cutting edge" posts is that those who are interested will read deeper, those who are not (90%) will ack the pics, remember the keywords and put a plus on Reddit.

Splitting blog posts into parts is usually a bad idea unless there is more than one topic in the series. We could write a post specifically about the theory, but there is one big problem: there is nobody to write it. Romain is out, and I have other important posts to write. If you know other, better links to the fundamentals - let's add them.

@r0mainK Do you want to fix the review suggestions yourself or delegate it to me? There are also points which I cannot fix myself.

r0mainK commented 6 years ago

@vmarkovtsev no don't worry, I've just been unable to work on it as much with school, and the cluster crashed a couple times so I couldn't work as fast as I wanted. I think by tomorrow night or possibly tonight I should have updated everything hopefully, sorry for the delay. With the new hyperparams there are some changes so you might have to do a last review once I push though.

m09 commented 6 years ago

@vmarkovtsev I don't fully agree on the role of blogs. To me they are also a nice place to popularize (papers are not). But since popularizing would require some rewriting, I agree that we should keep this one as-is.

r0mainK commented 6 years ago

@vmarkovtsev pushed everything, didn't do anything related to this comment:

Lacks an introduction of the techno used. In the following paragraph we hear plenty of problems about techno we were not introduced to (like k8s, siva, and jgit-spark-connector). Giving a link doesn't replace telling the reader why they're used here in a few words, since most readers won't follow those links.

I think it's ready for a final review/rewrite on your side. If you want I can add this paragraph in the evening, but I'm in class all day so I can't do it now.

r0mainK commented 6 years ago

@vmarkovtsev fuck, I forgot to pull your commits and amended -> force-pushed my changes, I think it squashed yours :/

vmarkovtsev commented 6 years ago

@r0mainK no worries, as long as they are not erased

r0mainK commented 6 years ago

I meant erased, was hoping you still have them ?

vmarkovtsev commented 6 years ago

@r0mainK No panic, in this case everything has been preserved :)

vmarkovtsev commented 5 years ago

Second pass done. Staging updated. @campoy please review this monster - Romain spent 6 months on this.

vmarkovtsev commented 5 years ago

cc @vcoisne

vmarkovtsev commented 5 years ago

@campoy @vcoisne Friendly 1 week ping

vcoisne commented 5 years ago

@vcoisne scheduled for October 4th. Check out our Blog & Press schedule for the dates blog posts will be published.

vmarkovtsev commented 5 years ago

@vcoisne I mean this needs a review from the Devrel team...

vmarkovtsev commented 5 years ago

Published as https://blog.sourced.tech/post/deduplicating_pga_with_apollo/

@r0mainK Congratulations! Your internship is now officially complete :-)

src-d / blog

Deduplicating PGA with apollo post #242