Closed: r0mainK closed this issue 5 years ago
Sure, no problem if you think it will be less time consuming. I'm not going to be able to do much today (dentist and moving into my flat), so unless you get around to doing it today I'll push the final graph and its description tonight.
Just added the last graph, put it in third place given its size.
Also reviewed the first paragraph; looks good to me minus a typo I corrected.
@r0mainK Is it possible to replace all JPG plots with PNG?
Sure, no problem, will do it this evening. I'll just screenshot the images. EDIT: done
@r0mainK Can you please add labels to images, e.g.
```
{{% caption src="/post/difftree/names.png" %}}
An example of a Git tree with some names in their nodes. The names of the nodes are shown between double quotes.
{{% /caption %}}
```
@vmarkovtsev sure, the graphs only, or also the plots and pie charts?
All the graphics - some people look only at the images and the captions.
@r0mainK what is "Cliques count"? A clique is a fully connected part of the graph and we definitely could not find them because it is an NP-hard problem. Did you mean edges?
Sorry, question closed, I have read the next paragraph :) Changing to buckets.
Actually, they are indeed cliques...
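To make the terminology concrete, here is a minimal sketch (my own toy example, not code from the post) of what a clique is and why enumerating them by brute force is hopeless on large graphs: a clique is a fully connected subgraph, and maximum-clique search is NP-hard, so exhaustive search like this only works on toy inputs.

```python
from itertools import combinations

# Toy graph: nodes 1, 2, 3 form a triangle; node 4 hangs off node 3.
edges = {(1, 2), (1, 3), (2, 3), (3, 4)}
nodes = {1, 2, 3, 4}

def is_clique(candidate):
    # Every pair of nodes in a clique must be connected by an edge.
    return all((a, b) in edges or (b, a) in edges
               for a, b in combinations(candidate, 2))

# Brute force over all node subsets of size >= 2: exponential in |nodes|.
cliques = [set(c) for k in range(2, len(nodes) + 1)
           for c in combinations(sorted(nodes), k) if is_clique(c)]
# Keep only maximal cliques (not contained in a bigger one).
maximal = [c for c in cliques if not any(c < other for other in cliques)]
print(maximal)  # [{3, 4}, {1, 2, 3}]
```

On the similarity graphs discussed in the post, with millions of nodes, only the trivially cheap structures (connected components, communities) are practical to compute.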
@r0mainK Did you weight the features? Cannot find anywhere that you mention it.
Staging updated. Is it possible to increase the font sizes in the bar charts, histograms, and the ratio of distinct filenames plot?
@vmarkovtsev no I didn't, but I thought I could do it the same way I did for the average feature count (might be better to replace it rather than just add it). It will take a day or two though. For the fonts in the pie chart and histograms I'll do it now, but the ratio of distinct filenames will take some time, since it scales quadratically with the number of filenames per CC.
According to @EgorBu, weighting gives a big accuracy boost. It also suggests adding the section about how we labeled our pairs and used them for optimization, which @EgorBu promised to write in https://github.com/src-d/backlog/issues/1253
@vmarkovtsev oh okay, I didn't understand the question. I thought you were asking if I'd counted the average weights for each feature and per language, which I just did. So if your question is whether I used weights during the hashing for each feature kind: no, I didn't, I used weights equal to 1. If you're asking whether the bags of features were weighted: yes, naturally.
However, although @EgorBu's results could be mentioned when we talk about the lack of metrics, they can't really be used: he tuned for only one language, whereas we're using 5 here, and on a restricted number of files. And as shown below, not only do the average counts vary, the average weighted counts should too, given the average weights:
| | identifiers | literals | graphlets | children | uast2seq | node2vec |
|---|---|---|---|---|---|---|
| Average weight cross-language | 6.42 | 6.94 | 4.98 | 3.93 | 4.78 | 7.67 |
| Average weight for Python files | 6.78 | 7.28 | 5.44 | 4.63 | 5.26 | 8.91 |
| Average weight for Java files | 6.35 | 7.12 | 4.48 | 3.46 | 4.23 | 6.64 |
| Average weight for JavaScript files | 6.34 | 6.71 | 4.83 | 3.36 | 4.67 | 7.25 |
| Average weight for Ruby files | 5.14 | 7.01 | 6.20 | 5.08 | 6.27 | 10.81 |
| Average weight for Go files | 6.68 | 7.35 | 5.36 | 4.64 | 4.88 | 8.29 |
Anyway, should I replace the counts with the weighted counts? It might take some rewriting of the paragraph. Or I could just append the results.
I see. Is it hard to rehash the files with the optimized weights?
In theory, it will take us less than 4 days without much human participation. Unfortunately, this means the numbers will change.
If that's too much, no worries.
@r0mainK I dumped the current state, staging updated.
There is only one thing which bothers me: shouldn't we measure the similar file names ratio (very smart idea btw!) in the detected communities instead of the connected components?
Also what was the timeout for detecting communities in the 4 largest CCs?
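As a rough sketch of the metric under discussion (the exact definition isn't spelled out in the thread, so this is my reading): a "similar file names ratio" over a group of duplicates can be computed as the fraction of file-name pairs whose names are similar, which is also what would make it scale quadratically with the number of file names per CC or community.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_name_ratio(names, threshold=0.8):
    """Fraction of name pairs in the group whose similarity >= threshold.

    Quadratic in len(names): every pair is compared once.
    The 0.8 threshold is an arbitrary illustrative choice.
    """
    pairs = list(combinations(names, 2))
    if not pairs:
        return 1.0  # a single file trivially agrees with itself
    similar = sum(SequenceMatcher(None, a, b).ratio() >= threshold
                  for a, b in pairs)
    return similar / len(pairs)

print(similar_name_ratio(["util.py", "utils.py", "main.go"]))
```

Swapping the groups fed to this function (connected components vs. detected communities) is all it takes to produce the comparison requested above.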
@vmarkovtsev `hash` and `cc` shouldn't take much time; `cmd` might take a bit more, depending on the CCs, but it's feasible. However, are you sure we replace and not just append? It means we will not be using `node2vec`, `children`, `uast2seq`, which is too bad. Also, appending might be a bit much, but we'll be able to compare naive vs pseudo-optimized. Anyway, launching it now, the cluster is not being used at the moment.
Thanks ^^ yeah, I was kinda thinking about that when doing the Ruby graph. The thing is, it might start to be a lot of different plots if we have 4 types of CCs/communities. It gives nearly the same result for 95% btw, and probably for 80% too (gonna go eat now):
Awesome. No need to show file names in CCs, just communities is enough.
Regarding the weights, indeed we are not using all the feature types we coded - that's what hyperoptimization found. There are still small weights assigned; we ignore them because they do not carry new information.
@vmarkovtsev okay, then do you want to take out the naive results altogether, or not? If not, it might be worth at least saying we ran on the non-optimized feature partition and talking about the difference we saw. I'll post results when hashing is ready, and reupload the log-log histogram, the drop for communities, and the pie chart with increased text size, as well as update all stats.
Sounds good.
@r0mainK I have finished the first pass over the text until DRY gophers (staging updated). Please read the post and report any conceptual errors which I probably introduced. It will also help to see which plots or tables need to be updated.
@vmarkovtsev so with the update using Egor's hyperparams we will have to change all stats and plots from the maneuvering section, as well as add the paragraph on hyperparams at the beginning of it. Apart from that, a few things:
I haven't seen anything else that seems erroneous. For the new hyperparams, I have almost finished hashing for the 95% threshold; I expect to have everything by tomorrow if the CCs are not too large.
Hyperparams should not change the sizes much.
@r0mainK I finished editing (staging updated). I didn't understand how to resolve your points, so please fix them in your own commit.
The post is great and should be interesting to read for everybody from regular engineers (2nd part) to ML people (1st part).
@m09 I believe your feedback will be valuable here.
@m09 This is the biggest difference between a paper and a blog post: the vast majority of the people who read the latter could not care less about the theory; they want to be entertained, not taught. The point about each formula in a book reducing the audience by half is true. So our assumption about the "cutting edge" posts is that those who are interested will read deeper, and those who are not (90%) will ack the pics, remember the keywords and put a plus on Reddit.
Splitting blog posts into parts is usually a bad idea unless there is more than one topic in the series. We could write a post specifically about the theory, but there is one big problem: there is nobody to write it. Romain is out, and I have other important posts to write. If you know other, better links to the fundamentals, let's add them.
@r0mainK Do you want to fix the review suggestions yourself or delegate it to me? There are also points which I cannot fix myself.
@vmarkovtsev no don't worry, I've just been unable to work on it as much because of school, and the cluster crashed a couple of times so I couldn't work as fast as I wanted. I think by tomorrow night, or possibly tonight, I should hopefully have everything updated, sorry for the delay. With the new hyperparams there are some changes, so you might have to do a last review once I push though.
@vmarkovtsev I don't fully agree on the role of blogs. To me they are also a nice place to popularize (papers are not). But since popularizing would require some rewriting, I agree that we should keep this one as-is.
@vmarkovtsev pushed everything, didn't do anything related to this comment:

> Lacks an introduction of the technologies used. In the following paragraph we hear about plenty of problems with technologies we were not introduced to (like k8s, siva, and jgit-spark-connector). Giving a link doesn't replace telling the reader in a few words why they're used here, since most readers won't follow those links.

I think it's ready for a final review/rewrite on your side. If you want, I can add this paragraph in the evening, but I'm in class all day so I can't do it now.
@vmarkovtsev fuck, I forgot to pull your commits and amended -> force-pushed my changes, I think it squashed yours :/
@r0mainK no worries, as long as they are not erased
I meant they were erased, I was hoping you still have them?
@r0mainK No panic, in this case everything has been preserved :)
Second pass done. Staging updated. @campoy please review this monster - Romain spent 6 months on this.
cc @vcoisne
@campoy @vcoisne Friendly 1 week ping
@vcoisne scheduled for October 4th. Check out our Blog & Press schedule for the dates blog posts will be published.
@vcoisne I mean this needs a review from the Devrel team...
Published as https://blog.sourced.tech/post/deduplicating_pga_with_apollo/
@r0mainK Congratulations! Your internship is now officially complete :-)
@vmarkovtsev sorry for the delay, here is a first version of the post
Still missing: