src-d / ml-backlog

Issues belonging to source{d}'s Machine Learning team which cannot be related to a specific repository.

[id2vec] Run id2vec on Public Git Archive #17

Open vmarkovtsev opened 6 years ago

vmarkovtsev commented 6 years ago

Run id2vec on the PGA dataset and produce the model. Publish the model with modelforge. Fix all bugs found along the way.

Includes updating the Dockerfile for Python 3 and resolving infra issues.

eiso commented 6 years ago

This should definitely have a blog post associated with it as well /cc @campoy

campoy commented 6 years ago

I thought id2vec had been renamed? Any chance we could do this on TPUs on GCP? That would easily become a blog post on cloud.google.com/blog.

vmarkovtsev commented 6 years ago

id2vec is a cool name and straight to the point... Did you mean ast2vec instead?

We can run it on TPUs since it is TensorFlow under the hood. @zurk do you wish to play?

zurk commented 6 years ago

@vmarkovtsev do we have TPUs somewhere? I did not know about that. Yes, I definitely want to try.

eiso commented 6 years ago

@zurk we have access to them via Google Cloud, in the srcd-playground project. Be careful not to run up a huge bill using them.

vmarkovtsev commented 6 years ago

Rephrasing Eiso: train it on science-3, measure the time, and then try TPUs.

zurk commented 6 years ago

OK, I need to collect our co-occurrence matrix first.

zurk commented 6 years ago

blocked by https://github.com/src-d/engine/issues/339

zurk commented 6 years ago

Since the whole ML team is having a lot of problems with the Engine, I am pausing my attempts to process the full PGA dataset. The next steps are:

  1. Use the siva files from PGA that are smaller than 50 MB.
  2. Work on the next issue about the toy problem: https://github.com/src-d/backlog/issues/1248
zurk commented 6 years ago

Right now the Engine is much better and I am able to process data, thanks to fixes from the DR team, @r0mainK's performance hacks, and @smola's help.

I also classified siva files as fast or slow: I ran a program which measures how long it takes to process a simple command via the Engine on a single siva file. The command is engine.repositories.references.head_ref.commits.tree_entries.count(). If this command takes too much time to finish, it is a slow siva file. I want to exclude such files for now to speed up ML experiments. Here you can find the results of my experiment: https://gist.github.com/zurk/44c87ef6b31dff6e56a198ebd27f48e4

Here is the list of fast siva files (approximately half of the PGA dataset -- a good starting point): fast_sivas.txt.zip
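
For reference, a minimal sketch of the timing check described above, assuming the PySpark-based sourced.engine Python wrapper and a local Spark session; the threshold and paths are illustrative, not the exact values I used:

```python
import time

from pyspark.sql import SparkSession
from sourced.engine import Engine  # assumed import path of the Engine's Python wrapper

spark = SparkSession.builder.master("local[*]").appName("siva-timing").getOrCreate()

def time_siva(siva_dir, threshold=60.0):
    """Run the benchmark query on one directory of siva files and report fast/slow."""
    engine = Engine(spark, siva_dir, "siva")
    start = time.time()
    # The exact command chain from the experiment above.
    engine.repositories.references.head_ref.commits.tree_entries.count()
    elapsed = time.time() - start
    return ("slow" if elapsed > threshold else "fast"), elapsed

print(time_siva("/data/pga/siva/latest/00"))
```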

My next steps:

  1. Extract cooccurrence matrix via repo2coocc on fast Siva-files.
  2. Calculate embeddings for this matrix.
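
This is not the actual repo2coocc implementation, just a toy illustration of the kind of symmetric co-occurrence counts it produces, under the simplifying assumption that identifiers co-occur when they appear in the same file:

```python
from collections import defaultdict
from itertools import combinations

def build_coocc(files):
    """files: iterable of identifier-token lists, one list per source file."""
    counts = defaultdict(int)
    for tokens in files:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[a, b] += 1
            counts[b, a] += 1  # keep the matrix symmetric
    return counts

print(build_coocc([["read", "file", "path"], ["read", "buffer"]]))
```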
zurk commented 6 years ago

Current status:

  1. I decided to use the Apollo preprocessing to speed up co-occurrence matrix collection, so I asked @r0mainK to run it on a cluster, since it is related to his current task: https://github.com/src-d/backlog/issues/1196#issuecomment-388740509. Almost all PGA subdirectories have now been preprocessed.
  2. I was able to convert the Parquet files from the 00 subdirectory into a coocc model for Python, Java and Go all together. :tada: Better than nothing.
  3. However, I now need to write code to merge the document frequency models and the co-occurrence models. As discussed with @vmarkovtsev, I reuse the old code for the document frequency model because it is small enough to be processed on one PC, and I write new code for the co-occurrence model using Spark (a sketch of the merge follows below).
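
A rough sketch of the merge step for the co-occurrence part, assuming every per-subdirectory model exposes its token list and a SciPy sparse matrix; the real code runs on Spark, this only shows the single-machine idea:

```python
from scipy import sparse

def merge_cooccs(parts):
    """parts: list of (tokens, sparse matrix) pairs, one per PGA subdirectory."""
    # Build one shared vocabulary across all parts.
    vocab = {}
    for tokens, _ in parts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    size = len(vocab)
    # int64 accumulation avoids the int32 overflow mentioned further down the thread.
    merged = sparse.csr_matrix((size, size), dtype="int64")
    for tokens, matrix in parts:
        remap = [vocab[tok] for tok in tokens]
        coo = matrix.tocoo()
        rows = [remap[i] for i in coo.row]
        cols = [remap[j] for j in coo.col]
        merged += sparse.csr_matrix((coo.data, (rows, cols)), shape=(size, size), dtype="int64")
    ordered = sorted(vocab, key=vocab.get)
    return ordered, merged
```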
vmarkovtsev commented 6 years ago

@zurk Do we really have the preprocessing in Apollo instead of sourced-ml? If so, could you please move this preprocessing to sourced-ml?

vmarkovtsev commented 6 years ago

I must also note for the future reader that if we had not had problems with Spark, we would not have written the code to merge DFs and Cooccs.

zurk commented 6 years ago

Yeah, I have also been thinking about it. Definitely, it is time to move it. We can use the siva -> Parquet approach everywhere. Can you please create another issue for this task?

zurk commented 6 years ago

PR for the df part: https://github.com/src-d/ml/pull/252

zurk commented 6 years ago

Second PR, for the coocc part: https://github.com/src-d/ml/pull/254. @vmarkovtsev please review when you have time.

zurk commented 6 years ago

It has been a long time since the last update. I will try to report more frequently.

  1. Around 60 of 255 PGA subdirectories have been processed by repo2coocc, which means 60 co-occurrence matrices and document frequency models need to be merged.
  2. They were merged with the code in the PRs mentioned above. As a result, I have a matrix of roughly 700k x 700k.
  3. Running Swivel was a problem for several days: the loss just hit NaN values and it was hard to understand why. Finally, the reason was found: an int32 overflow during the merge process (see the short illustration after this list). It was fixed, and everything works well after that.
  4. As usual, local and minor improvements to sourced-ml were made.
  5. I ran the same simple experiments @vmarkovtsev did with the legacy embeddings, to compare. Here you can find the result: https://gist.github.com/zurk/df7cf66818e11271934581674128eeeb @vmarkovtsev please review when you have time. It works okayish. I do not see results as good as with the legacy embeddings, but some relations can be observed. My thoughts on why it is not so good:
    1. the embedding dimension is 300 instead of 200, so there is more room for noise/overfitting;
    2. there is no filtering by document frequency, so there is more noise;
    3. once we use our neural splitter, it should give us much better sub-identifiers by itself.
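
To make point 3 concrete, a minimal illustration with made-up numbers of how int32 counts wrap around during a merge and later poison the training:

```python
import numpy as np

# Co-occurrence counts summed over ~60 shards can exceed 2**31 - 1; int32 then
# silently wraps to a negative value, which later drives the Swivel loss to NaN.
counts32 = np.array([2_000_000_000], dtype=np.int32)
print(counts32 + counts32)  # [-294967296] -- overflowed

# Accumulating in a wider dtype keeps the merged counts correct.
counts64 = counts32.astype(np.int64)
print(counts64 + counts64)  # [4000000000]
```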

Right now I continue to process PGA via repo2coocc. I plan to prune the current data using document frequency, learn 200-dim embeddings, and continue my experiments.
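
A small sketch of that pruning idea, assuming `tokens`, `df` (token -> document frequency) and `coocc` come from the merged models above; the frequency band is an arbitrary placeholder:

```python
def prune_by_df(tokens, df, coocc, min_df=5, max_df=1_000_000):
    """Keep only tokens whose document frequency lies inside [min_df, max_df]."""
    keep = [i for i, tok in enumerate(tokens) if min_df <= df.get(tok, 0) <= max_df]
    pruned = coocc.tocsr()[keep, :][:, keep]
    return [tokens[i] for i in keep], pruned
```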

vmarkovtsev commented 6 years ago

Very good report @zurk, thanks

eiso commented 6 years ago

From your report, it's quite obvious that the neural splitter is needed. Nice report!

vmarkovtsev commented 6 years ago

@zurk status?

zurk commented 6 years ago

There was a pause in processing because I needed compute resources for some other tasks. Now it is unpaused.
I have processed 140 of 210 PGA subdirectories, so it is about 66% done.

After that, I need to merge the models into one and run id2vec. I plan to use our cluster to speed up the coocc matrix collection process.

zurk commented 6 years ago

Two more days and I have +14 PGA subdirectories. Today I will try to launch the extraction on a cluster.

zurk commented 6 years ago

It is done. I trained two models, with 40 epochs and with 100. They are on science-3: /storage/konstantin/emb-0717.asdf and /storage/konstantin/emb-0717-2.asdf. However, I ran my test task on them (https://github.com/src-d/backlog/issues/1249) and found out that the results are worse than before. Now I am doing a fairer comparison.

Any kind of mistake is possible here, so I am trying to find out what could be wrong.

zurk commented 6 years ago

I found the problem: a bug in the df.greatest() method. Here is the PR: https://github.com/src-d/ml/pull/305
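
For context, the intended semantics of such a greatest()-style selection (not the actual sourced-ml code or the bug itself, which lives in the linked PR) is to keep the N tokens with the highest document frequency:

```python
import heapq

def greatest(df, n):
    """df: token -> document frequency; return the n most frequent tokens."""
    return dict(heapq.nlargest(n, df.items(), key=lambda kv: kv[1]))

print(greatest({"x": 10, "foo": 3, "i": 50}, 2))  # {'i': 50, 'x': 10}
```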

vmarkovtsev commented 6 years ago

@zurk please update

zurk commented 6 years ago

So, last time I had an idea that something was wrong with the document frequency model. I decided to take an old one from here: https://github.com/src-d/models/blob/master/docfreq/f64bacd4-67fb-4c64-8382-399a8e7db52a.md and build a new co-occurrence matrix only for tokens present in both (my current and the old) df models. The idea failed; the results are still bad.

Next step: take a deeper look at the new co-occurrence model and compare it with the old one. I will look for the nearest neighbors in co-occurrence matrix space (a sketch follows below). It is a memory-intensive task, so I tried to avoid it before, but now I have no choice. One more hypothesis about where to look for answers is that something goes wrong when we move from 0.5 PGA to the full PGA, so I want to build one more id2vec model on a random half of PGA and see what performance I can achieve. And it is not about pruning values to 2**32-1; there are only ~200 values that were pruned.
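
A sketch of that nearest-neighbor check, assuming each token's row of the co-occurrence matrix serves as its vector; densifying the full matrix is exactly what makes this memory-intensive:

```python
import numpy as np

def nearest_neighbors(coocc, tokens, query, k=10):
    """Cosine nearest neighbors of `query` in co-occurrence row space."""
    dense = coocc.toarray().astype(np.float64)  # memory-hungry on a 700k x 700k matrix
    idx = tokens.index(query)
    row = dense[idx]
    norms = np.linalg.norm(dense, axis=1) * (np.linalg.norm(row) or 1.0)
    sims = dense @ row / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-sims)
    return [(tokens[i], float(sims[i])) for i in order if i != idx][:k]
```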

eiso commented 6 years ago

Did we ever figure this out?

vmarkovtsev commented 6 years ago

@eiso No, Konstantin stopped working on this several weeks ago since we all had to work on the style-analyzer to fulfill the deadline.

zurk commented 6 years ago

Yes, that is right. Next step: subtract the old matrix from the new one and look for anomalies in the diff (sketch below).
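
A small sketch of that diff, assuming both matrices have already been restricted to a shared vocabulary: subtract the old counts from the new ones and list the token pairs with the largest changes, which is where anomalies should show up.

```python
import numpy as np

def largest_diffs(new, old, tokens, top=20):
    """new, old: SciPy sparse co-occurrence matrices over the same token order."""
    diff = (new - old).tocoo()
    order = np.argsort(-np.abs(diff.data))[:top]
    return [(tokens[diff.row[i]], tokens[diff.col[i]], int(diff.data[i]))
            for i in order]
```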

vmarkovtsev commented 5 years ago

@r0mainK this is all yours now.

vmarkovtsev commented 5 years ago

Actually... @m09 do you think you can do this after Romain engineers a stable UAST processing pipeline? The guy already has a few tasks related to UASTs...

r0mainK commented 5 years ago

@vmarkovtsev since this is basically a follow-up of the identifier extraction I don't mind doing this as well

vmarkovtsev commented 5 years ago

ok