vmarkovtsev opened 6 years ago
This should definitely have a blog post associated with it as well /cc @campoy
I thought id2vec had been renamed? Any chance we could do this on TPUs on GCP? That would easily become a blog post on cloud.google.com/blog.
id2vec is a cool name and straight to the point... did you mean ast2vec instead?
We can run it on TPUs since it is TensorFlow under the hood. @zurk do you wish to play?
@vmarkovtsev do we have TPUs somewhere? I did not know about that. Yes, definitely I want to try.
@zurk we have access to them via Google Cloud, in the srcd-playground project. Be careful not to run up a huge bill using them.
Rephrasing Eiso: train it on science-3, measure the time, and then try TPUs.
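For reference, a minimal sketch of pointing a TF 1.x session at a Cloud TPU; the TPU name below is a placeholder, and id2vec's graph would still need TPU-compatible ops:

```python
# Hypothetical setup: attach a TensorFlow 1.x session to a Cloud TPU.
# "id2vec-tpu" is a placeholder; pass the real TPU name from srcd-playground.
import tensorflow as tf

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu="id2vec-tpu")
with tf.Session(resolver.master()) as sess:
    sess.run(tf.contrib.tpu.initialize_system())
    # ... run the id2vec training loop here ...
    sess.run(tf.contrib.tpu.shutdown_system())
```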
OK, we need to collect our co-occurrence matrix first.
blocked by https://github.com/src-d/engine/issues/339
Since the whole ML team has a lot of problems with the Engine, I am pausing my attempts to process the whole PGA dataset. The next steps are the following:
Right now the Engine is much better and I am able to process something, thanks to fixes from the DR team, @r0mainK's performance hacks, and @smola's help.
I also identified fast and slow siva files:
I ran a program which measures how much time it takes to process a simple command via the Engine for a single siva file. The command is: `engine.repositories.references.head_ref.commits.tree_entries.count()`.
If this command takes too much time to finish, it is a slow siva file. I want to exclude such files for now to speed up the ML experiments. Here you can find the results of my experiment:
https://gist.github.com/zurk/44c87ef6b31dff6e56a198ebd27f48e4
and the list of fast siva files (it is approximately half of the PGA dataset -- a good starting point): fast_sivas.txt.zip
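For the curious, a minimal sketch of what such a timing harness can look like, assuming the Engine's Python bindings and a local Spark session; the path is a placeholder:

```python
# Hypothetical timing harness for a single siva directory; the Engine
# constructor signature is assumed, the path below is a placeholder.
import time
from pyspark.sql import SparkSession
from sourced.engine import Engine

spark = SparkSession.builder.master("local[*]").appName("siva-benchmark").getOrCreate()

def time_siva(siva_dir):
    """Return how long the reference command takes for one siva directory."""
    engine = Engine(spark, siva_dir, "siva")
    start = time.time()
    engine.repositories.references.head_ref.commits.tree_entries.count()
    return time.time() - start

print(time_siva("/data/pga/siva/latest/00"))
```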
My next steps: run `repo2coocc` on the fast siva files.
Current status: processed the `00` subdirectory into a coocc model for the Python, Java, and Go languages all together. :tada:
Better than nothing. @zurk Do we really have the preprocessing in Apollo instead of sourced-ml? If it is so, could you please move this preprocessing to sourced-ml?
I must also note to the future reader that if we had not had problems with Spark, we would not have written the code to merge DFs and Cooccs.
Yeah, I have also been thinking about it. Definitely it is time to move it. We can use the `siva -> parquet` approach everywhere. Can you please create another issue for this task?
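For illustration, a sketch of the `siva -> parquet` idea, assuming the Engine's Python bindings; the paths and the chosen pipeline stages are placeholders, not the actual Apollo/sourced-ml code:

```python
# Hypothetical one-off conversion: extract UASTs from siva files once and
# persist them as Parquet so later experiments skip the expensive git stage.
from pyspark.sql import SparkSession
from sourced.engine import Engine

spark = SparkSession.builder.master("local[*]").getOrCreate()
engine = Engine(spark, "/data/siva", "siva")
uasts = (engine.repositories.references.head_ref.commits.tree_entries.blobs
         .classify_languages()
         .extract_uasts())
uasts.write.parquet("/data/parquet/uasts")
```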
PR for the df part: https://github.com/src-d/ml/pull/252
Second PR for the coocc part: https://github.com/src-d/ml/pull/254. @vmarkovtsev please review when you have time.
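Conceptually, the DF merge in those PRs boils down to summing per-shard counts, since each shard covers disjoint documents; a schematic sketch with plain dicts instead of the actual model classes:

```python
# Schematic DF merge: document frequencies from disjoint shards add up.
from collections import Counter

def merge_document_frequencies(shards):
    """shards: iterable of (num_docs, {token: doc_frequency}) pairs."""
    total_docs, merged = 0, Counter()
    for num_docs, freqs in shards:
        total_docs += num_docs
        merged.update(freqs)
    return total_docs, dict(merged)
```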
It has been a long time since the last update. I will try to report more frequently.
`repo2coocc` progress: 60 co-occurrence matrix and document frequency models need to be merged so far. Right now I continue to process PGA via `repo2coocc`.
I plan to prune the current data using document frequency, learn 200-dim embeddings, and continue my experiments.
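For instance, a minimal sketch of such pruning, assuming the sourced-ml `DocumentFrequencies` model API (its `greatest()` method comes up later in this thread); the path and vocabulary size are placeholders:

```python
# Hypothetical pruning step: keep only the most frequent tokens before
# building embeddings. The path and size below are placeholders.
from sourced.ml.models import DocumentFrequencies

df = DocumentFrequencies().load("docfreq.asdf")
pruned = df.greatest(1000000)  # keep the 1M tokens with the highest DF
```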
Very good report @zurk, thanks
From your report, it's quite obvious that the neural splitter is needed. Nice report!
@zurk status?
There was a pause in processing because I needed compute resources for some other tasks. Now it is unpaused.
I have processed 140 of the 210 PGA subdirectories, so it is about 67% done.
After that, I need to merge the models into one and run id2vec. I plan to use our cluster to speed up the coocc matrix collection process.
Two more days have passed and I have 14 more PGA subdirectories. Today I will try to launch the extraction on a cluster.
It is done. I trained two models, with 40 epochs and with 100; they are on science-3:
`/storage/konstantin/emb-0717.asdf`
and `/storage/konstantin/emb-0717-2.asdf`
However, I ran my test task for it (https://github.com/src-d/backlog/issues/1249) and found out that I got worse results than before. Now I am doing a fairer comparison.
Any kind of mistake is possible here, so I am trying to find out what could be wrong.
I found a problem: it was a bug in the `df.greatest()` method. Here is a PR: https://github.com/src-d/ml/pull/305
@zurk please update
So, last time I had an idea that something was wrong with the document frequency model. I decided to take an old one from here: https://github.com/src-d/models/blob/master/docfreq/f64bacd4-67fb-4c64-8382-399a8e7db52a.md and build a new cooccurrence matrix only for the tokens present in both (my current and the old) df models. The idea failed: the results are still bad.
Next step: take a deeper look at the new cooccurrence model and compare it with the old one. I will look for the nearest neighbors in cooccurrence matrix space. It is a memory-intensive task, so I tried to avoid it before, but now I have no choice. One more hypothesis about where to look for answers is that something goes wrong when we move from 0.5 PGA to the full PGA, so I want to build one more id2vec model on a random half of PGA and see what performance I can achieve. And it is not about pruning values to 2**32-1: there are only ~200 values which were pruned.
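A minimal sketch of that nearest-neighbor probe, assuming the co-occurrence matrix is available as a scipy CSR matrix and `tokens` maps row indices to identifiers (both names are placeholders):

```python
# Hypothetical nearest-neighbor check in co-occurrence space:
# cosine similarity between L2-normalized matrix rows.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def nearest_neighbors(coocc: csr_matrix, tokens, query: int, k: int = 10):
    rows = normalize(coocc, norm="l2", axis=1)            # unit-length rows
    sims = np.asarray((rows @ rows[query].T).todense()).ravel()
    order = np.argsort(-sims)                             # most similar first
    return [(tokens[i], sims[i]) for i in order if i != query][:k]
```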
Did we ever figure this out?
@eiso No, Konstantin stopped working on this several weeks ago since we all had to work on the style-analyzer to meet the deadline.
Yes, that is right. Next step: subtract the old matrix from the new one and look for anomalies in the diff.
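A sketch of that diff, assuming both co-occurrence matrices are scipy sparse matrices aligned to the same token order and exported to `.npz` (all paths and names are placeholders):

```python
# Hypothetical anomaly scan: subtract aligned sparse matrices and inspect
# the entries that changed the most.
import numpy as np
from scipy.sparse import load_npz

coocc_old = load_npz("coocc_old.npz").tocsr()      # placeholder paths
coocc_new = load_npz("coocc_new.npz").tocsr()
tokens = np.load("tokens.npy", allow_pickle=True)  # placeholder vocabulary

diff = (coocc_new - coocc_old).tocoo()
top = sorted(zip(diff.row, diff.col, diff.data), key=lambda t: -abs(t[2]))
for i, j, delta in top[:20]:
    print(tokens[i], tokens[j], delta)
```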
@r0mainK this is all yours now.
Actually... @m09 do you think you can do this after Romain engineers a stable UAST processing pipeline? The guy already has a few tasks related to UASTs...
@vmarkovtsev since this is basically a follow-up of the identifier extraction I don't mind doing this as well
ok
Run id2vec on the PGA dataset and produce the model. Publish the model with modelforge. Fix all the bugs found.
This includes updating the Dockerfile for Python 3 and infra issues.
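Once published, the model could be consumed roughly like this, assuming sourced-ml's `Id2Vec` model exposes `tokens` and `embeddings`; the path and the identifier are placeholders:

```python
# Hypothetical lookup in the published embedding model.
from sourced.ml.models import Id2Vec

model = Id2Vec().load("emb.asdf")                      # placeholder path
index = {tok: i for i, tok in enumerate(model.tokens)}
vector = model.embeddings[index["foo"]]                # embedding of "foo"
```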