src-d / ml

sourced.ml is a library and a set of command-line tools to build and apply machine learning models on top of Universal Abstract Syntax Trees.

preprocrepos: what is this for and who is dzhigurda? #304

Open campoy opened 6 years ago

campoy commented 6 years ago

I'm reading this document and wondering what this command is for.

The description says "preprocess your data before passing it to any command you need", but this is too vague to be useful. What are the common use cases of the tool? Why was it created?

Finally, the last flag is dzhigurda... is that Nikita Dzhigurda?

vmarkovtsev commented 6 years ago

The description is outdated; the current one is at https://github.com/src-d/ml/blob/master/sourced/ml/__main__.py#L34. In short, the command caches UASTs and/or file contents so that we do not have to extract them again for downstream tasks (especially because extraction is typically the trickiest and most unreliable step).

Regarding Nikita: yep. He is a legendary Russian freak, and his surname sounds funny even to us. Mail.Ru Group developers (thousands of them) have an internal convention of calling the conditions for A/B tests "dzhigurdas". The goal of a dzhigurda is to select the proper configuration depending on the context. I thought it would be funny to continue the tradition and used that name for the dirty hack that artificially extends the dataset in src-d/ml. So dzhigurda chooses which commits to process.
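To make "chooses which commits to process" concrete, here is a minimal sketch of such a selector. It is not the code from src-d/ml: `select_commits` is a hypothetical helper, and the semantics are an assumption (a non-negative value N keeps HEAD plus the N commits before it, and a negative value such as -1 keeps the whole history).

```python
from typing import List, Sequence


def select_commits(history: Sequence[str], dzhigurda: int) -> List[str]:
    """Hypothetical helper, not part of sourced.ml.

    `history` lists commit hashes from newest (HEAD) to oldest.
    Assumed semantics: a negative value keeps every commit,
    N >= 0 keeps HEAD and the N commits that precede it.
    """
    if dzhigurda < 0:
        return list(history)
    return list(history[:dzhigurda + 1])


history = ["c3", "c2", "c1", "c0"]  # made-up hashes, newest first
print(select_commits(history, 0))   # HEAD only
print(select_commits(history, 1))   # HEAD and its parent
print(select_commits(history, -1))  # the entire history
```

Under this reading, a larger value multiplies the amount of data extracted per repository, which is why it works as a way to extend the dataset.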

sakalouski commented 6 years ago

Is there some way to access commits from a particular date? I am trying to convert a 440 MB repo with 6k commits. The Siva file is 1.2 GB, and I am wondering what the size of the .parquet output would be. It takes forever on a cluster node (dzhigurda -1) and then crashes; apparently 200 GB of RAM is not enough for this task.

I think I should use gitbase for that...
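On the date question, one option outside of sourced.ml is to ask plain git for the commit hashes in a date window and only process those. The sketch below is an illustration, not something suggested in this thread: `git_log_command` and `commits_between` are hypothetical helper names, and it assumes a regular checkout (not a Siva file) with `git` available on the PATH. `--since` and `--until` are standard `git log` options that filter by commit date.

```python
import subprocess
from typing import List, Optional


def git_log_command(since: Optional[str] = None,
                    until: Optional[str] = None) -> List[str]:
    """Build a `git log` invocation that prints one commit hash per line."""
    cmd = ["git", "log", "--format=%H"]
    if since:
        cmd.append(f"--since={since}")  # keep commits newer than this date
    if until:
        cmd.append(f"--until={until}")  # keep commits older than this date
    return cmd


def commits_between(repo_path: str,
                    since: Optional[str] = None,
                    until: Optional[str] = None) -> List[str]:
    """Run git inside `repo_path` and return the matching hashes, newest first."""
    result = subprocess.run(
        git_log_command(since, until),
        cwd=repo_path,
        check=True,           # raise if git exits with an error
        capture_output=True,  # requires Python 3.7+
        text=True,
    )
    return result.stdout.split()
```

For example, `commits_between("/path/to/repo", since="2018-01-01", until="2018-02-01")` would return the hashes committed in January 2018; the path and dates here are placeholders.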