src-d / hercules

Gaining advanced insights from Git repository history.

Reconsider bold statement #352

Closed · p3k closed this issue 4 years ago

p3k commented 4 years ago

In the README it says:

> what git-of-theseus does but much faster.

This is a claim I cannot support from my experience using both git-of-theseus and hercules on the same repo (granted, it is a huge one).

If you need numbers, I’ll provide them. Just let me know which ones you are interested in.

vmarkovtsev commented 4 years ago

Well, that claim was backed by a benchmark on hundreds of open-source repos from GitHub back in 2017. Since then, we have tested on roughly the 300k most-starred repos with great success. Perhaps your repo is an outlier that contains nasty edge cases that Hercules does not handle well by default (like adding and removing thousands of files in one commit). I can amend the claim to something like "... does but much faster, except for one proprietary monster, though nobody can verify that" 😂

Now seriously: I can try to help you with identifying the particular bottleneck in Hercules and mitigating it. Please run it with --profile on a subset of --commits that is not painful to wait for.
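For example, something along these lines (paths and the commit count are placeholders; check hercules --help for the exact semantics of --commits):

# take a manageable slice of history, oldest first, then profile hercules on it
git -C /path/to/repo rev-list --reverse HEAD | head -n 1000 > commits.txt
hercules --burndown --profile --commits commits.txt /path/to/repo > results.yml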

p3k commented 4 years ago

thanks for the reply, going to run the command with the flags presumably tomorrow.

in the meantime i would be curious about verifying the benchmark results of the hundreds of OS repos from GitHub as well as the 300k most-starred repos. could you provide a link?

p3k commented 4 years ago

btw, after approx. 2 hours of running, the hercules command finished. now i am running labours and all i get is the message Reading the input... and a blinking cursor – am i doing this right?

vmarkovtsev commented 4 years ago

Regarding the links: this was internal to source{d}, which is dead nowadays. To be precise, we ran Hercules over PGA (the Public Git Archive). It was very painful and took months, I must say.

Regarding labours, if you did not specify --pb, it must be trying to parse a huge YAML. Depending on whether you are running in docker or not, PyYAML is probably defaulting to the pure Python parser, and that guy is really slow. If it was with --pb then indeed there is much data.
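A quick way to check which parser you ended up with (assuming you run it in the same Python environment as labours):

# prints True if PyYAML can use the fast libyaml C parser, False if it falls back to pure Python
python -c "import yaml; print(getattr(yaml, '__with_libyaml__', False))"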

BTW 2 hours is a big success if your repo is huge. It takes no less than 6 hours for Tensorflow, which we consider moderately sized. How many commits?

p3k commented 4 years ago

> Regarding the links: this was internal to source{d}, which is dead nowadays. To be precise, we ran Hercules over PGA (the Public Git Archive). It was very painful and took months, I must say.

oh, too bad, but understandable nonetheless.

> Regarding labours, if you did not specify --pb, it must be trying to parse a huge YAML. Depending on whether you are running in docker or not, PyYAML is probably defaulting to the pure Python parser, and that guy is really slow. If it was with --pb then indeed there is much data.

ah ok, learning here. is labours -f pb what you mean? (--pb seems to be recognized by hercules only.) should i also run hercules with --pb then?

> BTW 2 hours is a big success if your repo is huge. It takes no less than 6 hours for Tensorflow, which we consider moderately sized.

oh ok i see. did you try git-of-theseus with tensorflow? :joy_cat:

> How many commits?

21154
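
for reference, a commit count like that can be obtained with plain git:

git rev-list --count HEAD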

p3k commented 4 years ago

still not sure i am doing this right… the readme says to issue these two commands for the project burndown:

hercules --burndown
labours -m burndown-project

could it be i need to either combine both commands via a pipe or temporarily save the hercules output to a file?

vmarkovtsev commented 4 years ago

The recommended flow is:

hercules --pb >results.pb
labours -i results.pb

Of course you can pipe hercules --pb | labours, but as soon as you want to try different plotting parameters you'll have to wait another 2 hours. If the repo were small, either way would be OK, even with YAML. Now that you've got YAML, there is no converter to PB, so either re-run hercules with --pb or give Python some time.
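With the burndown flags from the README spelled out, the piped variant would look something like this (the repository path is a placeholder):

# one-shot: no intermediate file, but re-plotting means re-analyzing
hercules --burndown --pb /path/to/repo | labours -f pb -m burndown-project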

Yeah, I need to run theseus on Tensorflow, a good idea.

2 hours for 20k commits looks normal. There are nasty repos from NDA clients which take days if the command-line arguments are not tuned.

p3k commented 4 years ago

ok so i now got some nice burndown and ownership charts, thanks for the assistance.

regarding the latter, is it true i have to rerun hercules whenever i change the people dictionary? wouldn’t it be more efficient (if at all possible) to apply those entries when running labours?

vmarkovtsev commented 4 years ago

> is it true i have to rerun hercules whenever i change the people dictionary

There is --exact-signatures but the support for merging them according to a specific identity dictionary is not implemented in labours yet. PRs welcome :smile:
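For reference, the dictionary in question is the plain-text identity file hercules reads via --people-dict (flag name and format from memory, so double-check hercules --help); each line merges the aliases of one developer, roughly:

# hypothetical example: names/emails of the same person, separated by |
Jane Doe|jane@corp.example|jdoe@users.noreply.github.com
John Smith|john.smith@corp.example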

I am happy that I could help :+1: Shall I close?

p3k commented 4 years ago

gonna do that for you :smile_cat: