vifactor / repostat

Inspired by gitstats project: git repository desktop analyzer
GNU General Public License v3.0

Report progress when fetching blame data #182

Closed pulkomandy closed 4 years ago

pulkomandy commented 4 years ago

I ran the new version of repostat on https://git.haiku-os.org/haiku/ (well, I used version 2.1.1, because 2.1.2 was not yet released when I started generating the stats :) )

It has been computing blame data for 22 hours now and is still running. It apparently runs 7 threads on my old 2-core machine and uses all available CPU. There is no progress indication, so I don't know how long it will run.

vifactor commented 4 years ago

Hi, sorry, I noticed the regression a day after the release. Yesterday I released a patch version which fixes the issue. Unfortunately, the blame-data fetch is indeed very slow. I found that a multithreaded fetch accelerates it by about a factor of two, which is of course not a cure for repos with lots of files. Meanwhile, progress indication is a bit complicated in the multithreaded case, so for the sake of simplicity I removed it.

On the other hand, I was thinking of deprecating the --contribution option in v2.3.x, since the most interesting metrics are obtained from blame data. Apparently, I have to reconsider this deprecation...

I would appreciate it if you waited until the end, though, and told me how long it took :)

vifactor commented 4 years ago

#177 addresses the issue

pulkomandy commented 4 years ago

Yes, I'm not in too much of a hurry for the stats, and it's running on my server machine, so I won't power it off. I'd say this is acceptable if the results can somehow be cached so that they don't need to be recomputed every time I run repostat. Otherwise I'll probably keep it disabled for such large repos.
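
(A minimal sketch of the caching idea suggested here, keyed by the commit the blame was computed at; the function name, cache layout, and paths are hypothetical, not anything repostat actually implements:)

```python
# Hypothetical sketch: persist blame results keyed by the HEAD commit,
# so re-running on the same commit skips the expensive recomputation.
import json
import os


def load_or_compute_blame(head_sha, compute, cache_dir=".repostat_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, f"blame-{head_sha}.json")
    if os.path.exists(cache_file):
        # cache hit: reuse the result computed for this exact commit
        with open(cache_file) as fh:
            return json.load(fh)
    result = compute()  # the slow blame pass
    with open(cache_file, "w") as fh:
        json.dump(result, fh)
    return result
```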

vifactor commented 4 years ago

https://stackoverflow.com/questions/5666576/show-the-progress-of-a-python-multiprocessing-pool-map-call
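
(The gist of the linked answer, sketched minimally: replace `Pool.map()` with `Pool.imap_unordered()` so results arrive as workers finish, which lets the main process count and report completions. `blame_file` and the file list are stand-ins, not repostat's actual API:)

```python
import multiprocessing


def blame_file(path):
    # stand-in for the real per-file blame work
    return path


def run_with_progress(paths):
    # imap_unordered yields each result as soon as a worker finishes,
    # so progress can be printed from the main process
    with multiprocessing.Pool() as pool:
        done = 0
        for _ in pool.imap_unordered(blame_file, paths):
            done += 1
            print(f"\rblamed {done}/{len(paths)} files", end="", flush=True)
    print()


if __name__ == "__main__":
    run_with_progress([f"file_{i}.cpp" for i in range(100)])
```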

pulkomandy commented 4 years ago

Well, it ran for several days, and apparently the machine eventually ran out of memory and killed it. The out-of-memory condition may have other causes: the machine is somewhat busy with other things, has just enough RAM for all of it, and a rather small swap partition.

vifactor commented 4 years ago

Hi, thanks for the update. I am now curious to run repostat on your repo :)

Pandas is known for being greedy for RAM; perhaps it is worth thinking about deleting dataframes when they are no longer needed.
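
(A hedged sketch of that suggestion: drop references to a large intermediate DataFrame as soon as it has been aggregated, then invoke the garbage collector. The names and columns are illustrative only:)

```python
import gc

import pandas as pd


def summarize(records):
    df = pd.DataFrame(records)                    # large intermediate frame
    totals = df.groupby("author")["lines"].sum()  # small result we keep
    del df        # release the big frame as early as possible
    gc.collect()  # encourage prompt memory reclamation
    return totals


if __name__ == "__main__":
    print(summarize([{"author": "a", "lines": 10},
                     {"author": "b", "lines": 5}]))
```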

vifactor commented 4 years ago

There is a very nice module, https://github.com/tqdm/tqdm, which does what is needed.
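
(A minimal sketch of combining tqdm with the multiprocessing approach above, wrapping the iterator returned by `imap_unordered`; `blame_file` and the file list remain hypothetical stand-ins:)

```python
import multiprocessing

from tqdm import tqdm


def blame_file(path):
    return path  # stand-in for the real per-file blame work


if __name__ == "__main__":
    paths = [f"file_{i}.cpp" for i in range(1000)]
    with multiprocessing.Pool() as pool:
        # tqdm renders a progress bar as completed results stream in
        for _ in tqdm(pool.imap_unordered(blame_file, paths),
                      total=len(paths), unit="file"):
            pass
```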

vifactor commented 4 years ago

Started repostat on haiku repo.

The history data fetch was comparatively fast: ~15 mins. So far not bad for blame data either:

`10%|▉ | 2501/25696 [19:56<2:54:03, 2.22it/s]`

vifactor commented 4 years ago

Apparently, for some files in the haiku repo, blame works extremely slowly:

| filename | hunks count | time to blame (s) |
| --- | ---: | ---: |
| [...] | | |
| src/apps/drivesetup/DiskView.cpp | 222 | 24.87 |
| src/apps/drivesetup/DiskView.h | 17 | 20.56 |
| src/apps/drivesetup/DriveSetup.cpp | 43 | 362.5 |
| src/apps/drivesetup/DriveSetup.h | 10 | 356.8 |
| src/apps/drivesetup/DriveSetup.rdef | 9 | 251.5 |
| [...] | | |
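
(Per-file timings like these could be gathered with pygit2 roughly as sketched below; the repository path and file list are assumptions for illustration:)

```python
import time

import pygit2

# assumed local checkout of the haiku repo
repo = pygit2.Repository("/path/to/haiku")

for path in ["src/apps/drivesetup/DiskView.cpp",
             "src/apps/drivesetup/DriveSetup.cpp"]:
    start = time.monotonic()
    blame = repo.blame(path)        # blame the file at HEAD
    hunks = len(list(blame))        # iterate to force full computation
    print(f"{path}\t{hunks} hunks\t{time.monotonic() - start:.2f}s")
```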
vifactor commented 4 years ago

@pulkomandy, here is the time it took to finish blame data collection for the haiku repo on my laptop (4 cores, 8 GB RAM):

`100%|██████████| 25696/25696 [39:41:52<00:00, 5.56s/it]`

I keep finding that the choice of pygit2 as a tool for git data processing was not ideal. This time it is because of https://github.com/libgit2/libgit2/issues/3027
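
(One conceivable workaround, sketched here purely as an editorial suggestion and not anything repostat does: shell out to the git CLI, whose blame implementation is not affected by the linked libgit2 issue. The parsing shown is deliberately simplified:)

```python
import subprocess


def git_blame_authors(repo_path, file_path):
    # --line-porcelain emits a machine-readable record for every line
    out = subprocess.run(
        ["git", "-C", repo_path, "blame", "--line-porcelain", file_path],
        capture_output=True, text=True, check=True,
    ).stdout
    # count blamed lines per author as a trivial parsing example
    counts = {}
    for line in out.splitlines():
        if line.startswith("author "):
            author = line[len("author "):]
            counts[author] = counts.get(author, 0) + 1
    return counts
```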