WIP: Use NVD3 for graphs

pulkomandy commented 4 years ago

From https://github.com/bagder/gitstats/commits/master

vifactor commented 4 years ago

I generally agree with migration from png to svg, but with the current PR images look enormous on the page. I suppose you missed changes to css file that went along in original commit: https://github.com/bagder/gitstats/commit/c1c78d7f195e977dc5a16cb42bfc8a583838ae7c

pulkomandy commented 4 years ago

Yes, they use 100% of the width now. The 70% in the original was to leave space for the table on the right in the original gitstats, but that is gone now. But I agree it doesn't look great for all graphs. I think we'd need to keep different width for some of them, I'll see if I can do something like that.

pulkomandy commented 4 years ago

Proof of concept using NVD3 to generate SVG graphs. I have to convert more graphs, for now I did only "recent activity" because that was the easiest one to experiment with. I did not add much interactivity in there yet.

pulkomandy commented 4 years ago

I have added the two main graphs in the Authors page. Did a test run as usual at http://pulkomandy.tk/repostat-test (will replace it whenever I test something else). It has both the old graphs and the new ones for easy comparison (of course in the submitted code the old graphs are removed). (note that the repo I tested with has a commit with an invalid date in 2002, which leads to some strange results. Not repostat fault, it's a real problem in that git history)

That's all from me for today, there are two remaining graphs (email domains, and total number of files) that I may do later. And it would make sense to move some of the javascript code I wrote to Python, but it was quicker to experiment this way in order to determine how exactly the data needs to be preprocessed.

For both graphs in the authors page, I had to subsample the data to get acceptable performance on my system. I think the result is visually very similar, and good enough for my needs.

vifactor commented 4 years ago

Hi! The graphs look definitely much cooler than static pngs. I have a couple of concerns, though:

javascript is not my language, so for me it is a bad option to have code which I cannot fully maintain
I wonder how mature is the library you use, how well it is going to be maintained and how functional it is

I am strongly motivated to use some solution which would make it possible to have interactive graphs and in the mean time write code in Python. I did small investigation before and at that time Bokeh looked the most attractive. I have not had time to start working on that.

I appreciate you efforts, but do not feel like merging javascript code to master. I may propose you to create v.1.4.x branch where this code can go. Hope you can understand.

pulkomandy commented 4 years ago

Yes, it's fine, I'm not much of a Javascript developer myself. I don't understand everything of what I did here either.

I just did this to quickly experiment with how the graphs would look. Doing it in javascript allowed me to do that without regenerating the whole stats everytime. But it is not the right way to do things in production. I'll move more of the code to the Python side eventually, which will make the page render faster.

vifactor commented 4 years ago

You know, I have changed my mind. I do not like the gnuplot-dependency and there are much more people who understand javascript than those who use gnuplot, so if you manage to replace all graphs with NVD3, we will make it merge to master. I am not planning to work on new graphs until I finalize the transition to pandas and implement cache-functionality for performance. What you are doing should not conflict with my activity, so no big merge conflicts are expected. Still I would suggest to rebase regularly on master

pulkomandy commented 4 years ago

Yes, I've started to move most of the code to the Python side still. The Javascript should be just what's needed to show the graph from precomputed data, no heavy processing should be done there. I'll also be converting the two remaining graphs (and probably use a piechart for email domains)

vifactor commented 4 years ago

I cannot check the new graph because of

Fetch API cannot load file:///home/vifactor/output/commits_by_year_month.dat. URL scheme must be "http" or "https" for CORS request.

pulkomandy commented 4 years ago

Apparently we need to allow fetching from local files like this:

const request = await fetch('recent_activity.dat', {mode: 'no-cors'});

pulkomandy commented 4 years ago

I have the streamgraph working and fixed the issue with fetching from local files. Sample work in progress: http://pulkomandy.tk/repostat-test/authors.html

Note on the input data:

For the line and commits charts, I need sample points made of a timestamp, and the corresponding line and commit count, separated by author. Something like this:

{"key": "author1", "values": [ {"timestamp": 1234, "commits": 12, "lines" 124"}, {"timestamp": 1260, "commits": 13, "lines" 168"} ] }, {"key": "author2", "values": [ {"timestamp": 1237, "commits": 10, "lines" 120"}, {"timestamp": 1280, "commits": 11, "lines" 1687"} ] }

The timestamps for the different authors do not have to match. Currently I have one timestamp for each commit made by a given author, but for a large repository this may be too much (I'll try to generate one with 60000 commits and see if the graph is working). It should be possible to subsample (include only one every 10 commits or so), but I'm not sure what's the right strategy for this (skipping a number of commits, do something based on timestamps, or something else). I think including the whole data is good enough for now, with 50000-100000 data points it should still be reasonable, for larger repos I don't know. The timestamps are in milliseconds since epoch.

The commits and lines can be put together as above, or in separate json files. Putting it together means we need to fetch and parse only one json file for both graphs so it would be faster (it adds 1 line of Javascript to tell each graph which series it should use)

There is a visual change for these graphs: the line stops at the last commit an author made. So you can clearly see when an author stopped contributing now.

For the stream graph, we need common timestamps for all authors unlike for the others (all datapoints must be vertically aligned). I did something simple for now and generated one per week, if there was at least 1 commit in that week. But that gives incorrect results for inactive periods (because the stream graph will draw a line between the two datapoints accross the period, which is not correct). So we need one datapoint per week (or month, or any interval that makes sense here).

So, the format is:

{"key": "author1", "values": [[1234, 12], [1235, 10]]},
{"key": "author2", "values": [[1234, 5], [1235, 15]]},

Where 1234 and 1235 are timestamps (still milliseconds since epoch) and the other numbers are commit count for that author during that week. This could also be merged with the dataset for the other graphs, but then they all need to share the same timestamps, so we would get only 1-week resolution for all of them. It is still possible to put them in the same json file as separate arrays if desired.

pulkomandy commented 4 years ago

All graphs now converted and gnuplot no more needed. Still needed finetuning and polishing:

[x] Cleanup the commits per author + stream graph, they are in two separate boxes currently and it doesnt look good
[x] Ideally these two would share the same legend but I'm not sure if this is possible, I'll have a look
[x] Total line/files has a truncated legend on the right (can be solved by manually setting a margin for the graphs) and the color choice isn't great (dark blue and light blue, it should be easy to have more easily differenciated colors)
[x] Streamgraph looks a bit disappointing and the Y axis seems incorrect
[x] Make streamgraph (not plain "stacked") the default view for the streamgraph and maybe remove the two others?
[ ] Generate data for some larger repos to see how performance is when drawing really complex svg and handling larger json files
[ ] Move the javascript files in the repo and package so repostat can still be used fully offline
[x] Bars in the activity per year chart are overlapping
[x] fetch requests from local files don't work in Chrome

vifactor commented 4 years ago

Generate data for some larger repos to see how performance is when drawing really complex svg and handling larger json files

The performance you are talking about here depends on the data sampling we apply for a graph, it does not directly depend on size of an analyzed repo. This is something I am playing with (in particular for the case of authors' history). Hopefully, I will prepare a PR based on https://www.feststelltaste.de/visualize-developer-contributions-with-stream-graphs/ soon and data for authors will look like:

                                 commits count
author_datetime Adrien Destugues    Viktor Kopp
2018-08-05  0.0 8.0
2018-08-12  0.0 18.0
2018-08-19  0.0 23.0
...
2020-03-08  1.0 179.0
2020-03-15  1.0 189.0
2020-04-05  1.0 190.0
2020-04-12  1.0 198.0

Notice the date-sampling (here week, but can be adjusted). I hope nvd3 can handle such date formats otherwise backward transformation to unix epoch would be needed.

pulkomandy commented 4 years ago

I made the streamgraph use month-by-month intervals because weeks are clearly too small and make the output look noisy. It looks like month is still a bit noisy as well, so indeed using quarters as in the article may make sense. But on the other hand, for young repos with only a year or two of commits, that will not make a lot of data points.

So we may want to pick week, months or quarters depending on the project full duration, maybe? I'll go with months as a reasonable compromise for now.

vifactor commented 4 years ago

So we may want to pick week, months or quarters depending on the project full duration, maybe?

yep, that is the plan

pulkomandy commented 4 years ago

I have rebased the changes (still need to remove the use of fetch). I made the conversion to the data format expected by nvd3 and it was quite straightforward (but maybe not the most efficient way to do things, let me know).

There is, however, a problem with the way pandas samples the data ("grouping" if I understand things correctly): inactive weeks get no entry at all.The result is that the cursor for showing the values (on mouseover) will jump over these empty periods, and more annoyingly, nvd3 tries to interpolate and it creates glitches in the graphs. In particular:

Very visible in the streamgraph
The line count when generating stats for repostat itself obviously does not start at 0. There should be an empty "initial week" at the start of the data to fix this

Test repo as usual: http://pulkomandy.tk/repostat-test/authors.html

For the dateformat, pandas generates epoch dates just fine.

vifactor commented 4 years ago

still need to remove the use of fetch

this is something I would really appreciate to have ASAP, because I still haven't managed to open generated report

There is, however, a problem with the way pandas samples the data ("grouping" if I understand things correctly): inactive weeks get no entry at all.

perhaps, I had the same issue for weekday-hour activity. You do transformation from pandas datastructures to json/csv, should not be difficult to fill gaps with .get(<key>, <default>) at that point

Stream graph indeed looks quite disappointing, not the way it looks in the article I posted. There are several parameters in streamgraph setting, perhaps, it is possible to adjust something to make it look better.

Anyway, some graphs look much-much modern than they looked with gnuplot, so I will appreciate if you keep going with the PR.

vifactor commented 4 years ago

Do you still have will to work on this PR? Although there are minor issues (axis naming, lines are too thin, etc.), in principle, I am ready to merge it. Not sure if I want to have streamgraph there, though. Would you mind if I ask you to separate streamgraph into another PR, which we will hold for some time until we figure out how to make it look better?

pulkomandy commented 4 years ago

I'll look into the axes and see if I can fix the missing rows in the data somehow, that should help with the streamgraph. Also I'm considering adding the js and css from d3 and nvd3 to the assets so the generated html is standalone and can be used fully offline.

I think I can get this ready to merge by tomorrow (with fixed streamgraph if I find a way, or I'll move it to a separate commit otherwise)

pulkomandy commented 4 years ago

I have fixed the streamgraph, pushed the changes so you can see what I mean with the missing 0s. With an integration period of 1 week it does not look so great on repostats stats, month or quarter will probably give nicer results.

There is another problem in the submitted code: I'm generating with max_authors=20 and there is an exception. As you can see I moved the lines-by-authors in the same loop as the commits-by-author. But there is an author who has a commit, but no line count change (Ernesto Jimenez). As a result his name is not found in the line stats data, and there is an exception. It seems the lines added graph uses the top 20 authors sorted by lines added, so it is not always the same top 20 as the other stats. Is that intentional? If so, I'll keep separate jsons for it.

vifactor commented 4 years ago

It seems the lines added graph uses the top 20 authors sorted by lines added, so it is not always the same top 20 as the other stats. Is that intentional?

I think it was a random decision, although it makes some sense: in the graph for number of added lines, on the top might be authors who.. added most of the lines. If you want to aline it with other graphs/tables - top consists of authors did most of the commits - feel free to change.

May I ask you a favor: I have one PR ready to be merged, which makes some (not very terrible) conflicts with yours. I could fix it later, but I suppose for you those conflicts would be much easier to fix. Would you mind if I merge first?

pulkomandy commented 4 years ago

Well, the current version fails to generate anything so that's not so nice. Let me separate the data for the line-by-author graph again, and then you can merge

pulkomandy commented 4 years ago

Should be fine now :)

vifactor commented 4 years ago

I have merged, please, rebase and let us merge it!

pulkomandy commented 4 years ago

Rebased and ready to go. There is a difference in the "current year months" graph, it doesn't show the future months. I think that's actually better, but it will probably look not that great in january with a single bar. Well, we can improve it later.

Ready to merge for me, let's improve from there!

vifactor commented 4 years ago

Thank you for your great contribution! Getting rid of gnuplot makes possible to have repostat on PyPi as a pure python package now!

vifactor / repostat

WIP: Use NVD3 for graphs #94