smontanari / code-forensics

A toolset for code analysis and report visualisation

Hotspot analysis doesn't work after merging two Git repositories. #5

Closed mhenrichsen closed 6 years ago

mhenrichsen commented 7 years ago

Hi @smontanari,

I've tried this method to merge two repositories without messing with the history.

It seems to work fine, but when I run the hotspot analysis there are no hotspots: all dots have the same color.

Here is a picture.

Any ideas?

smontanari commented 7 years ago

The symptom suggests that your analysis results do not contain any complexity and/or lines-of-code metrics (depending on which diagram you're trying to visualise). It could be just a matter of configuring code-forensics the right way. If your code is in an open repo, let me know which one and I can give it a quick try.

dlehammer commented 6 years ago

The symptom described by @mhenrichsen seems to match my experience with git subtree. I've also attempted to combine multiple git repositories into a single repository, with each repository represented by a sub-directory, in order to run code-forensics on the full code base. I've created an example repo here. The repo was created as follows (using Git v2.14.2):

$ mkdir code-forensics_git_subtree_issue-5 && cd code-forensics_git_subtree_issue-5

code-forensics_git_subtree_issue-5$ git init

code-forensics_git_subtree_issue-5$ echo "Workaround for fatal: /usr/lib/git-core/git-subtree cannot be used without a working tree." > dummy.txt

code-forensics_git_subtree_issue-5$ git add .

code-forensics_git_subtree_issue-5$ git commit --message="Workaround for fatal: /usr/lib/git-core/git-subtree cannot be used without a working tree."

code-forensics_git_subtree_issue-5$ git subtree add --prefix=grails-cache https://github.com/grails-plugins/grails-cache.git master

code-forensics_git_subtree_issue-5$ git subtree add --prefix=grails-quartz https://github.com/grails-plugins/grails-quartz.git master

Executing git log gives a promising result: my 3 commits plus the full history (screenshot: code-forensics_git_subtree_issue-5_git_log).

Unfortunately, $ gulp hotspot-analysis --dateFrom=2007-05-07 doesn't provide the expected result (screenshot: code-forensics_git_subtree_issue-5_hotspot_analysis).

I.e. the number of commits seems to be equal across the repo. How can that be? A visual representation gives a clue (screenshot: code-forensics_git_subtree_issue-5_visual_git_log).

Further probing provides the following clue for a file with 100+ commits in the original repository (screenshot: code-forensics_git_subtree_issue-5_visual_git_log_for_file).

I've tried to research the underlying issue, i.e. retrieving the full history via git log; this is the most informative discussion I've found.

Conclusion: this seems to be a git issue.

dlehammer commented 6 years ago

Hotspot analysis works as expected for a single git repository, in this example grails-quartz (screenshot from 2017-10-13 15-42-37).

dlehammer commented 6 years ago

I couldn't let this issue go, as additional digging revealed there are several approaches for merging git repositories into a single git repository, in separate sub-directories, without losing file history. As described below, I suspect there's an issue here and that it may be possible to tease out a fix.

I've tried several approaches, for example git-merge-repos, git-stitch-repo, and "How do you merge two Git repositories?". They all seem to produce full history for git log, but the result is seemingly incompatible with code-forensics.

Common to the above approaches is that code-forensics doesn't produce a "Revision churn level" above 1, and the log output changes when run on merged repositories. I've uploaded a git-merge-repos example here.

When executing code-forensics on a single git repository, as described in the above comment, the following bolded line is present in the log.

... Starting 'vcs-log-dump'... Fetching git log from 2015-01-01 to 2017-10-17 ...

When executing code-forensics on a merged git repository (example), the bolded line above is omitted! Instead, the following bolded line is present in the log.

... Starting 'hotspot-analysis'... Can't determine weight of collection. Assigning a value of 0 to every item. ...

Regardless, the full file history seems to be present as far as git is concerned.

screenshot from 2017-10-17 15-58-37

Utilizing git log on the same example as above, provides the full file history as expected.

screenshot from 2017-10-17 15-59-44

I've tried digging around in the source for code-forensics and found the following in git_adapter.js. I expected that removing or changing the flag would solve this issue.

gitlog_analysis: ['log', '--all', '--numstat', '--date=short', '--no-renames', ...

Unfortunately, my limited experience with node.js has blocked my attempts to determine why this line is only executed for a single git repository.
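The two log excerpts above suggest a quick mechanical check: capture the output of a run and look for the two marker lines. Below is a minimal sketch. The function name `diagnose_run_log` is hypothetical, and the marker strings are copied from the excerpts quoted in this comment rather than from the code-forensics source, so they may need adjusting.

```shell
# Hypothetical helper: reads a captured run log on stdin and reports
# whether each of the two marker lines quoted in this thread is present.
diagnose_run_log() {
  awk '
    /Fetching git log from/                { fetched = 1 }   # seen in the single-repo run
    /Can.t determine weight of collection/ { noweight = 1 }  # seen in the merged-repo run
    END {
      print "git log fetched: " (fetched ? "yes" : "no")
      print "weight warning: " (noweight ? "yes" : "no")
    }'
}
```

For example, `gulp hotspot-analysis --dateFrom=2007-05-07 2>&1 | diagnose_run_log` would summarise a run without reading the whole log.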

dlehammer commented 6 years ago

To support reproducibility, the git-merge-repos example was created using the following steps:

1. clone repositories

    ~/tmp$ git clone --mirror https://github.com/grails-plugins/grails-cache.git
    ~/tmp$ git clone --mirror https://github.com/grails-plugins/grails-quartz.git

2. move content into sub-directory

    ~/tmp/grails-cache.git$ git filter-branch --index-filter \
        'tab=$(printf "\t") && git ls-files -s --error-unmatch . >/dev/null 2>&1; [ $? != 0 ] || (git ls-files -s | sed "s~$tab\"*~&grails-cache/~" | GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info && mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE")' \
        --tag-name-filter cat \
        -- --all

    ~/tmp/grails-quartz.git$ git filter-branch --index-filter \
        'tab=$(printf "\t") && git ls-files -s --error-unmatch . >/dev/null 2>&1; [ $? != 0 ] || (git ls-files -s | sed "s~$tab\"*~&grails-quartz/~" | GIT_INDEX_FILE=$GIT_INDEX_FILE.new git update-index --index-info && mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE")' \
        --tag-name-filter cat \
        -- --all

3. merge repositories

    ~/tmp$ git clone https://github.com/robinst/git-merge-repos.git

    ~/tmp/git-merge-repos$ ./run.sh /home/dleh/tmp/grails-cache.git:. /home/dleh/tmp/grails-quartz.git:.
    Started merging 2 repositories into one, output directory: /home/dleh/tmp/git-merge-repos/merged-repo
    ...
    Done, took 4437 ms
    Merged repository: /home/dleh/tmp/git-merge-repos/merged-repo

The resulting merged repository can be found in `~/tmp/git-merge-repos/merged-repo`, see also [example](https://github.com/dlehammer/code-forensics_git_merge-repos_issue-5).

smontanari commented 6 years ago

Thank you @dlehammer for the extensive report on your attempts to dig into this.

code-forensics needs to produce a git log in a particular format in order for it to be parseable by code-maat. The code in the module git_adapter.js simply wraps git commands and streams the output back to the program so it can be parsed accordingly. In particular, the repo log/history information necessary for many of the analyses is retrieved through a git log command with the parameters you already identified, i.e.:

git log --all --numstat --date=short --no-renames --pretty=format='--%h--%ad--%an'

Have you tried to manually run this command in your merged git repo? Is the output of this command different from when it's executed on a normal repository?
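One way to compare the two logs beyond eyeballing them is to count how many commits touch each file: a merged repo where every file shows exactly one revision would match the reported symptom. The following is a minimal sketch, assuming the numstat lines are the usual three tab-separated fields (added, deleted, path). `count_revisions` is a hypothetical helper and not part of code-forensics or code-maat.

```shell
# Hypothetical helper: reads `git log --numstat` style output on stdin and
# prints "<revisions> <path>" per file, most frequently changed first.
count_revisions() {
  # Only numstat lines have exactly three tab-separated fields;
  # commit header lines and blank lines are skipped.
  awk -F '\t' 'NF == 3 { revs[$3]++ } END { for (f in revs) print revs[f], f }' |
    LC_ALL=C sort -k1,1nr -k2
}
```

Piping the output of the git log command quoted above into `count_revisions | head` in both the merged repo and one of the original repos should show whether the per-file revision counts survive the merge.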

dlehammer commented 6 years ago

> git log --all --numstat --date=short --no-renames --pretty=format='--%h--%ad--%an'
>
> Have you tried to manually run this command in your merged git repo? Is the output of this command different from when it's executed on a normal repository?

As far as I can tell, the output format is identical between repositories and both produce output (git log ... diff).

My main suspicion is still that the command isn't executed because some error blocks the flow at an earlier stage, as described in the above comment:

> When executing code-forensics on a single git repository, as described in the above comment, the following bolded line is present in the log.
>
> ... Starting 'vcs-log-dump'... Fetching git log from 2015-01-01 to 2017-10-17 ...
>
> When executing code-forensics on a merged git repository (example), the bolded line above is omitted! Instead, the following bolded line is present in the log.
>
> ... Starting 'hotspot-analysis'... Can't determine weight of collection. Assigning a value of 0 to every item. ...

smontanari commented 6 years ago

@dlehammer I cloned your example repo and had no problem running the hotspot-analysis (ha-screenshot).

This is an extract of the output:

[15:02:52] Using gulpfile ~/temp/merged-git_code_forensics/gulpfile.js
[15:02:52] Starting 'sloc-report'...
...
[15:02:52] Finished 'sloc-report' after 172 ms
[15:02:52] Starting 'code-stats-reports'...
[15:02:52] Finished 'code-stats-reports' after 58 μs
[15:02:52] Created: vcslog_normalised_2015-01-01_2017-10-17.log
[15:02:52] Finished 'vcs-log-dump' after 113 ms
[15:02:52] Starting 'revisions-report'...
[15:02:56] Finished 'revisions-report' after 3.84 s
[15:02:56] Starting 'hotspot-analysis'...
[15:02:56] Generating report file 2015-01-01_2017-10-17_revisions-hotspot-data.json
[15:02:56] Open the following link to see the results:
[15:02:56] http://localhost:3000/index.html?reportId=7b228da598b9213fc610a513f7be9f3e2da49fbe
[15:02:56] Finished 'hotspot-analysis' after 23 ms

I also checked the vcs log file produced and it looks OK. Maybe you're missing something in your setup? I suggest you try running the analysis with the COMMAND_DEBUG=1 env variable; the more verbose output may show more information.

dlehammer commented 6 years ago

Well, this is a bit unsettling, but I'm able to run the analysis successfully for the example repo - just like you now.

I suspect something's changed in my environment since last time the symptom manifested itself, but I haven't taken any active steps in this regard myself - hence I'm unable to tell what's affected the outcome.

This is my current setup, as best as I can gather:

Thank you for your patience :+1:

smontanari commented 6 years ago

No worries, I'll close the issue for now.