platisd / duplicate-code-detection-tool

A simple Python3 tool to detect similarities between files within a repository
MIT License

Feature proposal: add an option to add LoC in outputs #25

Closed by Cael35 1 year ago

Cael35 commented 1 year ago

Hello, to help with analyzing the results, it would be nice to add an option --with-loc that adds the lines-of-code count for each file. The outputs will be:

platisd commented 1 year ago

Should LOC be its own "column" or appended to the filename? :thinking:

Cael35 commented 1 year ago

You mean for the text output? I find it easy to read like this, but a column is good too. Which do you prefer? And of course I volunteer to do it. :-)

platisd commented 1 year ago

If it's easier to read then let's have it as you suggested. :heavy_check_mark:

So, to begin with, I suggest there should be an optional flag for this, i.e. the behavior should stay as it currently is unless a --show-loc flag is passed to duplicate_code_detection.py.

Not sure how you are thinking of implementing it, but please remember this tool is also used in GitHub Actions (run_action.py), so even though I don't necessarily believe this feature should also be exposed to the GitHub Actions users, the behavior there shouldn't be disrupted. Unless of course you'd like to add this to the GitHub Action too and have the users opt-in. :}
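
For readers following along, the opt-in behavior described above could be sketched roughly as follows. This is only an illustration under assumptions: `build_parser` and `label` are hypothetical helper names, and the "(N)" suffix is a made-up format, since the thread does not show the tool's actual output.

```python
import argparse


def build_parser():
    # Hypothetical parser; the real tool defines more arguments than this.
    parser = argparse.ArgumentParser(description="Sketch of an opt-in LoC flag")
    parser.add_argument(
        "--show-loc",
        action="store_true",  # defaults to False, so existing behavior is unchanged
        help="Show the raw line count next to each file name",
    )
    return parser


def label(file_name, line_count, show_loc):
    # The "(N)" suffix is an illustrative format, not the tool's actual output.
    return f"{file_name} ({line_count})" if show_loc else file_name


# Without the flag the label is just the file name, as before.
args = build_parser().parse_args([])
print(label("a.py", 10, args.show_loc))

# With the flag the line count is appended to the file name.
args = build_parser().parse_args(["--show-loc"])
print(label("a.py", 10, args.show_loc))
```

Because the flag uses `store_true`, callers that never pass it (for example a wrapper script such as run_action.py) would see no change in output.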

Cael35 commented 1 year ago

OK, let's start with duplicate_code_detection.py --show-loc. :-) And of course, the default behavior won't change.

Cael35 commented 1 year ago

I have a version, but I have 2 questions. :-) (1) Do you mind if I sort the source code files? It only has a small impact on performance, and the main advantage is that it makes the results reproducible, which would help with automating the tests. (2) I wrote a simple function to get the raw line count of a file (including blank lines and comments). I could report the actual LoC instead, but that would introduce a dependency on cloc. Which do you prefer?

platisd commented 1 year ago

> (1) Do you mind if I sort the source code files? It only has a small impact on performance, and the main advantage is that it makes the results reproducible, which would help with automating the tests.

Do you mean alphabetically? If yes, sure, go for it.
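
As a rough illustration of the alphabetical ordering discussed here, a file listing can be made deterministic by sorting the collected paths. The function name `list_source_files` is hypothetical and is not taken from the tool's code.

```python
import os


def list_source_files(directory):
    # Collect every file path under the directory. os.walk makes no
    # ordering guarantee, so the result is sorted before returning it.
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            found.append(os.path.join(root, name))
    # Alphabetical order makes repeated runs process files identically.
    return sorted(found)
```

Sorting once up front costs O(n log n) on the number of files, which is usually negligible next to the similarity computation itself.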

> (2) I wrote a simple function to get the raw line count of a file (including blank lines and comments). I could report the actual LoC instead, but that would introduce a dependency on cloc. Which do you prefer?

Since this will be an opt-in feature, let's avoid breaking existing usage. So either implement your own, or if cloc is a Python module, then only import it if the user has opted in with the --show-loc argument. In any case, please make sure to update the README.md. :grin:
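
The "only import it if the user has opted in" idea can be sketched like this. It is a generic pattern rather than anything from the tool: `load_loc_backend` is a hypothetical helper, and the module name is passed in as a parameter because no specific Python module is named in the thread.

```python
import importlib


def load_loc_backend(show_loc, module_name):
    # Skip the import entirely unless the user opted in, so users who
    # never pass the flag do not need the optional dependency installed.
    if not show_loc:
        return None
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        # Fail with a clear message only for users who asked for the feature.
        raise SystemExit(f"--show-loc requires the '{module_name}' module: {error}")
```

With this shape, a missing optional module only affects runs that request the feature.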

Cael35 commented 1 year ago

(1) Yes, alphabetically. (2) Unfortunately, cloc is a command-line tool written in Perl, so I'll keep the simple implementation and specify in the help text that it is the raw line count.
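
A raw line count of the kind described (blank lines and comments included) needs nothing beyond the standard library. The sketch below is an assumption about how such a helper might look, not the code that was actually submitted; `raw_line_count` is a hypothetical name.

```python
def raw_line_count(path):
    # Count every line in the file, including blank lines and comments.
    # Binary mode avoids decoding errors on files with unknown encodings.
    with open(path, "rb") as source:
        return sum(1 for _ in source)
```

Note that this differs from what cloc reports: a file consisting only of comments and blank lines still gets a non-zero count here.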