platisd / duplicate-code-detection-tool

A simple Python3 tool to detect similarities between files within a repository
MIT License

feature request: Ignore comments #8

Closed by iwishiwasaneagle 2 years ago

iwishiwasaneagle commented 2 years ago

I'm getting high similarity scores even though, when I compare the files with grep -Fxf FILE1 FILE2, the only lines they share are comments. It would be nice to be able to ignore comments.
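For readers who want to reproduce that check without a shell, here is a minimal Python sketch of the same idea: list the lines of one file that exactly match some line of the other (roughly what grep -Fxf reports), then measure how many of those shared lines are comments. The function names are hypothetical, and the comment test assumes Python-style # comments only.

```python
from typing import List


def shared_lines(patterns_text: str, target_text: str) -> List[str]:
    """Return the lines of target_text that exactly equal some line of patterns_text."""
    patterns = set(patterns_text.splitlines())
    return [line for line in target_text.splitlines() if line in patterns]


def comment_share(lines: List[str]) -> float:
    """Fraction of non-blank lines that are full-line '#' comments."""
    non_blank = [line for line in lines if line.strip()]
    if not non_blank:
        return 0.0
    comments = [line for line in non_blank if line.lstrip().startswith("#")]
    return len(comments) / len(non_blank)
```

A comment share close to 1.0 suggests that the overlap between the two files is mostly comments rather than code.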

platisd commented 2 years ago

It sounds like a nice idea. Here are some remarks I'd like to hear your opinion on:

  1. Should comments be excluded from duplicate checks? Aren't they part of the code? I am referring mostly to meaningful comments and not pre-made headers, e.g. license text, that may be at the top or the bottom of the file.
  2. The programming language needs to be taken into consideration to determine what constitutes a comment.

iwishiwasaneagle commented 2 years ago

> It sounds like a nice idea. Here are some remarks I'd like to hear your opinion on:
>
>   1. Should comments be excluded from duplicate checks? Aren't they part of the code? I am referring mostly to meaningful comments and not pre-made headers, e.g. license text, that may be at the top or the bottom of the file.

For me it's mainly docstrings. They can be the largest contributor to the line count. For example, before my PR (#9) I had two files with a 70% similarity score, and afterwards it was 0%. This is because both files contained a single class that inherited from a base class, and I carried the docstrings over with minor modifications.
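To illustrate how docstrings can be stripped before comparison, here is a minimal sketch using Python's standard ast module. It is not necessarily what PR #9 does; the function name is hypothetical, and it assumes Python 3.9 or newer for ast.unparse. Parsing already discards # comments, so only docstrings need explicit handling.

```python
import ast


def strip_docstrings_and_comments(source: str) -> str:
    """Return source with docstrings removed; comments are lost by parsing."""
    tree = ast.parse(source)
    holders = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, holders):
            continue
        body = node.body
        # A docstring is a bare string constant as the first statement.
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            node.body = body[1:]
            # A class or function must not end up with an empty body.
            if not node.body and not isinstance(node, ast.Module):
                node.body = [ast.Pass()]
    return ast.unparse(tree)
```

Note that ast.unparse also normalizes formatting (quotes, spacing, blank lines), so the output is a canonical form of the code rather than the original text minus its docstrings.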

>   2. The programming language needs to be taken into consideration to determine what constitutes a comment.

Hadn't even considered 2... Well, my PR is completely self-indulgent. Maybe we could add a check to see whether the file is Python or not, and then implement something similar for the other supported languages.

platisd commented 2 years ago

Yep, I guess 1 is taken care of by your pull request.

Regarding 2, I wonder if we can do something smart here so we don't have to manually add support for each language. We currently advertise that C, C++, Java, Python and C# are supported, but that support amounts to little more than looking for specific file extensions. Do you have any idea whether remove_comments_and_docstrings can parse languages other than Python?
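As a rough sketch of how extension-based language detection could gate comment removal: map file extensions to languages, and apply a stripper only when one is registered for the detected language. The extension table, function names and strippers argument below are all assumptions for illustration, not the tool's actual code or its real list of extensions.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

# Assumed extension table; the tool's real list may differ.
LANGUAGE_BY_EXTENSION: Dict[str, str] = {
    ".c": "C",
    ".h": "C",
    ".cpp": "C++",
    ".hpp": "C++",
    ".java": "Java",
    ".py": "Python",
    ".cs": "C#",
}


def detect_language(path: str) -> Optional[str]:
    """Guess the language from the file extension, or return None if unknown."""
    return LANGUAGE_BY_EXTENSION.get(Path(path).suffix.lower())


def preprocess(path: str, source: str,
               strippers: Dict[str, Callable[[str], str]]) -> str:
    """Strip comments only if a stripper is registered for the file's language."""
    stripper = strippers.get(detect_language(path))
    return stripper(source) if stripper else source
```

With this shape, supporting a new language means registering one more stripper function, while files in languages without a stripper pass through unchanged.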

iwishiwasaneagle commented 2 years ago

I believe it's only Python. From what I can tell, it's generally better to use the language's own equivalent of ast. Anything third-party will apparently be quite inferior... I guess a workaround could be removing lines via regex. So for Python, sed '/^\s*#/d' would delete any full-line comment. The problem then becomes detecting docstrings, which span multiple lines.
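The limitation of the line-based workaround is easy to see in a small Python equivalent of the sed command. This is a sketch with a hypothetical function name: it drops lines whose first non-whitespace character is #, and nothing else.

```python
import re

# Matches lines whose first non-whitespace character is '#'.
COMMENT_LINE = re.compile(r"^\s*#")


def drop_comment_lines(source: str) -> str:
    """Remove full-line '#' comments; everything else is kept untouched."""
    kept = [line for line in source.splitlines() if not COMMENT_LINE.match(line)]
    return "\n".join(kept)
```

Multi-line docstrings and trailing comments (code followed by # on the same line) survive this filter, while a line inside a multi-line string that happens to start with # is removed by mistake. Handling those cases correctly requires actually tokenizing or parsing the language.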

iwishiwasaneagle commented 2 years ago

See https://github.com/platisd/duplicate-code-detection-tool/pull/9/files#r838504071

platisd commented 2 years ago

I can imagine that regex isn't the cleanest solution indeed. 🤔 OK, for now we can document that this feature is only supported for Python, and also mention the workaround you just suggested. Can you also update the examples in the README file?

iwishiwasaneagle commented 2 years ago

Updated README in 1e6183b