platisd / duplicate-code-detection-tool

A simple Python3 tool to detect similarities between files within a repository
MIT License

Output as a CSV? #14

Closed davesgonechina closed 2 years ago

davesgonechina commented 2 years ago

I'm thinking about comparing the SQL written by data analysts and presenting the similarity results in a familiar form, e.g. a table.

platisd commented 2 years ago

Are you interested in the tool itself or the output of the GitHub Action?

davesgonechina commented 2 years ago

Good question! Probably a GitHub Action, so I can dump the CSV to a bucket or another repo and read it as an external table or dbt seed.

platisd commented 2 years ago

A GitHub Action mostly makes sense when you want to automate a process though. To me it sounds like the people interested in a CSV would load it into a database to further investigate/analyze the data, right?

davesgonechina commented 2 years ago

Correct, basically a feedback loop: I run duplicate-code-detection-tool on a repo full of SQL files to produce CSV output, and that CSV becomes a table. The people writing all that SQL can then query the table in SQL to see how DRY (or not) their SQL codebase is as a whole.

platisd commented 2 years ago

I see. Then it sounds like a feature request for a new argument to the Python script to produce a CSV file. I'll take a look. 👍

platisd commented 2 years ago

I am looking into this and am not sure what the CSV file should look like.

Considering the output is like this: sample output

What would be the "columns" of the CSV file? Or would you expect that there's one csv file generated for every source code file?

platisd commented 2 years ago

I can imagine two options:

  1. Either a separate csv output for every source code file
  2. A single csv with three columns [Source code file] [Source code file to check against] [Similarity]. (:warning: There will be a lot of duplicated information this way)

I am not sure which one would be more usable and convenient though. :thinking:
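To make option 2 concrete, here is a minimal sketch of writing one row per file pair. It assumes the similarity scores are available as a nested dictionary (`{source_file: {other_file: score}}`); that shape, the function name `write_similarity_csv` and the example scores are all hypothetical and not taken from the tool's code.

```python
import csv
import io


def write_similarity_csv(similarities, out):
    """Write one CSV row per (source file, compared file) pair."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(
        ["Source code file", "Source code file to check against", "Similarity"]
    )
    for source_file, others in similarities.items():
        for other_file, score in others.items():
            writer.writerow([source_file, other_file, score])


# Hypothetical scores, for illustration only.
example = {
    "a.sql": {"b.sql": 42.0, "c.sql": 3.5},
    "b.sql": {"a.sql": 42.0, "c.sql": 7.25},
}

buffer = io.StringIO()
write_similarity_csv(example, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Each pair shows up once per direction (`a.sql,b.sql` and `b.sql,a.sql`), which is the duplicated information mentioned in option 2.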

davesgonechina commented 2 years ago

That's a good question - I can see cases for both. In my use case, I guess what I ultimately need is a deduplicated #2, but I could easily DISTINCT dupes away since we would make it available as a SQL table.
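One caveat worth noting: a plain `SELECT DISTINCT` only removes rows that are identical in every column, so a mirrored pair such as (A, B) and (B, A) would survive it. A rough sketch of dropping the mirrored rows before loading, assuming the score is the same in both directions (the helper name `drop_mirrored_pairs` and the rows are hypothetical):

```python
def drop_mirrored_pairs(rows):
    """Keep the first row seen for each unordered pair of files."""
    seen = set()
    unique = []
    for file_a, file_b, similarity in rows:
        # frozenset ignores order, so (A, B) and (B, A) share one key.
        key = frozenset((file_a, file_b))
        if key in seen:
            continue
        seen.add(key)
        unique.append((file_a, file_b, similarity))
    return unique


# Hypothetical rows, for illustration only.
rows = [
    ("a.sql", "b.sql", 42.0),
    ("a.sql", "c.sql", 3.5),
    ("b.sql", "a.sql", 42.0),
    ("b.sql", "c.sql", 7.25),
]
deduplicated = drop_mirrored_pairs(rows)
print(deduplicated)
```

If the two directions could ever report different scores, keeping the first row would silently discard one of them, so that assumption should be checked first.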

platisd commented 2 years ago

I can go with that for now :) :+1:

platisd commented 2 years ago

@davesgonechina what do you think of #16?

When running the tool for the smartcar_shield project, I get an output.csv file that looks like this:

    File A,File B,Similarity
    src/car/smart/SmartCar.cpp,src/Smartcar.h,5.74
    src/car/smart/SmartCar.cpp,src/sensors/distance/ultrasound/ping/SR04.cpp,0.39
    src/car/smart/SmartCar.cpp,src/motor/digital/servo/ServoMotor.cpp,0.05
    src/car/smart/SmartCar.cpp,src/car/distance/DistanceCar.cpp,17.91
    src/car/smart/SmartCar.cpp,src/sensors/distance/infrared/analog/InfraredAnalogSensor.cpp,0.94
    src/car/smart/SmartCar.cpp,src/sensors/heading/gyroscope/GY50.cpp,0.47
    src/car/smart/SmartCar.cpp,src/car/heading/HeadingCar.cpp,57.91
    src/car/smart/SmartCar.cpp,src/car/simple/SimpleCar.cpp,15.69
    src/car/smart/SmartCar.cpp,src/runtime/arduino_runtime/ArduinoRuntime.cpp,0.08
    src/car/smart/SmartCar.cpp,src/sensors/odometer/interrupt/DirectionalOdometer.cpp,1.15
    src/car/smart/SmartCar.cpp,src/sensors/distance/infrared/analog/sharp/GP2Y0A02.cpp,1.08
    src/car/smart/SmartCar.cpp,src/sensors/odometer/interrupt/DirectionlessOdometer.cpp,0.75
    src/car/smart/SmartCar.cpp,src/control/differential/DifferentialControl.cpp,1.27
    src/car/smart/SmartCar.cpp,src/sensors/distance/infrared/analog/sharp/GP2D120.cpp,1.07
    src/car/smart/SmartCar.cpp,src/sensors/distance/ultrasound/i2c/SRF08.cpp,0.21
    src/car/smart/SmartCar.cpp,src/sensors/distance/infrared/analog/sharp/GP2Y0A21.cpp,1.03
    src/car/smart/SmartCar.cpp,src/control/ackerman/AckermanControl.cpp,0.18
    src/car/smart/SmartCar.cpp,src/motor/analog/pwm/BrushedMotor.cpp,0.74
    src/Smartcar.h,src/car/smart/SmartCar.cpp,5.74
    src/Smartcar.h,src/sensors/distance/ultrasound/ping/SR04.cpp,4.58
    src/Smartcar.h,src/motor/digital/servo/ServoMotor.cpp,2.2
    src/Smartcar.h,src/car/distance/DistanceCar.cpp,13.05
    src/Smartcar.h,src/sensors/distance/infrared/analog/InfraredAnalogSensor.cpp,1.32
    src/Smartcar.h,src/sensors/heading/gyroscope/GY50.cpp,4.92

The command I ran was:

    python3 projects/duplicate-code-detection-tool/duplicate_code_detection.py -d src/ --project-root-dir projects/smartcar_shield --csv-output output.csv
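For the "query it in SQL" use case, a CSV in this shape could be loaded into a table roughly like this. This is only a sketch using Python's built-in `sqlite3` and `csv` modules: the table name `file_similarity`, its column names and the threshold are made up for illustration, and the rows are a handful copied from the output above (in practice you would read `output.csv` instead of the inline string).

```python
import csv
import io
import sqlite3

# A few rows copied from the output above; replace with open("output.csv").
sample_csv = """File A,File B,Similarity
src/car/smart/SmartCar.cpp,src/Smartcar.h,5.74
src/car/smart/SmartCar.cpp,src/car/distance/DistanceCar.cpp,17.91
src/car/smart/SmartCar.cpp,src/car/heading/HeadingCar.cpp,57.91
src/car/smart/SmartCar.cpp,src/car/simple/SimpleCar.cpp,15.69
"""

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE file_similarity (file_a TEXT, file_b TEXT, similarity REAL)"
)

reader = csv.DictReader(io.StringIO(sample_csv))
connection.executemany(
    "INSERT INTO file_similarity VALUES (?, ?, ?)",
    (
        (row["File A"], row["File B"], float(row["Similarity"]))
        for row in reader
    ),
)

threshold = 15.0  # arbitrary cut-off, chosen for illustration
similar_pairs = connection.execute(
    "SELECT file_a, file_b, similarity FROM file_similarity "
    "WHERE similarity >= ? ORDER BY similarity DESC",
    (threshold,),
).fetchall()

for file_a, file_b, similarity in similar_pairs:
    print(file_a, file_b, similarity)
```

The same table definition should carry over to an external table or dbt seed with little change, since the CSV has a header row and one pair per line.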

davesgonechina commented 2 years ago

Not bad! Thanks!