open-source-ideas / ideas

💡 Looking for inspiration for your next open source project? Or perhaps you've got a brilliant idea you can't wait to share with others? Open Source Ideas is a community built specifically for this! 👋

Language-agnostic open source library fingerprinting solution #258

Open KOLANICH opened 3 years ago

KOLANICH commented 3 years ago

Project description

There are a lot of libraries and a lot of software using them. Sometimes known open-source libs used in software are slightly customized. We want to know which libraries were used, which versions of them, and which pieces of the code were changed.

The usual approach for this is extracting some features from code (control flow graph, signatures) and matching them against the database.
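As a rough illustration of that extract-and-match idea, here is a minimal sketch for Python sources only. It assumes a fingerprint is just a hash of the sequence of AST node types in each function, and that the database is a plain dict from hash to `(library, version)` entries. `function_fingerprints`, `match`, and the database layout are hypothetical names made up for this example, not taken from any existing tool.

```python
import ast
import hashlib


def function_fingerprints(source):
    """Map each function name to a hash of its AST node-type sequence."""
    fingerprints = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Keep only node types, so identifiers and literal values
            # do not influence the fingerprint.
            shape = " ".join(type(n).__name__ for n in ast.walk(node))
            fingerprints[node.name] = hashlib.sha256(shape.encode()).hexdigest()
    return fingerprints


def match(source, database):
    """Count, per (library, version) entry, how many functions matched.

    `database` is a hypothetical dict: fingerprint -> list of entries.
    """
    hits = {}
    for digest in function_fingerprints(source).values():
        for entry in database.get(digest, []):
            hits[entry] = hits.get(entry, 0) + 1
    return hits
```

Because only node types are hashed, a function whose variables were renamed still maps to the same fingerprint. Any structural change produces a different hash, so exact matching like this is only a first step.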

There are some free open-source solutions for that. Unfortunately, almost all of them target Java on Android, since that platform is heavily affected by library bundling and uses bytecode, which eases feature extraction. We need to abstract the existing solutions enough to allow them to be easily adapted to any programming language for which we can extract an AST (e.g. Python, JavaScript, C++ (using retdec as a decompiler), C# (using any .NET decompiler)).

TheOtterlord commented 3 years ago

Hm, git could be useful here. You could compare the extracted library with the original at the chosen point in time (probably the version release tag). Since git does not care what language you are comparing, you could see the exact differences. Anyone know of any pitfalls that would prevent an implementation like this?
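A minimal sketch of that comparison, assuming both the upstream file (checked out at the release tag) and the embedded copy are available as text. To keep the example self-contained it swaps in Python's `difflib` for `git diff`; the helper name `changed_lines` is hypothetical.

```python
import difflib


def changed_lines(upstream_text, embedded_text):
    """Return unified-diff lines between the upstream file and the embedded copy."""
    return list(difflib.unified_diff(
        upstream_text.splitlines(),
        embedded_text.splitlines(),
        fromfile="upstream",
        tofile="embedded",
        lineterm="",
    ))
```

An empty result means the embedded copy is textually identical to the chosen upstream version. Like git, this works on any language, but only when the source text of both sides is at hand.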

KOLANICH commented 3 years ago

git is useful here only for checking out versions. The problem is detecting the modifications to a known open-source lib without having their source and without any clue about the exact version that was used as a base. I.e. someone used a FOSS lib, embedded it into their own software, but made some modifications, so the software fails when used with the upstream version. Then they used an optimizer, so the decompiled source doesn't strictly match the original one. Most of the original symbol names are lost. The CFG is a bit distorted too, because the optimizer decided it would be a bit faster that way. Some functions are inlined.
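To make the "doesn't strictly match" point concrete, here is a hedged sketch of one possible fuzzy comparison: Jaccard similarity over k-grams of AST node types. It is an illustration on Python sources, not a method from any existing tool; `shingles`, `similarity`, and the default `k=3` are assumptions made for this example. A real solution would work on decompiler output and would also need to cope with inlining.

```python
import ast


def shingles(source, k=3):
    """Set of k-grams over the AST node-type sequence of the source."""
    types = [type(n).__name__ for n in ast.walk(ast.parse(source))]
    return {tuple(types[i:i + k]) for i in range(len(types) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two sources' shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two sources too short to shingle count as identical
    return len(sa & sb) / len(sa | sb)
```

Renamed symbols do not change the score, since names are not part of the shingles, while a small structural change lowers it gradually instead of breaking the match outright. The candidate version with the highest score would then be the best guess for the base.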