pkumza / LibRadar

LibRadar - A detecting tool for 3rd-party libraries in Android apps.
Apache License 2.0

Specifications of the .dat file formats #1

Closed by IzzySoft 7 years ago

IzzySoft commented 7 years ago

Could you please provide a description/specification of the file formats of the data/*.dat files, especially what the fields stand for and which of them are mandatory? I have specifications for several libraries that are currently not covered. At the moment I "overlay" that data in my "wrapper" when evaluating the JSON result, but I'd much rather integrate it here (and of course send you a corresponding PR), if I could.

EDIT:

To give you an example, I've identified

{
 "pn": "com/getjar/sdk/vending/billing/",
 "lib": "pa;GetJar Billing;http://www.getjar.com/",
 "dn": 906,
 "cpn": "com/android/"
}

What does cpn stand for? A package Getjar depends on ("called package name")? OK, that's not needed for the .dat file. But for that it seems I'd need the values for bh (B_Hash), btc (B_Total_Call) and btn (B_Total_Number). How to identify those? Or could I simply substitute a 0 for missing values (those 3 are numeric)?

From the couple of test runs done, I could currently contribute 10 library identifications. As I'm planning to use LibRadar a bit longer, more might show up. Would be great if others could benefit from that.

pkumza commented 7 years ago

I'm sorry for not providing these descriptions earlier. At the beginning I didn't want to release LibRadar anywhere other than http://radar.pkuos.org/, but after several weeks the website turned out to be slow and the API I provided for it seemed difficult to use, so I made LibRadar public again on GitHub. In the very beginning, LibRadar returned readable sentences; I changed them into JSON for Node.js, but I forgot to describe what each symbol means after the modification.

I have added a new description to README.md. If you have more questions, please let me know.

Thanks a lot for your suggestion!

IzzySoft commented 7 years ago

Thanks, that fills some gaps! But I'm still confused on adding new entries (my list meanwhile contains more than 50 libraries I could add, some with 10 entries or more). After inserting

                    "bh": self.libs_feature[mid][0],
                    "btn": self.libs_feature[mid][1],
                    "btc": self.libs_feature[mid][2],

before line 319 in detect.py I get all values reported for the "missing/unknown" library. But adding a corresponding line at the end of data/tgst5.dat – in my example,

{"dn": 177, "lib": "da;EBookDroid;https://github.com/mortenpi/ebookdroid/", "sp":"org/ebookdroid/ui/library/tasks/", "bh": 32868, "btc": 22, "btn": 7, "pn":"org/ebookdroid/ui/library/tasks/"}

and running LibRadar again, it's still not recognised. Values I've got reported:

{"dn":177,"p":[],"sp":"org/ebookdroid/ui/library/tasks","cpn":"org/ebookdroid/ui/library/tasks\/","btc":22,"pn":"org/ebookdroid/ui/library/tasks/","btn":7,"bh":32868}

So where am I going wrong here?

_Edit:_ Figured it out: there were dummy entries earlier in the file (with missing values). OK, time for a fork+clone :)
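For anyone debugging a similar miss, a field-by-field comparison of the appended entry and the reported values can be a quick first check. This is only a sketch: it assumes each line of the .dat file is a standalone JSON object (as the example above suggests), and `diff_fields` is a hypothetical helper, not part of LibRadar.

```python
import json

# Entry appended to data/tgst5.dat and the values LibRadar reported,
# both copied verbatim from the comment above.
dat_line = (
    r'{"dn": 177, "lib": "da;EBookDroid;https://github.com/mortenpi/ebookdroid/", '
    r'"sp":"org/ebookdroid/ui/library/tasks/", "bh": 32868, "btc": 22, "btn": 7, '
    r'"pn":"org/ebookdroid/ui/library/tasks/"}'
)
reported_line = (
    r'{"dn":177,"p":[],"sp":"org/ebookdroid/ui/library/tasks",'
    r'"cpn":"org/ebookdroid/ui/library/tasks\/","btc":22,'
    r'"pn":"org/ebookdroid/ui/library/tasks/","btn":7,"bh":32868}'
)


def diff_fields(entry, reported):
    """Hypothetical helper: map each key present in both dicts with
    differing values to the pair (entry_value, reported_value)."""
    shared = sorted(entry.keys() & reported.keys())
    return {k: (entry[k], reported[k]) for k in shared if entry[k] != reported[k]}


entry = json.loads(dat_line)
reported = json.loads(reported_line)

mismatches = diff_fields(entry, reported)
only_in_entry = sorted(entry.keys() - reported.keys())

print("differing fields:", mismatches)
print("fields only in the .dat entry:", only_in_entry)
```

With these two lines, the numeric fields (dn, bh, btc, btn) and pn agree. Only sp differs, by its trailing slash, and lib is absent from the report. The comparison shows textual differences only; it says nothing about which fields LibRadar actually matches on.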

IzzySoft commented 7 years ago

BTW: While waiting for your response to my PR, I wanted to proceed on my project, which is using PHP. I wrote a "wrapper class" which calls the (unmodified) LibRadar main.py, evaluates the results, and allows "appending" additional values (i.e. where LibRadar had a match but no details). The latter can be done with "exact matches" and also with "wildcard matches" (i.e. specifying "pn":"org/foobar/" would match any org/foobar/* hits). Specifications use a structure similar to your .dat files (just with fewer fields, as it doesn't do any "binary matching").

If you're interested, I could add that (incl. a short description) as separate directory (e.g. php/*) with another PR – once #2 is closed.
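The exact and wildcard matching described above could look roughly like the following. It is a sketch in Python for illustration, not the PHP wrapper itself. The function names, the example overlay for org/foobar/ and the rule that a hit "without details" is one lacking a "lib" field are all assumptions.

```python
def find_overlay(pn, overlays):
    """Return the overlay spec for package path `pn`, or None.

    An exact match on "pn" wins. Otherwise the spec with the longest "pn"
    that is a directory prefix of `pn` is used (the wildcard case).
    """
    best = None
    for spec in overlays:
        prefix = spec["pn"]
        if pn == prefix:
            return spec
        if prefix.endswith("/") and pn.startswith(prefix):
            if best is None or len(prefix) > len(best["pn"]):
                best = spec
    return best


def apply_overlays(hits, overlays):
    """Fill in "lib" for hits that have no details (assumed: no "lib" key)."""
    result = []
    for hit in hits:
        merged = dict(hit)  # copy so the original result stays untouched
        if "lib" not in merged:
            spec = find_overlay(merged["pn"], overlays)
            if spec is not None:
                merged["lib"] = spec["lib"]
        result.append(merged)
    return result


# Hypothetical overlay list; the "lib" value is made up for illustration.
overlays = [{"pn": "org/foobar/", "lib": "da;FooBar;https://example.org/"}]

# Hypothetical hits: one without details, one already identified, one unknown.
hits = [
    {"pn": "org/foobar/net/http/"},
    {"pn": "com/getjar/sdk/vending/billing/",
     "lib": "pa;GetJar Billing;http://www.getjar.com/"},
    {"pn": "org/other/"},
]

annotated = apply_overlays(hits, overlays)
for hit in annotated:
    print(hit["pn"], "->", hit.get("lib", "(unknown)"))
```

Preferring the longest prefix lets a more specific overlay (say, for a sub-package) take precedence over a broad one, and existing identifications are never overwritten.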

pkumza commented 7 years ago

Thanks a lot for your effort; it really helps. You know, it must be tough work to tag so many libraries in the database.

I wrote a script for tagging and filtering the libraries before. Every library would go through a 5-level pipeline before it appeared in tgst5.dat. PS: 'tg' means Tagged and 'st' means Stage. =) However, the code was somewhat ugly and contains some experimental things, so I didn't commit it.

IzzySoft commented 7 years ago

Yeah. I checked each line manually. Not being an Android dev (and hence not having the corresponding environment set up), I couldn't do any "binary checks". What I did was check whether the project I matched had the given module. And I only picked the "obvious ones". Hence some "Spongy Castle" (org/spongycastle/*) entries remained unmatched, as I couldn't find the module (pn) in the project tree.

Thanks for the merge! Do you want to have my PHP library added to your project (I'd then of course maintain that, as I'm using it myself)?

pkumza commented 7 years ago

Hi, Izzy. I'm glad to inform you that LibRadar version 2 is ready. It has a better hashing algorithm, a full and convincing API list, an incremental database, and partial match support, and it analyses apps using dex rather than apktool, etc.

IzzySoft commented 7 years ago

Cool! And good to see you're still on it. Will take a look as soon as time permits (am quite busy with several other projects at the moment). If there are any hints concerning updating from the previous version, and what has changed (so I don't break my stuff), they are welcome – e.g. have there been changes in the output format (the QuickStart.md looks like it switched from JSON to some plain-text format, and no longer contains the permissions), or the way it has to be called. And is redis a (new) hard requirement now – or can LibRadar still be used stand-alone?

Any progress on adding new libraries? Maybe even detecting "possible unknown libraries" and reporting them? You remember: while I've updated a ton of entries, I wasn't able to add new ones – e.g. where I knew a certain library was used by a given app but not reported by LibRadar.

pkumza commented 7 years ago

Very constructive suggestions! Thanks!

I've changed all the code and rebuilt this project to support an "incremental data set", and that's why I use Redis as my database. Now I can add new apps at any time as a source for detecting new libraries. It is feasible to add a certain new library from a given app in this system. However, it has not been implemented so far. I will add this functionality later.

The Redis database is used as a giant hash table. It's an important part of the incremental data set support. I plan to create a Lite version without a database.

JSON and permission support will be added soon.

IzzySoft commented 7 years ago

Thanks for the update! As lack of JSON and permissions would break my current scripts, I'll have to wait at least until that is implemented. Redis ships from the repos as far as I could see (packages redis-server and redis-tools), so it shouldn't be too hard to meet that requirement if I must :) However, with the changed structures it's not yet clear to me how to contribute library updates then. Maybe that will become clear once I've installed the updates.

So fingers crossed and heads-up for your ongoing work!

pkumza commented 7 years ago

Yeah, it shouldn't be hard to install and configure Redis. ^_^