pkumza / LibRadar

LibRadar - A detecting tool for 3rd-party libraries in Android apps.
Apache License 2.0
258 stars, 51 forks

Source of data/tgst5.dat #15

Closed by bichselb 7 years ago

bichselb commented 7 years ago

This is not really an issue, but rather a question. I hope you don't mind if I ask it here.

For a project of mine, I am interested in knowing which libraries come with which package names, and possibly also some meta-information on these libraries. I noticed you have an awesome and fairly complete list in data/tgst5.dat. How did you obtain it? I am assuming you have some high-quality source that you parsed?

pkumza commented 7 years ago

Good question. LibRadar can be divided into two parts: a 'clustering' part and an 'instant detection' part. I downloaded more than 1 million apps and extracted static features from them. Then I clustered them into groups, recorded the groups that have more than 1000 items, and treated those as libraries. Some volunteers and I tagged some of these items, which is how we got tgst5.dat. You can refer to LibRadar - ICSE 2016 for more details. I don't think anyone would want to use the 'clustering' code to redo this work, because creating the data took a dozen servers a month. Also, the code I used is ugly, just patches on top of patches = _ =. Therefore, I released only the instant detection part on my GitHub as LibRadar. I hope that's enough for users.
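A rough sketch of the "group, count, threshold" idea described above, for readers who want something concrete. This is not the LibRadar code. It assumes each package's static features can be reduced to a set of API call names, and it uses an exact-match hash as a simple stand-in for the clustering step. The names `feature_hash` and `find_library_candidates` and the input layout are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


def feature_hash(api_calls):
    """Hash a package's static features (assumed here: a set of API call names)."""
    joined = "\n".join(sorted(set(api_calls)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_library_candidates(apps, min_items=1000):
    """Group packages with identical features and keep the large groups.

    apps: iterable of dicts mapping package name -> iterable of API call names.
    Returns a dict mapping feature hash -> (group size, most common package name).
    Only groups with more than min_items members are kept.
    """
    group_sizes = Counter()
    names_by_group = defaultdict(Counter)
    for app in apps:
        for package, api_calls in app.items():
            h = feature_hash(api_calls)
            group_sizes[h] += 1
            names_by_group[h][package] += 1
    return {
        h: (size, names_by_group[h].most_common(1)[0][0])
        for h, size in group_sizes.items()
        if size > min_items
    }


if __name__ == "__main__":
    # Toy corpus: three apps share one package with the same features.
    toy_apps = [
        {"com/example/ads": ["WebView.loadUrl", "Socket.connect"], "com/app1": ["x"]},
        {"com/example/ads": ["Socket.connect", "WebView.loadUrl"], "com/app2": ["y"]},
        {"com/example/ads": ["WebView.loadUrl", "Socket.connect"], "com/app3": ["z"]},
    ]
    # A tiny threshold is used so the toy data yields a result.
    for size, name in find_library_candidates(toy_apps, min_items=2).values():
        print(size, name)
```

The surviving groups would still need the manual tagging step mentioned above to turn package names into library labels. Exact-match hashing also misses obfuscated or slightly modified copies, which is one reason a real pipeline would need more than this.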

bichselb commented 7 years ago

Ok. Thank you so much for your fast reply!

jevinskie commented 7 years ago

I just found your awesome project this week. Separating first party and third party code is exactly what I have been looking for! My previous whitelist approach, as your paper clearly shows, is a losing approach.

I am very interested in the clustering code, however unpolished, since it would allow me and others to extend and maintain the instant detection database. It would be very cool if there were a way to capture the manual part of the tagging under version control so it can be reused, extended, and updated in a collaborative fashion. At least for my use case, dozens of servers are not a disqualifying requirement.

pkumza commented 7 years ago

@jevinskie Glad to hear that. The project https://github.com/pkumza/lib-detector is what generates raw_data, though it is unpolished and difficult to use. In the dev branch of https://github.com/pkumza/LibRadar, I used 5 steps to filter and tag raw_data into tgst5.dat.

By the way, the APK files I used to generate the data are getting old, so this approach is losing coverage too. Therefore, I am trying to create a new version of LibRadar that updates the data automatically as I put new apps into this machine.