pkumza / LibRadar

LibRadar - A detecting tool for 3rd-party libraries in Android apps.
Apache License 2.0

How to add a library that's not yet detected? #11

Closed IzzySoft closed 7 years ago

IzzySoft commented 7 years ago

I have an .apk where some library is not detected, though I know (from the project, and from decoding it with Apktool) it is there: version 0.04 of qutelauncher added Firebase Analytics – and accordingly, there's com/google/firebase/* in the Smali. I cannot find a match on that in data/tgst5.dat, so it's not reported.

So what would be the necessary steps in such a case? Plus, along those lines, is there a way to scan for undetected libraries one might not be aware of (one does not always know what to expect in some .apk – which is why one uses LibRadar in the first place :)?

IzzySoft commented 7 years ago

One more candidate: AndroidProxySetter uses android-proxy which is not detected.

If you could let me know (or include with the README) how to obtain the values for dn, bh, btc and btn, I (and others) could try to add them to data/tgst5.dat. If additional steps are required for other files (e.g. data/new_dict.dat or permission/tagged_dict.txt), it would be nice to know as well.

pkumza commented 7 years ago

Sorry for the late reply, I have been very busy these days. new_dict.dat is the dictionary of APIs. The calculation is simple, even crude, but easy to use.

for each api:
    b_hash = (b_hash + API_ID * API_NUM) % 999983

999983 is a large prime number. LibRadar uses (b_hash, b_total_num, b_total_call) as an identifier. I am not sure this can be proven mathematically sound, but it works in most cases.

IzzySoft commented 7 years ago

Sorry for the late reply, I have been very busy these days.

Yupp, thought so. We're all having more than one task at hand, so no worries if it takes a few days (a short note when estimates say it might take longer is welcome anytime – but I never expect immediate response for things that are not urgent. It's a hobby – and none of my recorded issues was about a "showstopper bug" ;)

Not being deeply involved with the technical details behind the library definitions, I unfortunately didn't understand your pointers here. Is there a short step-by-step instruction? Something along the lines of:

  1. run apktool d foo.apk
  2. cd to the directory the supposed lib is found in (for com/some/lib that's the directory where com resides)
  3. run foo com/some/lib to calculate bar
  4. ...

I take it in your code snippet, b_hash is what becomes bh. But it's unclear to me where API_ID and API_NUM come from, how b_hash must be initialized before the loop – and what happened to dn, btc and btn (or rather how to obtain their values).

Of course, I could always report my findings on "app uses lib " and have you do the work. But sometimes, I'd rather have that finding listed with the app before I forget what I need to rescan :)

LibRadar uses (b_hash, b_total_num, b_total_call) as an identifier.

Ah yes, that's what bh, btn and btc stand for :) Still would need to know how to find those values, given a single APK using such an "unidentified library" :)

pkumza commented 7 years ago

All right. Here are the specific instructions; I want to keep them simple but clear.

  1. run apktool d foo.apk
  2. cd to the directory the supposed lib is found in (for com/some/lib that's the directory where com resides)
  3. As I said before, new_dict.dat is the mapping table for APIs. Read new_dict.dat into a hash map.
  4. Search every corner for the APIs that appear and build a dict counting each API, e.g. {0:2, 1:4, 3:1}. This dictionary means API0 appears twice, API1 appears four times, API2 never appears and API3 appears once in com/some/lib. In this case, btn is 3 because three distinct APIs appear, and btc is 7 as 2+4+1=7.
  5. b_hash is initialized to 0. Apply the formula b_hash = (b_hash + API_ID * API_NUM) % 999983 for each API to calculate b_hash.

This is the way to calculate bh, btc and btn. dn stands for repetitions, so it is not calculated at this stage, as we don't yet know whether a sub-package is a library. Once I have a lot of (bh, btc, btn) tuples, they can be clustered into groups; dn is the size of a group.

If you want to add a new library to the database, that means you have already made sure that com/some/lib is a library. In this case, we don't need dn; just put bh, btc, btn, the library name and other information into the database.
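
The calculation in steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration of the formula described above, not LibRadar's actual code; the function name `fingerprint` and the constant name `MODULUS` are made up for this example, and the input is assumed to be the dict from step 4 (API_ID mapped to its number of occurrences).

```python
MODULUS = 999983  # the prime quoted in this thread


def fingerprint(api_counts):
    """Return (bh, btn, btc) for a dict mapping API_ID -> occurrence count.

    Hypothetical helper: it only mirrors the formula described in this thread.
    """
    b_hash = 0  # step 5: b_hash starts at 0
    for api_id, api_num in api_counts.items():
        b_hash = (b_hash + api_id * api_num) % MODULUS
    btn = len(api_counts)           # number of distinct APIs that appear
    btc = sum(api_counts.values())  # total number of API occurrences
    return b_hash, btn, btc


# The dict from step 4: API0 twice, API1 four times, API3 once.
print(fingerprint({0: 2, 1: 4, 3: 1}))  # (7, 3, 7)
```

Because addition modulo a fixed number does not depend on order, the result is the same whatever order the APIs are visited in.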

IzzySoft commented 7 years ago

Thanks! But sorry if I might sound dumb: I get it as far as step 3 (which is, if I understand correctly, the first part of permission/tag_dict.py). But I'm lost with step 4: I assume "every corner" applies to the unpacked foo.apk. "for every API" applies to what? To the hash map? And while in step 5 I now have the base for b_hash, I'm still confused concerning API_ID * API_NUM.

As I doubt you've been doing those steps manually for the many libraries already listed: Don't you have some script you've used for that, which I'd run at step 3 and it does 3-5 and then spits out the line to be added (to new_dict.dat I assume)? And in the end, don't I have to run permission/tag_dict.py to tag the new entries? And then, how to get them to data/tgst5.dat?

pkumza commented 7 years ago

Search the bytecode in the Smali files for strings that match an API. For example, say you find the string "Landroid/widget/OverScroller;->getCurrVelocity(" in a Smali file, and it matches {"key": "Landroid/widget/OverScroller;->getCurrVelocity(", "value": 12} in new_dict.dat. That means we add 1 to the count for API_ID 12. If this is the first match for that string, the dict from step 4 is {12:1}. If we later find the string "Landroid/view/ViewGroup;->onKeyDown(", the dict becomes {12:1, 258:1}. If we then find "Landroid/widget/OverScroller;->getCurrVelocity(" again, it becomes {12:2, 258:1}. After the whole package has been scanned, we have the final dict. For the dict {12:2, 258:1}, b_hash equals (0+12*2+258*1)%999983 = 282.
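
A rough Python sketch of this scanning step follows. It is not the script used for LibRadar. It assumes the entries of new_dict.dat have already been loaded into a plain dict from key string to API ID (the file's on-disk format is not shown in this thread), and the regular expression, the function name `count_api_hits` and the sample Smali lines are assumptions made for the illustration.

```python
import re
from collections import Counter

# Assumed shape of a dictionary key inside a Smali line: "Lpkg/Class;->method("
CALL_RE = re.compile(r"L[\w/$]+;->[\w$<>]+\(")


def count_api_hits(smali_text, api_ids):
    """Count how often each known API (by ID) is referenced in smali_text."""
    counts = Counter()
    for call in CALL_RE.findall(smali_text):
        api_id = api_ids.get(call)  # calls missing from the dictionary are ignored
        if api_id is not None:
            counts[api_id] += 1
    return dict(counts)


# The two dictionary entries used in the example above.
api_ids = {
    "Landroid/widget/OverScroller;->getCurrVelocity(": 12,
    "Landroid/view/ViewGroup;->onKeyDown(": 258,
}

# Illustrative Smali lines, not taken from a real app.
smali = """
invoke-virtual {v0}, Landroid/widget/OverScroller;->getCurrVelocity()F
invoke-virtual {v0, v1, v2}, Landroid/view/ViewGroup;->onKeyDown(ILandroid/view/KeyEvent;)Z
invoke-virtual {v0}, Landroid/widget/OverScroller;->getCurrVelocity()F
invoke-virtual {v0, v1}, Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyHost(Ljava/lang/String;)V
"""

counts = count_api_hits(smali, api_ids)
b_hash = sum(api_id * n for api_id, n in counts.items()) % 999983
print(counts, b_hash)
```

The last sample line is a call into a non-system package; since its string is not a key in the dictionary, it does not contribute to the counts.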

PS: Of course I did those steps with scripts, but that was research code, patches upon patches. The scripts also need access to a database, which makes them very hard to use. As adding a new library does not need steps like clustering, there's no need to use those chaotic scripts. In fact, I collected many API candidates, some of which were not actual Android APIs, so I deleted the wrong ones and recalculated the hashes again and again.

IzzySoft commented 7 years ago

Woah. That would mean scanning everything manually, transferring findings manually by copy-pasting (error prone!), calculating manually (hoping not to have missed an entry)... I'd really like to add my findings – but sorry, that's much too time-consuming – especially since results would have to be checked multiple times, still leaving doubt whether one got it right. Some script would be highly appreciated here.

Also a bit unclear is which objects/lines should be counted. E.g. for the proxy example, in Smali I find a lot of strings like

sget-object v0, Lbe/shouldit/proxy/lib/reflection/android/ProxySetting;->NONE:Lbe/shouldit/proxy/lib/reflection/android/ProxySetting;

No opening parenthesis – so not to be matched? But

invoke-virtual {v0, v1}, Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyHost(Ljava/lang/String;)V

would be a valid candidate? So to find all possible candidates for my library "pn": "be/shouldit/proxy/lib", I'd cd into the app's Smali directory (here: tk.elevenk.proxysetter_0.2/smali/tk) and run

grep -hRE "Lbe/shouldit/proxy/lib.+;-.+\(" *

resulting (in my case) in

invoke-virtual {v0, v1}, Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyHost(Ljava/lang/String;)V
invoke-virtual {v0, v1}, Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyPort(Ljava/lang/Integer;)V
invoke-virtual {v0, v1}, Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyExclusionString(Ljava/lang/String;)V
invoke-virtual {v0, v1}, Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxySetting(Lbe/shouldit/proxy/lib/reflection/android/ProxySetting;)V
…

Then I'd need to strip off everything following the opening parenthesis plus everything before the Lbe/, and sort the output, which makes the command

grep -hRE "Lbe/shouldit/proxy/lib.+;-.+\(" * |awk -F "(" '{print $1 "("}' |awk -F "}," '{print $2}' | sort

Resulting lines now look like

Lbe/shouldit/proxy/lib/APL;->disableWifi(
Lbe/shouldit/proxy/lib/APL;->enableWifi(
Lbe/shouldit/proxy/lib/APL;->enableWifi(
Lbe/shouldit/proxy/lib/APL;->getConfiguredNetwork(
Lbe/shouldit/proxy/lib/APL;->getConfiguredNetwork(
Lbe/shouldit/proxy/lib/APL;->getConfiguredNetworks(
Lbe/shouldit/proxy/lib/APL;->getWiFiAPConfiguration(
Lbe/shouldit/proxy/lib/APL;->getWiFiAPConfiguration(
Lbe/shouldit/proxy/lib/APL;->getWifiManager(
Lbe/shouldit/proxy/lib/APL;->getWifiManager(
Lbe/shouldit/proxy/lib/APL;->setup(
Lbe/shouldit/proxy/lib/APL;->writeWifiAPConfig(
Lbe/shouldit/proxy/lib/enums/SecurityType;->equals(
Lbe/shouldit/proxy/lib/enums/SecurityType;->equals(
Lbe/shouldit/proxy/lib/enums/SecurityType;->name(
Lbe/shouldit/proxy/lib/enums/SecurityType;->toString(
Lbe/shouldit/proxy/lib/reflection/android/ProxySetting;->equals(
Lbe/shouldit/proxy/lib/reflection/android/ProxySetting;->equals(
Lbe/shouldit/proxy/lib/WiFiApConfig;->getProxyExclusionList(
Lbe/shouldit/proxy/lib/WiFiApConfig;->getProxyExclusionList(
Lbe/shouldit/proxy/lib/WiFiApConfig;->getProxyHost(
Lbe/shouldit/proxy/lib/WiFiApConfig;->getProxyPort(
Lbe/shouldit/proxy/lib/WiFiApConfig;->getProxySetting(
Lbe/shouldit/proxy/lib/WiFiApConfig;->getSecurityType(
Lbe/shouldit/proxy/lib/WiFiApConfig;->getSSID(
Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyExclusionString(
Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyHost(
Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyPort(
Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxySetting(

which fits your description and already makes counting easier. I've already checked with new_dict.dat that this library isn't there yet. So again I'm stuck as to where to get the value from: take the last line of new_dict.dat and increase its value by 1? So after the first 3 lines above, my dict would look like {99850:1, 99851:2} – and what then? What needs to be added to new_dict.dat (most likely {"key": "Lbe/shouldit/proxy/lib/APL;->disableWifi(", "value": 99850} etc.), where does the value behind the colon go, and how does the entire match get into tgst5.dat?

Maybe it's easier if I instead submit the results of the last mentioned command (with some describing details), and you continue from there (as you've got routine to do that)? Once it is in tgst5.dat, I could proceed adding the missing details (as I did with all the other libs).
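
For anyone who would rather tally the calls than count sorted lines by eye, the grep/awk pipeline above can be approximated in Python. This is a minimal sketch under a few assumptions: the function name `tally_calls` is made up, the Smali lines are passed in as a list of strings (reading the files from disk is left out), and only the first matching call per line is counted. The regular expression is stricter than the grep pattern, so results may differ slightly from the shell output.

```python
import re
from collections import Counter


def tally_calls(lines, package):
    """Count method-call strings like 'L<package>/...;->name(' found in lines.

    `package` is a slash-separated prefix such as "be/shouldit/proxy/lib".
    """
    pattern = re.compile(r"L" + re.escape(package) + r"[\w/$]*;->[\w$<>]+\(")
    tally = Counter()
    for line in lines:
        match = pattern.search(line)  # first call on the line only
        if match:
            tally[match.group(0)] += 1
    return tally


# Sample input modelled on the lines quoted above (the duplicate is illustrative).
sample = [
    "invoke-virtual {v0, v1}, Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyHost(Ljava/lang/String;)V",
    "invoke-virtual {v0, v1}, Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyHost(Ljava/lang/String;)V",
    "invoke-virtual {v0, v1}, Lbe/shouldit/proxy/lib/WiFiApConfig;->setProxyPort(Ljava/lang/Integer;)V",
    "sget-object v0, Lbe/shouldit/proxy/lib/reflection/android/ProxySetting;->NONE:Lbe/shouldit/proxy/lib/reflection/android/ProxySetting;",
]

result = tally_calls(sample, "be/shouldit/proxy/lib")
for call, n in sorted(result.items()):
    print(n, call)
```

The sget-object line is skipped because a field access has no opening parenthesis after the member name, which matches the distinction drawn earlier in this comment.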

pkumza commented 7 years ago

Yes. I did this with a Python script and also used regexes, which makes it somewhat automatic. However adding a new library is still difficult. I would appreciate it if you could give me a list of libraries that do not appear in my database.

The fact that "Lbe/shouldit/proxy/lib/APL;->disableWifi(" is a method the code calls does not mean that this string is an Android system API. In my opinion, it comes from the code from another package and should not be recognized as an API. Such names are very easy to obfuscate, so I didn't put methods like this into new_dict.dat. Most Android system APIs begin with "Landroid".

By the way, I will update the whole project and plan to add automatic database updates later this year (for my graduation project). More detailed information and functionality will be added.

IzzySoft commented 7 years ago

However adding a new library is still difficult.

I definitely agree :)

I would appreciate it if you could give me a list of libraries that do not appear in my database.

Whenever I find any. Until now, that's the two mentioned above:

In my opinion, it comes from the code from another package

I doubt that, but I might be wrong: I've limited my grep to just the application package directory itself (i.e. I did a cd tk.elevenk.proxysetter_0.2/smali/tk first), to avoid having the library's own "inner calls" recorded along.

By the way, I will update the whole project and plan to add automatic database updates later this year (for my graduation project). More detailed information and functionality will be added.

That sounds great! Fingers already crossed for your graduation!

pkumza commented 7 years ago

I will try to add this functionality before 4th March. Once it is implemented, I'll send you a notification on Twitter. ^_^

IzzySoft commented 7 years ago

Uh. Had you not closed this, I'd have said you could simply close it when done, so that I get a notification from GitHub…

pkumza commented 7 years ago

All right, issue reopened. ╮(╯▽╰)╭

pkumza commented 7 years ago

I've added the function. This week I will write some test cases to make sure it works and add documentation on how to add a new lib.

IzzySoft commented 7 years ago

Thanks! Looks like soon it's time I try the new version then. OTOH, seeing the install instructions, I'm afraid it won't be that soon; it has too many dependencies. I prefer if things either come straight from the repositories, or run "out of their directory". Having to install self-compiled stuff via "make install" (Redis 3.2, as the repos only hold Redis 3.0) plus things via pip/pypi (which itself isn't even installed on my machine) is not my first choice ;)

Does it already return JSON the way it did before? Or did the format change?