google / safebrowsing

Safe Browsing API Go Client

Inconsistency when repeating the same query #87

Closed e3rd closed 5 years ago

e3rd commented 5 years ago

I've been checking a set of 10k URLs multiple times. When I posted the whole set to /v4/threatMatches:find at once, sbserver returned safebrowsing: unexpected server response code: 400 in most cases. However, if I then asked only for the first 1k of the set, it worked; then the first 2k of the set, it worked... and then the whole set at once worked.

To handle this, I chunked the whole set into subsets of 100 items – that mostly worked. Nevertheless, sometimes a few items (about 150) were wrongly indicated as being safe. It seems these are always the same URLs (at least hXXp://www.ww.kuplon.cz/darkovy-poukaz was always one of the erroneously skipped URLs). Since I couldn't identify the trigger, i.e. the reason this small subset is sometimes skipped, I'll rather be experimenting with https://github.com/afilipovich/gglsbl .

I'm posting it here for future reference, but since I don't have any additional information or evidence, I think you may close the issue as a won't-fix.

colonelxc commented 5 years ago

Were those previous tests done with a custom client?

I ask because this client (and, it appears, gglsbl) does not use threatMatches:find (it always uses the local DB plus hash lookups via "/v4/fullHashes:find").

400 errors

If the requests were done with this client, I can guess why requesting the whole list might have failed while requesting it incrementally succeeded. To process a URL, the client canonicalizes it and then creates a bunch of hash prefixes. It checks those against a local DB of hash prefixes. Any and all matches then get sent (just the hash prefix) to the server, which returns the full hashes. Finally, the client compares the full hashes, saves the result in a cache, and returns the result to you.
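
For illustration, here is a rough Python sketch of that flow. This is not the client's actual code: the canonicalization, expression generation, local database, and server call are all simplified stand-ins.

    import hashlib

    def url_expressions(url):
        """Yield simplified host/path expressions for an already-canonicalized URL."""
        host, _, path = url.partition("/")
        yield host + "/"
        if path:
            yield host + "/" + path

    def hash_prefixes(url, prefix_len=4):
        """SHA-256 each expression and keep only the leading bytes (the prefix)."""
        return {hashlib.sha256(e.encode()).digest()[:prefix_len]
                for e in url_expressions(url)}

    def lookup(urls, local_prefix_db, fetch_full_hashes):
        """local_prefix_db: set of hash prefixes from the downloaded threat lists.
        fetch_full_hashes: stand-in for the /v4/fullHashes:find request; returns full digests."""
        candidates = set()
        for url in urls:
            candidates |= hash_prefixes(url) & local_prefix_db
        if not candidates:
            return set()  # nothing matched locally, so no network request is needed
        full_hashes = fetch_full_hashes(candidates)  # only the prefixes go on the wire
        unsafe = set()
        for url in urls:
            for e in url_expressions(url):
                if hashlib.sha256(e.encode()).digest() in full_hashes:
                    unsafe.add(url)
        return unsafe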

There is a limit of 500 hashes per request (I see this is only documented for threatMatches:find right now and is missing from fullHashes:find; I'll get that fixed), so if you pass 10k URLs (which may turn into ~100k hash prefixes), it's possible that more than 500 of them match the local hash-prefix database.
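
Rough numbers, just to illustrate (the per-URL expression count and the local match rate are assumptions, not measurements):

    urls = 10_000
    expressions_per_url = 10     # assumed average; the real count depends on host/path structure
    local_match_rate = 0.01      # assumed fraction of prefixes present in the local database
    prefixes = urls * expressions_per_url          # ~100,000 hash prefixes
    sent_to_server = prefixes * local_match_rate   # ~1,000 prefixes, over the 500 limit
    print(prefixes, int(sent_to_server))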

So why does incrementally growing your set work? Well, it seems that each 1k of URLs was not hitting that 500-hash-prefix limit (remember, only the ones that match the local database are sent, so most won't be sent at all). The server tells the client to cache results (typically for 5 minutes). So when you request the first 2k URLs, the first 1k are already known/cached and don't go to the server again. By the time you're requesting the full 10k, the previous 9k are already known/cached, and you manage to stay under the limit.
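
Here is a toy model of that caching effect (the numbers and the i % 17 stand-in for "matches the local database" are purely illustrative):

    cached = set()
    for n in range(1000, 10001, 1000):                    # query 1k, 2k, ..., 10k URLs
        matching = {i for i in range(n) if i % 17 == 0}   # stand-in for local-DB hits
        to_send = matching - cached                       # cached results are not re-sent
        print(n, "URLs ->", len(to_send), "prefixes on the wire (limit: 500)")
        cached |= to_send                                 # responses are cached for ~5 minutes

Each incremental request stays far under the limit, while a cold query of all 10k URLs at once would have to send every matching prefix (here 589, over the 500 limit) in one go.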

We could handle this case better by doing some internal batching. As far as I can tell gglsbl has the same issue. I'll open an issue for them as well.
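
In the meantime, here is a minimal sketch of what such batching could look like, whether inside the client or on the caller's side (request_full_hashes is a hypothetical helper standing in for the actual request code):

    MAX_HASHES_PER_REQUEST = 500

    def fetch_full_hashes_batched(prefixes, request_full_hashes):
        """Split the matched prefixes into groups of at most 500 and merge the responses."""
        prefixes = list(prefixes)
        full_hashes = set()
        for i in range(0, len(prefixes), MAX_HASHES_PER_REQUEST):
            batch = prefixes[i:i + MAX_HASHES_PER_REQUEST]
            full_hashes |= request_full_hashes(batch)
        return full_hashes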

Inconsistent results for a few URLs

This is concerning, and I will look into it. Could you describe the behavior of the inconsistency?

As in: did it start flagging as bad, then switch to safe, then stay safe for multiple checks? (The caching might be involved.) Or did it toggle back and forth regularly? If you can also provide details of how you are performing the calls (code), that would be helpful.

e3rd commented 5 years ago

Thanks for the overall explanation!

All tests were done with the current master of this very client. In the last phase of my testing I had a set of about 1200 URLs that had previously been flagged by Google Safe Browsing.

I repeatedly queried the set, chunked into groups of 100 URLs. At the end of the block I put a Python ipdb breakpoint so that a shell opens. I randomly got Use case 1 or 2; I don't know whether I sometimes waited 5 minutes or not, so caching might be involved. Note that whenever Use case 2 happens, the skipped subset (which includes the URL hXXp://www.ww.kuplon.cz/darkovy-poukaz) seems to be the very same.

All responses come back with HTTP status 200.

Use case 1: all URLs are correctly flagged as not safe.

Use case 2: some URLs (always the same ones) are missing from the matches. The code:

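        # Context: this is a method body. `urls` is the full list of URLs to check,
        # self.client_url points at the locally running sbserver, and `logging` and
        # `requests` are imported at module level. (Python 2, hence xrange.)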
        def chunks(l, n):
            """Yield successive n-sized chunks from l."""
            for i in xrange(0, len(l), n):
                yield l[i:i + n]

        l = set()
        for chunk in chunks(urls, 100):
            data = {"threatInfo":
                        {"threatTypes": ["UNWANTED_SOFTWARE", "MALWARE"],
                         "platformTypes": ["ANY_PLATFORM"],
                         "threatEntryTypes": ["URL"],
                         "threatEntries": [{"url": u} for u in chunk]}}
            resp = requests.post('%(client)s/v4/threatMatches:find' % {"client": self.client_url}, json=data)
            print(resp)
            try:
                for uri in resp.json()["matches"]:
                    l.add(uri["threat"]["url"])

                url = u'http://www.ww.kuplon.cz/darkovy-poukaz'
                if url in chunk and url not in l:
                    print(url)
                    import ipdb; ipdb.set_trace()
                elif url in chunk:
                    print("URL here.")
            except ValueError:  # server denies
                logging.error(resp.text)
                raise
            except KeyError:
                # valid answer - no URL in the list
                pass

        if len(l) != len(urls):
            print("!!!!!!!!!!! ************ !!!!!!!!!!")
            print("Something went wrong")
            print(len(l), len(urls))
            print(set(urls) - l)
            import ipdb; ipdb.set_trace()

        return l

colonelxc commented 5 years ago

Thanks for the example code. I'll take a look and see if I can reproduce reliably. It's a holiday week here, so it might take some time.

Do you see the same inconsistency (modulo client side caching) when using the Go client?

e3rd commented 5 years ago

I fear I don't get the question – this whole GitHub project hosts a Go client. What other Go client do you have in mind? Do you mean whether I tried to rewrite the Python code in Go? (No, I didn't, but since the communication with sbserver is via POST requests, I don't expect anything to change.)

colonelxc commented 5 years ago

Sorry for the confusion. I didn't mean you should write a new client. I was just wondering whether the issue was reproducible with both this Go client and your custom Python script. They use different APIs, which would help narrow down where the issue is.

e3rd commented 5 years ago

I still fear I didn't get it.

They use different APIs

My Python script is just calling this Go client. It doesn't use any different API.

colonelxc commented 5 years ago

Ah! I was confused. I thought you were calling our public API with your script. Now I see it wasn't providing an API key, which should have been my clue. I take it you are using the 'sbserver' command.

I'm now able to reproduce the flaw with sbserver, and I think I understand what is causing it. It affects different URLs that share the same expressions.

Example: query badurl.com/a and badurl.com/b. If badurl.com were blacklisted, only one of the two (I think always the last one) would 'match' on the first request, and then both of them would match on subsequent requests (as long as the server still had the results cached).
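
To make the collision concrete, here is a guess at the mechanism in Python (the real client is Go and the details surely differ): if the pending lookups are keyed by hash prefix, two URLs that share an expression collapse into a single entry, and only the last one written gets the match attributed back to it.

    import hashlib

    def host_prefix(url):
        """Both URLs share the host expression "badurl.com/", hence the same 4-byte prefix."""
        host = url.split("/")[0]
        return hashlib.sha256((host + "/").encode()).digest()[:4]

    pending = {}                        # prefix -> URL, last writer wins
    for u in ["badurl.com/a", "badurl.com/b"]:
        pending[host_prefix(u)] = u     # the second URL overwrites the first

    print(pending)                      # only badurl.com/b is still associated with the prefix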

I'm not sure why it would take four requests to get to the point where all results match; I would expect it to happen in two. I'll try to get a fix in this week, then you can test again and see whether you still see any issues.