private-octopus / ithitools

please add an additional check for the chromium DGA test #161

Open RoyArends opened 4 years ago

RoyArends commented 4 years ago

Christian, the Chromium DGA only ever issues a single label as the top-level domain. Your code tests for this top-level domain, but it also classifies names that have subdomains under such a top-level domain as DGA, such as:

BT1SVWQM.NOE.BOUYGUESTELECOM
GT7TRSFP0.APPLIS.SI.INTERNE

etc.

Maybe add an additional classification: DGA in general, of which Chromium DGA is a subset.

Roy

huitema commented 4 years ago

@RoyArends There is a fundamental problem there. If the recursive resolver does QName minimization, the root will see "BOUYGUESTELECOM" even in cases where the client looked for "BT1SVWQM.NOE.BOUYGUESTELECOM". With your suggestion, the client's requests for BOUYGUESTELECOM would be split into two different buckets, depending on whether the resolver minimizes or not.
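To make the bucket split concrete, here is a minimal sketch. The function names and the bucket labels `single_label` and `multi_label` are hypothetical, chosen only for illustration; they are not ithitools identifiers.

```python
def qname_seen_at_root(qname: str, minimized: bool) -> str:
    """Return the query name a root server would observe.

    With QName minimization the resolver only sends the TLD label to the root;
    without it, the root sees the full name the client asked for.
    """
    labels = qname.rstrip(".").split(".")
    return labels[-1] if minimized else ".".join(labels)


def proposed_bucket(qname_at_root: str) -> str:
    """Hypothetical split: single-label names versus names with subdomains."""
    return "multi_label" if "." in qname_at_root else "single_label"


query = "BT1SVWQM.NOE.BOUYGUESTELECOM"
for minimized in (False, True):
    seen = qname_seen_at_root(query, minimized)
    print(f"minimized={minimized}: root sees {seen!r} -> {proposed_bucket(seen)}")
```

The same client query lands in a different bucket depending only on the resolver's behavior, which is the accounting problem described above.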

My first suggestion would be to investigate how big the problem really is. We can do that by dumping all the NXDOMAIN query names whose TLD meets the DGA classification, and then counting how many of them are multipart names. Compare that count to the DGA total. If it is just a tiny fraction, then there is not much to worry about.
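A rough sketch of that measurement, assuming the NXDOMAIN names are already available as plain strings. The helper `is_dga_candidate` is a hypothetical stand-in that only applies the 7 to 15 character length rule to the TLD; it ignores the special-use and frequent-name lists that the real classification checks first.

```python
def is_dga_candidate(tld: str) -> bool:
    """Simplified stand-in for the DGA test: TLD length between 7 and 15."""
    return 7 <= len(tld) <= 15


def multipart_share(names):
    """Return (dga_total, dga_multipart, fraction) for a list of query names."""
    dga_total = 0
    dga_multipart = 0
    for name in names:
        labels = name.rstrip(".").split(".")
        if not is_dga_candidate(labels[-1]):
            continue
        dga_total += 1
        if len(labels) > 1:
            # The TLD looks like DGA, but the name has subdomains.
            dga_multipart += 1
    fraction = dga_multipart / dga_total if dga_total else 0.0
    return dga_total, dga_multipart, fraction


sample = [
    "BT1SVWQM.NOE.BOUYGUESTELECOM",
    "GT7TRSFP0.APPLIS.SI.INTERNE",
    "qwertyuiop",
    "example.com",
]
total, multipart, fraction = multipart_share(sample)
print(f"{multipart} of {total} DGA-classified names are multipart ({fraction:.0%})")
```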

If the fraction is significant, then we need a secondary analysis. I would count the number of occurrences of each TLD in the "multipart DGA" category. If we see some TLD used with significant frequency, we can add it to a list of "TLD that should not be mistaken for DGA", and use that list as part of the DGA classification.

If I remember correctly, BOUYGUESTELECOM used to be a registered TLD. Maybe, as a precaution, we could add all these "formerly registered TLD" to the special case list.
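The secondary analysis could look roughly like the sketch below. The function name, the `min_count` threshold, and the `seed` parameter (for pre-loading names such as formerly registered TLDs) are all assumptions made for illustration, and the DGA test is again reduced to the 7 to 15 character length rule.

```python
from collections import Counter


def frequent_multipart_tlds(names, min_count=2, seed=()):
    """Count TLDs among multipart DGA-looking names and derive a 'not DGA' list.

    Returns (counts, not_dga): counts is a Counter of TLD occurrences, not_dga
    is the set of seeded TLDs plus every TLD seen at least min_count times.
    """
    counts = Counter()
    for name in names:
        labels = name.rstrip(".").upper().split(".")
        if len(labels) > 1 and 7 <= len(labels[-1]) <= 15:
            counts[labels[-1]] += 1
    not_dga = {tld.upper() for tld in seed}
    not_dga.update(tld for tld, n in counts.items() if n >= min_count)
    return counts, not_dga


observed = [
    "BT1SVWQM.NOE.BOUYGUESTELECOM",
    "host2.bouyguestelecom",
    "GT7TRSFP0.APPLIS.SI.INTERNE",
    "qwertyuiop",
]
counts, not_dga = frequent_multipart_tlds(observed, min_count=2)
print(counts.most_common())
print(sorted(not_dga))
```

The threshold is the judgment call here: set too low, genuine DGA labels that happen to repeat would be excluded; set too high, names like the ones in this issue would still be counted as DGA.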

huitema commented 4 years ago

@RoyArends I am looking into this issue, and there is a tension between more precise accounting and compatibility with historic series. The current algorithm can be summarized as:

On capture, for all nx domain queries:
    If the TLD name matches one of "special use names", classify as RFC6761 (metric M3.3.1);
    Else if the TLD name belongs to the "most frequent" list, classify as frequent (M3.3.2);
    Else if the TLD matches one of the special patterns (numeric, ipv4, bad syntax...),
        list in the corresponding category;
    Else classify as "size(length)".

When computing metrics:
    Look at the most frequent patterns among length_N, numeric, IPv4, etc.
    If the pattern is seen often enough: display and count as part of metric M3.3.3,
    else do not display, summarize as part of M3.3.4.

When performing sum_m3 analysis:
    For all "length" patterns:
        If the length is between 7 and 15, count as "dga"
        If length is 16 or larger, count as "jumbo"
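
The capture-time classification and the sum_m3 mapping summarized above can be sketched as follows. This is an illustration, not the ithitools code: the two name sets are small placeholder lists, the category strings are hypothetical, and only the numeric and bad-syntax patterns are shown.

```python
import re

# Placeholder lists for illustration only; the real lists are longer.
SPECIAL_USE = {"LOCAL", "LOCALHOST", "INVALID", "TEST", "EXAMPLE", "ONION"}
FREQUENT = {"HOME", "CORP", "LAN"}  # hypothetical "most frequent" names


def classify_tld(tld: str) -> str:
    """Classify the TLD of an NXDOMAIN query, checking rules in priority order."""
    tld = tld.upper()
    if tld in SPECIAL_USE:
        return "rfc6761"       # would feed metric M3.3.1
    if tld in FREQUENT:
        return "frequent"      # would feed metric M3.3.2
    if re.fullmatch(r"[0-9]+", tld):
        return "numeric"
    if not re.fullmatch(r"[A-Z0-9-]+", tld):
        return "bad_syntax"
    return f"length_{len(tld)}"


def sum_m3_bucket(category: str):
    """Map a length_N category to 'dga' or 'jumbo'; other categories give None."""
    if not category.startswith("length_"):
        return None
    n = int(category[len("length_"):])
    if 7 <= n <= 15:
        return "dga"
    if n >= 16:
        return "jumbo"
    return None


for tld in ("local", "BOUYGUESTELECOM", "12345", "averyveryverylongtld"):
    category = classify_tld(tld)
    print(tld, category, sum_m3_bucket(category))
```

Because the length rule is the last fallback, any TLD of 7 to 15 characters that is not on an earlier list ends up in "dga", which is why names under BOUYGUESTELECOM or INTERNE are counted there.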

To avoid breaking compatibility with the existing statistics, I propose leaving the existing accounting unchanged, but adding two new listings to the summary files:

Alpha_7 to Alpha_15 would map to the generation algorithm used by Google. This would become the new definition of "dga". We could then compute:

I think that would match expectations, but I would like confirmation.

huitema commented 4 years ago

Actually, the way the ithitools program is structured, the "leak type" has to be a function of just the TLD. We could easily see some requests for "subdomain.no-such-domain" and others for "no-such-domain". What we can easily do is for each such domain count both the total number of references, and also in parallel the number of references with subdomains, and export all that in the "address and names" report.
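
A minimal sketch of that parallel counting, keyed on the TLD alone. The function name and the dictionary field names `total` and `with_subdomains` are hypothetical and do not reflect the actual report format.

```python
from collections import defaultdict


def count_references(names):
    """For each TLD, count all references and those that carried subdomains."""
    stats = defaultdict(lambda: {"total": 0, "with_subdomains": 0})
    for name in names:
        labels = name.rstrip(".").upper().split(".")
        entry = stats[labels[-1]]  # the key is a function of just the TLD
        entry["total"] += 1
        if len(labels) > 1:
            entry["with_subdomains"] += 1
    return dict(stats)


queries = ["subdomain.no-such-domain", "no-such-domain", "NO-SUCH-DOMAIN."]
for tld, entry in count_references(queries).items():
    print(tld, entry["total"], entry["with_subdomains"])
```

Keeping both counters on the same TLD entry means the leak type stays stable whether or not the resolver minimizes, while the ratio of the two counts still shows how often subdomains were seen.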
