Open RoyArends opened 4 years ago
@RoyArends There is a fundamental problem there. If the recursive resolver does QName minimization, the root will see only "BOUYGUESTELECOM", even when the client looked up "BT1SVWQM.NOE.BOUYGUESTELECOM". With your suggestion, the client's requests for BOUYGUESTELECOM would be split into two different buckets, depending on whether the resolver minimizes.
My first suggestion would be to investigate how big the problem really is. We can do that by dumping out all the NXDOMAIN name targets in which the name meets the DGA classification, and then see how many have multipart names. Count that, and see how much of the DGA total that is. If it is just a tiny fraction, then there is not much to worry about.
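As a rough illustration of that first measurement, here is a minimal Python sketch. It assumes the NXDOMAIN names are available as plain strings, and it uses the "unclassified TLD of length 7 to 15" rule as a stand-in for the DGA classification. The helper names `tld_of`, `looks_like_dga` and `multipart_fraction` are hypothetical and are not part of ithitools.

```python
def tld_of(name):
    """Return the rightmost label of a domain name, ignoring a trailing dot."""
    return name.rstrip(".").rsplit(".", 1)[-1]


def looks_like_dga(tld):
    # Assumption: stand-in for the real classifier, using only the
    # "length between 7 and 15" rule.
    return 7 <= len(tld) <= 15


def multipart_fraction(names):
    """Fraction of DGA-classified names that have more than one label."""
    dga = [n for n in names if looks_like_dga(tld_of(n))]
    if not dga:
        return 0.0
    multipart = sum(1 for n in dga if "." in n.rstrip("."))
    return multipart / len(dga)


sample = [
    "BT1SVWQM.NOE.BOUYGUESTELECOM",  # multipart, TLD length 15
    "QWERTYUIOP",                    # single label, length 10
    "ABCDEFGH",                      # single label, length 8
    "WWW.COM",                       # TLD too short to count as DGA
]
fraction = multipart_fraction(sample)
```

If `fraction` stays very small on real captures, the multipart names are probably not worth special handling.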
If the fraction is significant, then we need a secondary analysis. I would count the number of occurrences of each TLD in the "multipart DGA" category. If some TLD is used with significant frequency, we can add it to a list of "TLD that should not be mistaken for DGA", and use that list as part of the DGA classification.
If I remember correctly, BOUYGUESTELECOM used to be a registered TLD. Maybe, as a precaution, we could add all these "formerly registered TLD" to the special case list.
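The secondary analysis could look something like the sketch below. It counts TLDs among multipart names whose TLD length falls in the 7 to 15 range, keeps those seen at least `min_count` times, and merges in a seed list such as formerly registered TLDs. The function name, the threshold parameter and the seed contents are assumptions made for illustration only.

```python
from collections import Counter


def not_dga_tlds(names, min_count, seed=()):
    """Return TLDs that should not be mistaken for DGA names."""
    counts = Counter()
    for name in names:
        labels = name.rstrip(".").upper().split(".")
        tld = labels[-1]
        # Only multipart names whose TLD would otherwise count as "dga".
        if len(labels) > 1 and 7 <= len(tld) <= 15:
            counts[tld] += 1
    frequent = {tld for tld, count in counts.items() if count >= min_count}
    # The seed holds names added by hand, e.g. formerly registered TLDs.
    return sorted(frequent | {s.upper() for s in seed})


sample = ["A.INTERNE", "B.INTERNE", "C.D.INTERNE", "X.QWERTYUIOP"]
exclusions = not_dga_tlds(sample, min_count=2, seed=["bouyguestelecom"])
```

The right value for `min_count` would have to come from the data; nothing in this thread fixes it.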
@RoyArends I am looking into this issue, and there is a tension between more precise accounting and compatibility with historic series. The current algorithm can be summarized as:
On capture, for all NXDOMAIN queries:
  If the TLD name matches one of "special use names", classify as RFC6761 (metric M3.3.1);
  Else if the TLD name belongs to the "most frequent" list, classify as frequent (M3.3.2);
  Else if the TLD matches one of the special patterns (numeric, ipv4, bad syntax...), list in the corresponding category;
  Else classify as "size(length)".
When computing metrics:
  Look at the most frequent patterns among length_N, numeric, IPv4, etc.
  If the pattern is seen often enough: display and count as part of metric M3.3.3,
  else do not display, summarize as part of M3.3.4.
When performing sum_m3 analysis:
  For all "length" patterns:
    If the length is between 7 and 15, count as "dga"
    If length is 16 or larger, count as "jumbo"
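The capture and sum_m3 steps above can be sketched in Python as follows. This is not the ithitools code: the contents of `SPECIAL_USE` and `FREQUENT` are placeholder subsets, the category labels are made up for the example, and only the numeric and bad-syntax patterns are shown (IPv4 and other patterns are omitted).

```python
import re

# Assumption: illustrative subsets, not the lists used by ithitools.
SPECIAL_USE = {"LOCALHOST", "INVALID", "TEST", "EXAMPLE"}
FREQUENT = {"HOME", "CORP", "LAN"}


def classify_tld(tld):
    """Classify the TLD of an NXDOMAIN query, in the order described above."""
    t = tld.upper()
    if t in SPECIAL_USE:
        return "rfc6761"        # metric M3.3.1
    if t in FREQUENT:
        return "frequent"       # metric M3.3.2
    if re.fullmatch(r"[0-9]+", t):
        return "numeric"
    if not re.fullmatch(r"[A-Z0-9-]+", t):
        return "bad_syntax"
    return f"length_{len(t)}"


def sum_m3_bucket(category):
    """Map length_N categories to "dga" (7..15) or "jumbo" (16 or more)."""
    if category.startswith("length_"):
        n = int(category[len("length_"):])
        if 7 <= n <= 15:
            return "dga"
        if n >= 16:
            return "jumbo"
    return category


bucket = sum_m3_bucket(classify_tld("BOUYGUESTELECOM"))
```

Because `classify_tld` only ever sees the TLD, a multipart name such as BT1SVWQM.NOE.BOUYGUESTELECOM lands in the same "dga" bucket as a genuine single-label name.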
To avoid breaking compatibility with the existing statistics, I propose leaving the existing accounting unchanged, but adding two new listings in the summary files:
multi_N
: number of names found with unknown TLD of length N and multiple name parts

alpha_N
: number of single-part names of length N, where the name only includes letters.

alpha_7 to alpha_15 would match the generation algorithm used by Google. This would become the new definition of "dga". We could then compute:
I think that would match expectations, but I would like confirmation.
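A minimal sketch of how the two proposed listings might be tallied is shown below. It assumes the input has already been filtered down to NXDOMAIN names whose TLD fell into the "length" category. The function name `summarize` and the exact key format are hypothetical.

```python
from collections import Counter


def summarize(names):
    """Tally the proposed multi_N and alpha_N listings."""
    multi = Counter()
    alpha = Counter()
    for name in names:
        labels = name.rstrip(".").split(".")
        n = len(labels[-1])
        if len(labels) > 1:
            # Unknown TLD of length N with multiple name parts.
            multi[f"multi_{n}"] += 1
        elif labels[0].isascii() and labels[0].isalpha():
            # Single-part name of length N made only of letters.
            alpha[f"alpha_{n}"] += 1
    return multi, alpha


sample = [
    "QWERTYUIOP",
    "ABCDEFGH",
    "BT1SVWQM.NOE.BOUYGUESTELECOM",
    "GT7TRSFP0.APPLIS.SI.INTERNE",
    "ABC1DEFG",  # single part but contains a digit: counted in neither listing
]
multi, alpha = summarize(sample)
# New "dga" definition from the proposal: the sum of alpha_7 .. alpha_15.
new_dga = sum(alpha[f"alpha_{n}"] for n in range(7, 16))
```

The existing length_N accounting is left alone here, which keeps the historic series comparable.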
Actually, the way the ithitools program is structured, the "leak type" has to be a function of just the TLD. We could easily see some requests for "subdomain.no-such-domain" and others for "no-such-domain". What we can easily do is for each such domain count both the total number of references, and also in parallel the number of references with subdomains, and export all that in the "address and names" report.
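The parallel counting could be sketched as follows, assuming names arrive as plain strings. The function and field names are illustrative and do not reflect the actual "address and names" report format.

```python
from collections import defaultdict


def count_references(names):
    """Per TLD, count all references and those that carry subdomains."""
    stats = defaultdict(lambda: {"total": 0, "with_subdomains": 0})
    for name in names:
        labels = name.rstrip(".").upper().split(".")
        entry = stats[labels[-1]]
        entry["total"] += 1
        if len(labels) > 1:
            entry["with_subdomains"] += 1
    return dict(stats)


report = count_references([
    "subdomain.no-such-domain",
    "no-such-domain",
    "NO-SUCH-DOMAIN",
])
```

Keeping both counters under the same TLD key keeps the leak type a function of the TLD alone, while still exposing how many of the references had subdomains.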
Christian, the chromium DGA will only issue a single label as top level domain. Your code tests for this top level domain, but also classifies top level domains with subdomains as DGA, such as:
BT1SVWQM.NOE.BOUYGUESTELECOM
GT7TRSFP0.APPLIS.SI.INTERNE
etc.
Maybe we need an additional classification, where one category is DGA (in general), of which Chromium DGA is a subset.
Roy