Closed jfoug closed 7 years ago
I don't know, for example:
A0`01` orig: ^1 ^0
A0`01` orig: ^1^0
This would be invalid in hashcat.
That was my own tool's output (it is not in the rule file).
In the rule file there are these 2 lines
^1 ^0
^1^0
They do exactly the same thing; it is 100% the same rule.
Some of the other instances are not as apparent.
ss5 sa@ se3 si1 so0
ss5 sa@ se3 so0 si1
ss5 sa@ si1 se3 so0
ss5 sa@ si1 so0 se3
ss5 sa@ so0 se3 si1
ss5 sa@ so0 si1 se3
ss5 se3 sa@ si1 so0
ss5 se3 sa@ so0 si1
ss5 se3 si1 sa@ so0
ss5 se3 si1 so0 sa@
ss5 se3 so0 sa@ si1
ss5 se3 so0 si1 sa@
ss5 si1 sa@ se3 so0
ss5 si1 sa@ so0 se3
ss5 si1 se3 sa@ so0
ss5 si1 se3 so0 sa@
ss5 si1 so0 sa@ se3
ss5 si1 so0 se3 sa@
ss5 so0 sa@ se3 si1
ss5 so0 sa@ si1 se3
ss5 so0 se3 sa@ si1
ss5 so0 se3 si1 sa@
ss5 so0 si1 sa@ se3
ss5 so0 si1 se3 sa@
All of those lines are in the file, and there is only 1 rule. The rest are duplicates.
There are literally thousands of such lines (possibly tens of thousands). They are absolutely wasted cycles when used.
One thing that would clean up a lot of them is to first run the rules through space-removal code into a temp file (preserving line order), then do an in-place file unique (not a sort | uniq). Then, using the original space-trimmed file and the in-place-uniqued file, a script can determine which lines need to be removed from the original file and remove them (I assume people would want to keep the easier-to-read .rule file with spaces separating each rule primitive).
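As a sketch (my own, not the actual script from this thread), the order-preserving dedupe over space-stripped keys might look like this in Python. Note it uses the naive strip-every-space normalization; as discussed later in the thread, that wrongly merges rules like ^ } and ^} where the space is itself a command argument:

```python
def dedupe_rules(lines):
    """Keep only the first occurrence of each rule, comparing rules
    with all spaces stripped. Naive: a space can be a real argument
    to commands like ^ and $, so a proper tokenizer is needed to be
    fully safe."""
    seen = set()
    kept = []
    for line in lines:
        key = line.replace(" ", "")
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

print(dedupe_rules(["^1 ^0", "^1^0", "se3 so0"]))  # ['^1 ^0', 'se3 so0']
```

Preserving line order and keeping the first occurrence is what distinguishes this from a plain sort | uniq.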
Once that clean(er) file is made, the search for the harder-to-find dupes can be done (such as the leet duplicates). There are many others which can be found without much problem. Those are:
[ and ]
^ and $
in the same rule

It will be nearly impossible to find all duplicates, especially in randomly generated rule sets, but a large number of them can be found. Some duplicates are very hard to spot, since they use different operations to do the same thing, OR have some operations which are no-ops, thus hiding them. Simple examples such as
$ ]xA3
xA3
$<$k]
$<
These are harder to find, but it can be done. The ones that are almost impossible to find, without running data through the rules engine, are the ones using different operations to do the same thing. I do not have any ready examples from the rules file, but made-up examples would be things such as:
]]
}[}[
}}[[
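To illustrate why those are equivalent, here is a tiny simulator of just the four positional commands involved (a sketch of the documented semantics: [ deletes the first char, ] the last, { rotates left, } rotates right; real hashcat rejects a rule on words that are too short, while this sketch just degrades gracefully):

```python
def apply_rule(rule, word):
    """Apply a rule string of single-character positional commands."""
    for cmd in rule:
        if not word:
            continue
        if cmd == "]":            # delete last character
            word = word[:-1]
        elif cmd == "[":          # delete first character
            word = word[1:]
        elif cmd == "}":          # rotate right: last char moves to front
            word = word[-1] + word[:-1]
        elif cmd == "{":          # rotate left: first char moves to end
            word = word[1:] + word[0]
        else:
            raise ValueError("unsupported command: %r" % cmd)
    return word

# ]], }[}[ and }}[[ all just drop the last two characters:
for w in ("password", "abcd", "hashcat"):
    assert apply_rule("]]", w) == apply_rule("}[}[", w) == apply_rule("}}[[", w)
print(apply_rule("]]", "password"))  # passwo
```

Each rotate-then-delete-front pair is just a slower way to delete the last character, which is why all three rules collapse to one.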
$ wc ../dive.rule dive_dedupe.rule
123289 262282 973248 ../dive.rule
112552 244212 897871 dive_dedupe.rule
235841 506494 1871119 total
These were just the pure, easy-to-find duplicates: rules that differed only in the spaces between commands. This was an 'in-place' removal; only the 2nd (or later) duplicate was removed.
This change in and of itself gives about a 10% overall throughput improvement when using that ruleset. There are still duplicates in there, btw. These were just the easiest to find and remove with automated scripts.
Created PR #304 with the changes (only changes to dive.rule, and ONLY the pure white space identical rules).
But there are over 10k of them
What other types of rule removal does your script support? Did you check other rules files than dive.rule?
Right now, the only thing 'automated' is the finding of extraneous white space. No, I have not run that against other rule files, but I easily could. I imagine that most of the rule files which are accumulations of randomly generated rules will have duplicates in them.
One other thing about dive (and likely others) is that the xNM command likely should be changed to ONM. I do not know if all of them should, but looking at some of them, it is obvious that the logic of the current xNM was not right for the rule to find much. Things like
x42
'8 x42
are almost certainly supposed to be the logic for
O42
'8 O42
The logic behind the x and O rules changed recently; that may explain it.
We just found something that is at least "suspicious". Can you please verify it's correct?
Here's an example: you've removed both of the following rules:
^}
^ }
The thing is that those are two completely different outcomes. The first is just one function call and the second is two function calls. Funny :)
I found about 30 or so like this. I have redone the work, just not pushed the changes. For the 2nd pass, I actually pull entire 'proper' rules based upon length, and any leftover white space is removed. The first run (the one checked in right now) was simply cat x | sed 's/ //g', so yes, what happened above failed to be viewed as distinct, and does need to be reverted.
This shows the changes, and yes, there are rules in dive.rule that need to be put back, since the spaces removed WERE part of valid commands (mostly $ or ^ rules)
https://github.com/jfoug/oclHashcat/commit/7cb4ab0314444420826f9e262aba72d7d2634902
I could do a PR now with what is there, but I was going to look for other dupes. I know there are a lot more in dive.rule, I just do not know how easily they can be 'auto' removed. I certainly do not want to do hand edits for 100's or 1000's of different lines. That would take forever.
Some I have ways to find, I just need to make changes so that I can auto remove them. Those are:
pure 's' commands (order)
cycles in s commands, like sa4si1sa@ (made up), where the sa@ is a no-op
things like $ and ] or ^ and [ which totally offset each other (i.e. become no-ops)
the order of blocks of [ and ] rules, i.e. [] vs ][ or [[[] vs ][]], etc.
I have code to point out the above dupes, but nothing to auto-remove them (yet). I bet there are many more 'types' of duplicate rules, but I have not looked much deeper.
If you want me to, I could push for just the dive.rule line additions to get it corrected, and then wait on the rest until I have more done.
PR #307 adds back 30+ rules into dive.rule which were over-aggressively removed with a brain-dead sed space-removal command. The 2nd time, I used an intelligent space removal based upon the length of rule commands. The patch uploaded was the few items which were erroneously removed the first time. There are some other duplicate removals in that first patch; most (all?) were actually identical lines, not even space duplicates.
Then the 2nd commit is a bunch of harder-to-find dupes. In the commit is a list of the exact lines removed from the dive.rule file. I also listed the canonicalization rules I used. I did things like sort blocks of 's' commands, sort blocks of ^ and $ commands, sort blocks of [ and ] commands, remove pairs of $x] and ^x[ commands, etc. The comment in the commit lists them all. I believe a lot of that duplication comes either from random generation of rules run at different times, OR possibly from someone making rulesets (in the case of s order) and not fully grasping what makes a covering set, or using some script to generate them and having a busted permutation generator.
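One of those sorting passes can be sketched like this (my own approximation, not the PR's actual script): for a rule made up purely of two-argument s commands, the commands commute whenever no replacement character is another command's search character, so sorting them yields a canonical dedupe key. This collapses every ordering of the ss5/sa@/se3/si1/so0 block shown earlier into one rule:

```python
from itertools import permutations

def canonical_key(rule):
    """Canonical key for a rule consisting purely of 3-char sXY commands.
    Only valid when no replacement char is another command's search char
    (no chains or cycles), which is why the reduction pass must run first."""
    cmds = [rule[i:i + 3] for i in range(0, len(rule), 3)]
    assert all(len(c) == 3 and c[0] == "s" for c in cmds)
    return "".join(sorted(cmds))

perms = {"".join(p) for p in permutations(["ss5", "sa@", "se3", "si1", "so0"])}
keys = {canonical_key(r) for r in perms}
print(len(perms), "orderings reduce to", len(keys))  # 120 orderings reduce to 1
```

A rearranged rule should still be re-checked against real data to make sure it remains a duplicate, as noted later in the thread.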
I am sure I will think of other dupe things over time ({ and } pairs are a no-op, etc.). But some of the other dupes (there still are many) are much harder to find. Some of them are not 100% duplicates either, just duplicates for words of certain lengths (such as '5 and p2'5 and d'5).
I'll revert the change on dive.rule regarding the sed issue. But I can't yet merge the new PR because it doesn't take into account that oclHashcat does not support all the functions hashcat (CPU) supports.
Are some of the dupes there due to these issues? That can be handled by changing the code to remove the duplicate which does not work on oclHashcat, instead of simply removing the '2nd seen' duplicate.
Can you point me to documentation on which rules are not oclHashcat supported?
One thing I could do, is to simply provide the script files. There is nothing I am hiding in there.
I do think that no matter what, we need to have separate commits for each type of duplication found.
This way, it is much easier to validate that everything being removed fits the rule properly, and that ALL are really duplicates.
WOW, this is getting complicated :) You've removed the PR, which is fine. Are you going to create a new one?
Yes. But I will do this step by step, since things are MUCH easier to see if only one type of item is done at a time.
By far the WORST offender is the errant-space duplicate problem in dive.rule. That one alone is huge. The others do exist, but are not nearly as bad. Some of those, however, are much more complex to verify as correct.
dive.rule was created automatically
It is full of good rules, but is at least 15% slower (due to duplicates), than it should be.
The first is 100% 'pure' duplicate lines (i.e. lines that are already 100% the same as some prior line).
The 2nd is duplicates found by removal of superfluous white space (space characters). This was done by first tokenizing the rules themselves into proper commands of the right length, then removing all extra white space. This logic replaces the faulty cat xxx.rule | sed 's/ //g' > xxx_missing_ws.rule logic, which removed about 40 rules that were not duplicates.
Within comments in the PR is all the scripting code used to find these duplicates and auto-remove them from the rule files.
I will make no further PR about dupes, until these have been fully accepted. I do not foresee any problem here, since these really are 100% 'pure' duplicate rules.
Once everything is happy here, I will start removing other duplicates, but only 1 'rule type' at a time, so that it is easier to validate only correct duplicate rule lines are removed.
OK, merged
What's the next step?
I am going to work on it. But I want to do it a step at a time, so we can more easily validate that the changes ARE valid. There have been some missteps, and finding them is not always easy when there are multiple types of duplication that are much more complex than what the diff program can find ;)
The next thing I am looking for is s 'reductions'. This can remove some duplicates, but will also change some rules.
The 'reductions' are like this:
sabsbc becomes sbc
sabsac becomes sab
This will create some duplicates, and also reduces the number of operations in some rules. This is required to be done before I start to reorder the s commands. That reorder finds many duplicates, but if these types of no-ops and cycles are present, then the reorder is not valid.
Unfortunately these are having to be dealt with by hand. I do not have an 'auto' fixer tool for these. I can find them however. These are the rules in dive.rule that I have found:
R0se5s5& turned into R0se&
s.hx13sh< turned into s.<x13
slOsOZ turned into slZ
sk@s@{oB8 turned into sk{oB8
+8sPMsMW turned into +8sPW
sYfsfut turned into sYut
sTDo5OsT* turned into sTDo5O
so!$=s!= turned into so=$=
so!s!=$= turned into so=$=
oA[s2dsdX turned into oA[s2X
s9Cs9W turned into s9C
sDJ*7AsD@ turned into sDJ*7A
s7ys7) turned into s7y
sDk*20sDF turned into sDk*20
s^6s6I turned into s^I
siNsNk turned into sik
sOusOG turned into sOu
sfusue turned into sfe
sZEsEd turned into sZd
sXcs^Us^` turned into sXcs^U
y2s|asaR turned into y2s|R
sVusV*o1{ turned into sVuo1{
syMsyP turned into syM
saGsGE turned into saE
I'm a bit confused, do you plan to send in another PR?
Yes I do, but I sent you the big change off-list. I wanted to hear from you first, since this was a set of MUCH HARDER to spot duplication. Some are trivial to see, some not so much. I just thought you might want to run some testing yourself before we accept this PR.
But I can get a PR done, no problem.
Btw, that PR will be the last for dive (unless we go through and look for and remove no-op commands from some rules to speed them up). Once that PR is done, there are NO more duplicates in dive. The method of running data will flush out all of them.
While looking at some of the optimizations, here's one type that I find a bit suspicious:
sZEsEd turned into sZd
Now that looks fine at first glance, but if you have a word in your wordlist which contains 'E', the result of the optimized rule is different from the original one.
Anyway, I think this would be an edge case and I think I'll merge anyway. I'll think about it.
I see your point, you are 100% correct.
ZERO sZEsEd == ddRO
ZERO sZd == dERO
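That counterexample is easy to reproduce with a two-line substitute simulator (my own sketch of the sXY semantics: replace every X with Y):

```python
def apply_subs(rule, word):
    """Apply a chain of 3-character sXY commands, left to right."""
    for i in range(0, len(rule), 3):
        cmd = rule[i:i + 3]
        assert cmd[0] == "s"
        word = word.replace(cmd[1], cmd[2])
    return word

# The chain-reduced rule only matches when the word has no 'E' of its own:
print(apply_subs("sZEsEd", "ZERO"))  # ddRO  (Z -> E gives EERO, then E -> d)
print(apply_subs("sZd", "ZERO"))     # dERO  (the pre-existing E survives)
```

So the reduction sZEsEd -> sZd is only safe for words that contain no 'E' before the rule runs, which is exactly the edge case described above.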
That certainly is different. That 'chain' optimization was one I 'saw' by hand, and then coded for it. We certainly can undo them. There were some, but not a huge amount. They should have all been in one of the earlier commits, so they should be easy to undo, and I certainly can get them back in.
You are not seeing these in the current dupe set? I would think not... With the data provided, I would find it hard to see how anything like that would not show up as different.
One thing I 'should' have done with the last set of duplicates is to search each block that was all the same, and choose the first one with the smallest number of operations. That would have removed more of the 'non-optimal' rules. A lot of them early on were optimal (some just 1 command). But I am pretty sure that some of them were not the best of the group. The program I wrote simply took the first rule of the group (i.e. the earliest line number at which that 'type' was found).
If you have not merged yet, I 'could' redo the commit, and change the logic to pick the rule with fewest commands.
I see you have not merged that yet, so if you want me to, I can certainly change the PR to behave that way. It would make better, slightly quicker rules.
OK, please undo that specific undupe logic.
Do you plan to apply the fewest-function-calls-first optimization on all rules or just dive? If it's just dive, go for it. For other ones, especially the hand-written ones, it's maybe not ideal because the author maybe had a specific reason to sort it in the way it is.
I fully agree there.. if something in hand written, unless it is wrong (like unforeseen duplicates), then it pretty much should stay intact.
On this last PR, why don't you reject it, and I will put some more time in on it. I will do the 'least' commands change, and will also spend some time looking for things that are not duplicates. I did find some.
duplicate rules:
(115)
(53541) 'V
(100438) 'Z
(105211) 'W
These are not duplicates, but passed my random input data due to line lengths. I thought I had long enough stuff, but I guess not.
duplicate rules: (non-optimal)
(185) }[$e
(186) }$e[
(410) ]$e
duplicate rules: (non-optimal)
(225) }o05
(95882) ]^5
duplicate rules: (non-optimal)
(226) }o04
(90399) ^4]
(95884) ]^4
The min-command logic would have caught the 2nd block. But the optimal speed for blocks 3 and 4 is probably the last item, since ] and ^ are probably much faster than the } operator.
I will get the non-duplicate s-chain rules back in where they were removed. That will be quicker. Then I will work a bit on the de-dupe of dive.rule. I am traveling for the next few days, so I may have down time to do this, and I may not; hard to say.
I'll reject the PR as suggested. Please send in a new one when you have time.
Which of these would you rather see in the rules:
$6$6$6
$6Z2
$6$6$6$6
$6Z3
In the 2nd set, I can really see using $6Z3: 2 commands vs 4. But in the other one, I am not sure how much time savings there is vs the ease of reading the rule by non-techies, OR does that matter at all?
I actually like the john rule of Az'666' or Az|6666| but that is not the question ;) It gets much easier for pre-pending: A0'123' vs ^3^2^1 but at this time, this is not in HC, due to lack of handling the z (or l, n, m) length values, or the A command.
But there are quite a few of these append/prepend-the-same-character-multiple-times rules. It may be best to use z or Z for any that are more than 2 of the same character. The pick-least-commands duplicate logic should catch a large percentage of these, since I bet many of them have been duplicated.
definitely $6$6$6 and $6$6$6$6
all about easy to read :)
Seriously ? I thought it was about the fastest way to crack ;)
But I can certainly do $6$6$6....$6 if you want it that way, and leave these readable but non-optimal.
NOTE, some of the reduction of command operations WILL make rules that are more obscure and harder to read, for sure. Many times they become 'better', but sometimes they have fewer commands yet are more obscure (like the $6Z3 vs $6$6$6$6). And I know that some commands may be much slower 'in code' than others. ] has to be pretty damn fast vs other commands that may be a bit slower. I would also think that $ and ^ would be faster than using i, but that may not be the case.
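For reference, the equivalence under discussion, sketched with the documented semantics ($X appends the character X; ZN duplicates the last character N times):

```python
def append_char(word, ch):        # $X: append character X
    return word + ch

def dup_last(word, n):            # ZN: duplicate last character N times
    return word + word[-1] * n

w = "pass"
long_form = append_char(append_char(append_char(append_char(w, "6"), "6"), "6"), "6")
short_form = dup_last(append_char(w, "6"), 3)   # $6Z3
assert long_form == short_form == "pass6666"
print(long_form)  # pass6666
```

Both spellings produce identical candidates; the trade-off is purely command count versus readability.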
Are there any rules which are NOT supported by the GPU that I should NOT be using for some of these rulesets, or be on the lookout for in the rulesets? I have looked in the wiki and other places, and I cannot seem to find much about what is NOT in the GPU, other than the 256-command limit (which is HUGE, btw).
I will spend some time on the dive duplicates found submission. It may end up being more than 1 PR, I do not know. It is huge, and NOT a simple 'rule' I found, but 100's if not 1000's of little things that can cause duplication (like [ or D0 or O01 or {] or ^#[[ etc). These can be trickier to make sure they are dupes (like I mention above, I HAVE found some that are not). Also, getting an auto-written script/program to figure out which of a set of duplicates should be kept AS THE rule, is also not easy. So instead of jamming something in, I will spend a bit of time, and try to get it right, or if still too hard, to break it up into more manageable sizes. But even if we say 1000 rule changes is 'manageable', that means 10 PR's and I think that in itself is a bit much.
I fully agree, such a script is not easy. Anyway, thanks for your effort. Please continue with it.
I am glad we put off the prior work. There were problems with the script. I added much validity checking, making sure that everything matches perfectly. First off, there were 6 comment lines in the file I was not accounting for, so I have to 'start' the adding of offsets at 7. Then, about a third of the way through removing data, the matching line is only at an offset of 6. Somewhere there must be a rule line that is confusing my dupe finder, but for the life of me, I cannot find it. But now that I have the checking logic, that one line really should not matter. (It is one line turning into 2 somehow, but I do not know which one, or why.)
It's amazing what can be done when your flight arrives at 8:55 pm, and the layover departure is at 7:00 am ;) You find all sorts of things to try to make the time pass.....
I've been working on this a bit over the weekend. I think it's now pretty solid and safe. I do have it picking the smallest rule. It chooses the lowest command count, and if 2 rules have the same count, it chooses the shorter (character-wise) of the 2. The exception is for things like $x$x$x..$x or ^x^x^x... Those are kept intact since you wanted them that way, instead of picking the shorter $xZ3 or whatever it is. BUT using the shorter length of the same number of commands gets you things like ^a^b^c instead of i0ai0bi0c type ones, and I find the ^ to be easier to remember.
I have dug down to a couple thousand lines, and all spot checks are 100% duplicates. There are no more 'hidden' non-duplicates that I have seen at all. There are certainly a lot of duplicates which would have been about 100% impossible to spot without software to simply run the rules against wordlists made to flush out the duplication. Rules that do not look ANYTHING close to being the same, but when you dig down and look, they certainly are. There are many that simply have no-op stuff in them, such as i0x$a[ which is simply $a, everything else being busy work that does nothing toward the end result. There's quite a bit of that.
But now, when this all gets done, there will be a little over 98k rules in dive. I think that's about 25% down from where it started. Still the same ruleset. A good one. It will just work that much faster now. Also, I think it will work better, with the 25k x commands now being O commands, which is the 'proper' logic that originally cracked 'something' to get added to the ruleset. You might put a notify out that people may actually want to re-run that ruleset if they have used it since the logic change of x converting into O, or at least to grep the O rules out and run them again.
I will certainly finish up dive this week, and will probably post it as a single PR, and send you off-list the list of work done and the completed rule set so that you can look at it offline. I did re-run the entire ruleset, before and after, against a 6000-word generated list through john, capturing all output words, then uniquing and sorting them. There were about 100 lines not in the new data, but they all had somewhat the same pattern, and all had the same binary garbage in them; I think it is an issue within john's rule processing for some encoding-aware case changing. I am not re-running, since it's about 4 hours of processing, and I have to have the laptop plugged in, since it chews up the battery under that load. I 'know' the rule files are equivalent (certainly for HC); it is just some nuanced issue in john that HC does not deal with (broken unicode letters in the random garbage I generated to find the duplicates). Magnum did a lot of the encoding stuff, and there are times I would like to redo some of it to better handle arbitrary binary data. The encoding logic too easily screws up malformed input data, where I think we should simply leave it as original. But 99.99% of the time, it does not matter at all.
Well, once I get dive done, I will run my dupe-finding code on some of the other rule sets to see how clean they are. The 2 generated ones might have some, as might d3ad. The PWP likely has some in it. Their other rule file seems very clean (hand done). It usually is the randomly generated stuff that is bad about duplicates, because just like writing code, there are 1000's of ways to say foo-bar, lol.
There are many that simply have no op stuff in them, such as i0x$a[ ...
I found the example interesting. Are you going to remove the useless work here?
... and completed rule set so that you can look at it offline.
How would I check through 25% removed rules manually? I could do only a few random picks, and I'd need to have something to compare, so I'd need at least 2 rules. However, you already said it's 100% correct above, so I'll simply trust that and merge.
But now, when this all gets done, there will be a little over 98k rules in dive. I think that's about 25% down from where it started. Still the same ruleset.
Great work, really
There are many that simply have no op stuff in them, such as i0x$a[ ...
I found the example interesting. Are you going to remove the useless work here?
A lot (most?) of this was removed with the last update. However, if a rule was unique in its logic but still had no-op work in it (offsetting crap, or conditions that can never happen), then it would have been left in. However, the method the script used to choose which rule to keep out of a block of duplicate rules would pretty much assure that all no-work commands were removed (most of the time). The 'keep' rules were:
So because of the above 'keeper' rules, most rules that had offsetting or no-op work were scrubbed. So things like i0x$a[ would always get scrubbed, since there was a duplicate $a in the data set. However, if there was i0x$x[i3z$m and there was not a $x$mi3z rule, then the rule that had that extra no-op pair of commands would not have been removed.
Are there any left? I would say there almost certainly are. However, there likely are few of them, being that if it was randomly generated and DID crack something (to get added to the ruleset at some point), then it is highly likely that a more canonical representation of the rule would also be generated and find something and also added to the ruleset at sometime later. This is how almost all of this duplication grew over time. A good 'random' rule like sa4se3so0 was created and cracked a bunch. It gets added. Later, so0sa4se3 gets random generated, and it also cracks a bunch, so it also gets added. However, together, only 1 of those rules will crack something. The other never will.
So to sum up your original question: no, I have not specifically written code to look for these 'never can happen' commands (like usa4) or the offsetting work commands (like ^0[ ). But a large percentage HAVE been found and removed as a side effect of the duplication-removal project.
Here is an interesting block of duplicate rules found in d3ad0ne ;)
duplicate rules:
(5687) *63*63
(8215)
(14308) ^>O01
(14681) ^QD0
(21265) kk
(26422) rr
(29907) tt
(34360) z3O13
(34733) z5O15
(35307) }{
Lots of cycles spinning on those, lol.
An interesting experiment would be to figure out how many 'make work' rules could be made that end up doing no work at all. However, I think that there are infinite ways to use the rules engine to give no end change to word, so I do not think that 'experiment' would get very far.
It may not be bad to start to 'make' some canonical rules for some of these things.
Things like all work on front of word should be first then all work on rear of word.
So ^1$1 would be canonical, but $1^1 would not.
Or, it may be better to claim that commands should be in ASCII sorted order, whenever there are commands which can switch around. (I like that definition better).
Then things like:
sa&se3so0 would be in canonical order, but se3so0sa& would not.
l$1 would not be canonical, but $1l would.
This would not be hard to do within a program. A rule could be rearranged, and then checked to make SURE it is still a duplicate.
Other things: all O01 should be D0, all i0x should be ^x, etc. Actually, all Ox1 should be replaced with Dx. Also, DxDyDz, if x, y, z are contiguous, should be replaced with Ox3 (or Oy3, etc., whatever is compatible). There are more of these multiple-ways-but-one-is-better command blocks.
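Those equivalences can be sanity-checked with a small sketch of the documented delete semantics (DN deletes the character at position N; ONM deletes M characters starting at position N). One caveat my sketch makes explicit: after each D the remaining positions shift, so it is repeating D at the same position that matches an ONM block:

```python
def delete_at(word, n):           # DN: delete character at position N
    return word[:n] + word[n + 1:]

def omit(word, n, m):             # ONM: delete M characters starting at N
    return word[:n] + word[n + m:]

w = "hashcat"
assert omit(w, 0, 1) == delete_at(w, 0) == "ashcat"        # O01 == D0
assert omit(w, 3, 1) == delete_at(w, 3)                    # Ox1 == Dx
# Positions shift after each delete, so D2D2D2 == O23:
assert delete_at(delete_at(delete_at(w, 2), 2), 2) == omit(w, 2, 3) == "haat"
print("equivalences hold")
```

A canonicalizer could apply rewrites like these before comparing rules, then verify against real data that the rewritten rule is still a duplicate.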
I am not sure that the ROI is really worth it in this case. Once we have reduced a rule set so that all rules are unique, whether they are in a canonical format may not matter all that much. Now, if the rule set is being built (especially with random rules), that would be a different story. But I do not think there are any ongoing random-rule accumulation projects. If there are, then it might be smart to address them in some canonical manner, as that will help keep some of the duplication out of the mix. Also, within your rand-generation code, it may be possible to build a canonical 'filter' so that when new random rules are generated, at least they are 'standardized' in some manner.
@jfoug What's the status with the issue? I think you've corrected all the "generated" rulesets, right?
Jim is on a road trip all over US. I think he's back next week.
Examples:
I am building a canonicalization / optimization / duplicate tool for jtr (currently just a prototype in perl slowly being put together, plus some stand-alone C). I am looking to leverage this to help reduce duplicates and/or optimize rules, especially when we start doing rules x rules x rules logic.
I believe the rules files here have been provided by numerous private parties, and some may (and some may not) be maintained by said people. I just wanted to know whether, when I find large groups of technically duplicate rules, it would be welcome for me to document them and/or build PRs to fix them. If not, then feel free to close this issue, and I will not bother.