
Many rule sets contain duplicate rules #301

Closed jfoug closed 7 years ago

jfoug commented 8 years ago

Examples:

[]  orig: ] [
[]  orig: ][
[]$;  orig: ] [ $;
[]$;  orig: ][$;

sg9se3sa4  orig: sa4 se3 sg9
sg9se3sa4  orig: sa4 sg9 se3

A0`01`  orig: ^1 ^0
A0`01`  orig: ^1^0

A0`02`  orig: ^2 ^0
A0`02`  orig: ^2^0

A0`03`  orig: ^3 ^0
A0`03`  orig: ^3^0

sa@se3si1  orig: sa@ si1 se3
sa@se3si1  orig: si1 sa@ se3
sa@se3si1  orig: si1 se3 sa@
sa@se3si1so0  orig: sa@ se3 so0 si1
sa@se3si1so0  orig: sa@ si1 se3 so0
sa@se3si1so0  orig: sa@ si1 so0 se3
sa@se3si1so0  orig: sa@ so0 se3 si1
sa@se3si1so0  orig: sa@ so0 si1 se3
sa@se3si1so0  orig: so0 sa@ se3 si1
sa@se3si1so0  orig: so0 sa@ si1 se3
sa@se3si1so0  orig: so0 se3 sa@ si1
sa@se3si1so0  orig: so0 se3 si1 sa@
sa@se3si1so0  orig: so0 si1 sa@ se3
sa@se3si1so0  orig: so0 si1 se3 sa@
sa@se3si1so0ss5  orig: ss5 sa@ se3 si1 so0
sa@se3si1so0ss5  orig: ss5 sa@ se3 so0 si1
sa@se3si1so0ss5  orig: ss5 sa@ si1 se3 so0
sa@se3si1so0ss5  orig: ss5 sa@ si1 so0 se3
sa@se3si1so0ss5  orig: ss5 sa@ so0 se3 si1
sa@se3si1so0ss5  orig: ss5 sa@ so0 si1 se3
sa@se3si1so0ss5  orig: ss5 se3 sa@ si1 so0
sa@se3si1so0ss5  orig: ss5 se3 sa@ so0 si1
sa@se3si1so0ss5  orig: ss5 se3 si1 sa@ so0
sa@se3si1so0ss5  orig: ss5 se3 si1 so0 sa@
sa@se3si1so0ss5  orig: ss5 se3 so0 sa@ si1
sa@se3si1so0ss5  orig: ss5 se3 so0 si1 sa@
sa@se3si1so0ss5  orig: ss5 si1 sa@ se3 so0
sa@se3si1so0ss5  orig: ss5 si1 sa@ so0 se3
sa@se3si1so0ss5  orig: ss5 si1 se3 sa@ so0
sa@se3si1so0ss5  orig: ss5 si1 se3 so0 sa@
sa@se3si1so0ss5  orig: ss5 si1 so0 sa@ se3
sa@se3si1so0ss5  orig: ss5 si1 so0 se3 sa@
sa@se3si1so0ss5  orig: ss5 so0 sa@ se3 si1
sa@se3si1so0ss5  orig: ss5 so0 sa@ si1 se3
sa@se3si1so0ss5  orig: ss5 so0 se3 sa@ si1
sa@se3si1so0ss5  orig: ss5 so0 se3 si1 sa@
sa@se3si1so0ss5  orig: ss5 so0 si1 sa@ se3
sa@se3si1so0ss5  orig: ss5 so0 si1 se3 sa@
sa@se3si1ss5  orig: ss5 sa@ se3 si1
sa@se3si1ss5  orig: ss5 sa@ si1 se3
sa@se3si1ss5  orig: ss5 se3 sa@ si1
sa@se3si1ss5  orig: ss5 se3 si1 sa@
sa@se3si1ss5  orig: ss5 si1 sa@ se3
sa@se3si1ss5  orig: ss5 si1 se3 sa@
sa@se3so0  orig: sa@ so0 se3
sa@se3so0  orig: so0 sa@ se3
sa@se3so0  orig: so0 se3 sa@
sa@se3so0ss5  orig: ss5 sa@ se3 so0
sa@se3so0ss5  orig: ss5 sa@ so0 se3
sa@se3so0ss5  orig: ss5 se3 sa@ so0
sa@se3so0ss5  orig: ss5 se3 so0 sa@
sa@se3so0ss5  orig: ss5 so0 sa@ se3
sa@se3so0ss5  orig: ss5 so0 se3 sa@
sa@se3ss5  orig: ss5 sa@ se3
sa@se3ss5  orig: ss5 se3 sa@
sa@si!so0  orig: sa@ so0 si!
sa@si!so0  orig: sa@so0si!
sa@si1so0  orig: sa@ so0 si1
sa@si1so0  orig: so0 sa@ si1
sa@si1so0  orig: so0 si1 sa@
sa@si1so0ss5  orig: ss5 sa@ si1 so0
sa@si1so0ss5  orig: ss5 sa@ so0 si1
sa@si1so0ss5  orig: ss5 si1 sa@ so0
sa@si1so0ss5  orig: ss5 si1 so0 sa@
sa@si1so0ss5  orig: ss5 so0 sa@ si1
sa@si1so0ss5  orig: ss5 so0 si1 sa@
sa@si1ss5  orig: ss5 sa@ si1
sa@si1ss5  orig: ss5 si1 sa@

I am building a canonicalization / optimization / duplicate-removal tool for JtR (currently just a prototype in Perl being slowly put together, plus some standalone C). I am looking to leverage this to help reduce duplicate and/or non-optimal rules, especially when we start doing rules x rules x rules logic.

I believe the rule files here have been provided by numerous private parties, and some may or may not still be maintained by those people. I just wanted to know whether it would be wanted that, when I find large groups of technically duplicate rules, I document them and/or build PRs to fix them. If not, feel free to close this issue and I will not bother.

jsteube commented 8 years ago

I don't know, for example:

A0`01`  orig: ^1 ^0
A0`01`  orig: ^1^0

This would be invalid in hashcat.

jfoug commented 8 years ago

That was my own tool's output (it is not in the rule file).

In the rule file there are these 2 lines

^1 ^0
^1^0

They do exactly the same thing; these are 100% identical rules.

Some of the other instances are not as apparent.

ss5 sa@ se3 si1 so0
ss5 sa@ se3 so0 si1
ss5 sa@ si1 se3 so0
ss5 sa@ si1 so0 se3
ss5 sa@ so0 se3 si1
ss5 sa@ so0 si1 se3
ss5 se3 sa@ si1 so0
ss5 se3 sa@ so0 si1
ss5 se3 si1 sa@ so0
ss5 se3 si1 so0 sa@
ss5 se3 so0 sa@ si1
ss5 se3 so0 si1 sa@
ss5 si1 sa@ se3 so0
ss5 si1 sa@ so0 se3
ss5 si1 se3 sa@ so0
ss5 si1 se3 so0 sa@
ss5 si1 so0 sa@ se3
ss5 si1 so0 se3 sa@
ss5 so0 sa@ se3 si1
ss5 so0 sa@ si1 se3
ss5 so0 se3 sa@ si1
ss5 so0 se3 si1 sa@
ss5 so0 si1 sa@ se3
ss5 so0 si1 se3 sa@

All of those lines are in the file, yet there is really only 1 rule; the rest are duplicates.

There are literally thousands of such lines (possibly tens of thousands). They are absolutely wasted cycles when used.

jfoug commented 8 years ago

One thing that would clean up a lot of them is to first run the rules through space-removal code into a temp file (preserving line order), then do an in-place file unique (not a sort | uniq). Then, using the original space-trimmed file and the in-place unique file, you can write a script to determine which lines need to be removed from the original file, and remove them (I assume people would want to keep the easier-to-read .rule file with the spaces separating each rule primitive).
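
A minimal sketch of that procedure, assuming Perl (the prototype language mentioned earlier) and assuming the naive strip-all-spaces key described here (the thread later refines this to spare spaces that are literal rule arguments); this is illustrative only, not jfoug's actual tool:

use strict;
use warnings;

# Keep the first occurrence of each rule, keyed on its space-stripped form,
# and emit the original (readable, spaced) line so order and formatting survive.
my %seen;
while (my $line = <>) {
    chomp $line;
    (my $key = $line) =~ s/ //g;          # naive canonical key: all spaces removed
    next if $key ne '' && $seen{$key}++;  # skip the 2nd and later duplicates
    print "$line\n";
}

Run as, for example, perl dedupe_ws.pl dive.rule > dive_dedupe.rule (the script name is only a placeholder).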

Once that clean(er) file is made, the search for the harder-to-find dupes (such as the leet duplicates) could be done. There are many others which can be found without much problem.

It will be about impossible to find all duplicates, especially in randomly generated rule sets, but a large number of them can be found. Some duplicates are very hard to spot, since they use different operations to do the same thing, OR have some operations which are no-ops, thus hiding them. Simple examples:

$ ]xA3
xA3

$<$k]
$<

These are harder to find, but it can be done. The ones that are almost impossible to find without running data through the rules engine are the ones using different operations to do the same thing. I do not have any ready examples from the rule files, but made-up examples would be things such as:

]]
}[}[
}}[[
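
To make the "run data through the rules engine" idea concrete, here is a hypothetical Perl sketch (not the actual tool): it simulates only a small subset of rule functions, applies every rule to a few sample words, and groups rules whose outputs all match as probable duplicates. The sample words, the supported-command subset, and the assumption of well-formed rules are all placeholders.

use strict;
use warnings;

my @samples = qw(Password1 ZERO qwerty123 abcdefghij X);

# Apply one (space-stripped) rule to one word; unsupported commands abort the rule.
sub apply_rule {
    my ($rule, $w) = @_;
    my @c = split //, $rule;
    for (my $i = 0; $i < @c; $i++) {
        my $f = $c[$i];
        if    ($f eq ']') { chop $w; }                                                # delete last char
        elsif ($f eq '[') { $w = substr($w, 1) if length $w; }                        # delete first char
        elsif ($f eq '}') { $w = substr($w, -1) . substr($w, 0, -1) if length $w; }   # rotate right
        elsif ($f eq '{') { $w = substr($w, 1) . substr($w, 0, 1) if length $w; }     # rotate left
        elsif ($f eq '$') { $w .= $c[++$i]; }                                         # append char
        elsif ($f eq '^') { $w = $c[++$i] . $w; }                                     # prepend char
        elsif ($f eq 's') { my $x = $c[++$i]; my $y = $c[++$i]; $w =~ s/\Q$x\E/$y/g; }# replace all x with y
        elsif ($f eq 'l') { $w = lc $w; }
        elsif ($f eq 'u') { $w = uc $w; }
        elsif ($f eq 'r') { $w = reverse $w; }
        else { return undef; }
    }
    return $w;
}

my %group;
while (my $rule = <>) {
    chomp $rule;
    (my $norm = $rule) =~ s/\s+//g;
    next if $norm eq '';
    my @out = map { apply_rule($norm, $_) } @samples;
    next if grep { !defined } @out;                 # contains a command this sketch does not simulate
    push @{ $group{ join "\x00", @out } }, $rule;   # identical outputs => candidate duplicates
}
for my $g (values %group) {
    print join("\n", @$g), "\n\n" if @$g > 1;
}

Fed the three made-up rules above, this reports ]], }[}[ and }}[[ as one group, since each of them just strips the last two characters.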

jfoug commented 8 years ago

$ wc ../dive.rule dive_dedupe.rule
 123289  262282  973248 ../dive.rule
 112552  244212  897871 dive_dedupe.rule
 235841  506494 1871119 total

This was just the pure, easy-to-find duplicates: rules that differed only in the spaces between rule commands. This was an 'in-place' removal; only the 2nd (or later) occurrence of each duplicate was removed.

This change, in and of itself, gives about a 10% overall throughput improvement when using that ruleset. There are still duplicates in there, btw; these are just the easiest to find and remove automatically with scripts.

jfoug commented 8 years ago

Created PR #304 with the changes (only changes to dive.rule, and ONLY the purely whitespace-identical rules).

But there are over 10k of them.

jsteube commented 8 years ago

What other types of rule removal does your script support? Did you check other rules files than dive.rule?

jfoug commented 8 years ago

Right now, the only thing 'automated' is the finding of extraneous white space. No, I have not run that against other rule files, but I easily could. I imagine that most of the rule files which are accumulations of 'randomly' generated rules will have duplicates in them.

One other thing about dive (and likely others) is that the xNM commands likely should be changed to ONM. I do not know if all of them should, but looking at some of them, it is obvious that the logic of the current xNM was not right for the rule to find much. Things like

x42
'8 x42

are almost certainly supposed to be the logic for

O42
'8 O42

jsteube commented 8 years ago

The logic behind the x and O rules changed recently; that may explain it.

jsteube commented 8 years ago

We just found something that is at least "suspicious". Can you please verify it's correct?

Here's an example: you've removed both of the following rules:

^}
^ }

The thing is that those produce two completely different outcomes. The first is just one function call, and the second is two function calls (^} prepends the literal character '}', while ^ followed by a space prepends a space and then } rotates the word right). Funny :)

jfoug commented 8 years ago

I found about 30 or so like this. I have redone the work, just not pushed the changes. For the 2nd pass, I actually parse out entire 'proper' rule commands based upon their length, and any leftover white space is removed. The first run (the one checked in right now) was simply cat x | sed 's/ //g', so yes, what happened above failed to be viewed as distinct, and does need to be reverted.
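
A hypothetical sketch of that second, length-aware pass (again Perl, and again not the actual script): tokenize each rule into commands according to how many argument characters each command takes, so that a literal space argument (as in ^ followed by a space) survives and only the whitespace between commands is dropped. The command table below is deliberately partial and is an assumption; a real tool needs the full hashcat rule grammar.

use strict;
use warnings;

# Partial table: how many argument characters follow each command letter.
my %argc = (
    ':' => 0, 'l' => 0, 'u' => 0, 'c' => 0, 'r' => 0, 'd' => 0,
    '{' => 0, '}' => 0, '[' => 0, ']' => 0, 'k' => 0, 'K' => 0,
    '$' => 1, '^' => 1, 'T' => 1, 'p' => 1, 'D' => 1, "'" => 1,
    'z' => 1, 'Z' => 1, '+' => 1,
    's' => 2, 'x' => 2, 'O' => 2, 'i' => 2, 'o' => 2, '*' => 2,
);

# Remove only the whitespace *between* commands; assumes a well-formed rule line.
sub strip_rule {
    my ($line) = @_;
    my @c = split //, $line;
    my @out;
    for (my $i = 0; $i < @c; ) {
        if ($c[$i] =~ /\s/) { $i++; next; }                        # separator space: drop it
        my $n = $argc{ $c[$i] } // 0;                              # unknown command: assume no arguments
        push @out, join('', grep { defined } @c[$i .. $i + $n]);   # command plus its arguments
        $i += $n + 1;
    }
    return join '', @out;
}

print strip_rule("^1 ^0"), "\n";   # prints ^1^0
print strip_rule("^ }"), "\n";     # prints the rule unchanged: the space is an argument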

jfoug commented 8 years ago

This shows the changes, and yes, there are rules in dive.rule that need to be put back, since the spaces removed WERE part of valid commands (mostly $ or ^ rules):

https://github.com/jfoug/oclHashcat/commit/7cb4ab0314444420826f9e262aba72d7d2634902

I could do a PR now with what is there, but I was going to look for other dupes. I know there are a lot more in dive.rule; I just do not know how easily they can be 'auto' removed. I certainly do not want to do hand edits for 100's or 1000's of different lines. That would take forever.

Some I have ways to find; I just need to make changes so that I can auto-remove them. Those are:

  1. pure 's' command ordering
  2. cycles in s commands, like sa4si1sa@ (made up), where the sa@ is a no-op
  3. cases where things like $ and ] or ^ and [ totally offset each other (i.e. become no-ops)
  4. the order of blocks of [ and ] rules, i.e. [] vs ][ or [[[] vs ][]], etc.

I have code to point out the above dupes, but nothing to auto-remove them (yet). I bet there are many more 'types' of duplicate rules, but I have not looked much deeper.

If you want me to, I could push for just the dive.rule line additions to get it corrected, and then wait on the rest until I have more done.

jfoug commented 8 years ago

PR #307 adds back 30+ rules into dive.rule which were over-aggressively removed with a brain-dead sed space-removal command. The 2nd time I used an intelligent space removal based upon the length of rule commands. The patch uploaded restores the few items which were erroneously removed the first time. There are some other duplicate removals in that first patch; most (all?) were actually identical lines, not even space duplicates.

Then the 2nd commit is a bunch of harder-to-find dupes. The commit includes a list of the exact lines removed from the dive.rule file. I also listed the canonicalization rules I used. I did things like sort blocks of 's' commands, sort blocks of ^ and $ commands, sort blocks of [ and ] commands, remove pairs of $x] and ^x[ commands, etc. The comment in the commit lists them all. I believe a lot of that duplication comes either from random generation of rules run at different times, OR possibly from someone making rulesets (in the case of s ordering) without fully grasping what makes a covering set, or using some script to generate them and having a busted permutation generator.
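
As one concrete illustration of the "remove pairs of $x] and ^x[ commands" step (a hypothetical sketch, not the script from the PR): an append immediately followed by delete-last, or a prepend immediately followed by delete-first, is always a no-op, so such pairs can be deleted repeatedly from a tokenized command list.

# Assumes the rule is already split into commands, e.g. ('$<', '$k', ']').
sub drop_noop_pairs {
    my @cmd = @_;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        for my $i (0 .. $#cmd - 1) {
            if (($cmd[$i] =~ /^\$.$/s && $cmd[$i + 1] eq ']') ||
                ($cmd[$i] =~ /^\^.$/s && $cmd[$i + 1] eq '[')) {
                splice @cmd, $i, 2;   # the pair cancels out; drop both commands
                $changed = 1;
                last;                 # rescan from the start
            }
        }
    }
    return @cmd;
}

# drop_noop_pairs('$<', '$k', ']') yields ('$<'), matching the $<$k] example earlier.

Non-adjacent or disguised cancellations (like the i0x$a[ case that comes up later in the thread) still need the data-driven comparison.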

I am sure I will think of other dupe things over time ({ and } pairs are a no-op, etc.). But some of the other dupes (there still are many) are much harder to find. Some of them are not 100% duplicates either, just duplicates for words of certain lengths (such as '5 and p2'5 and d'5).

jsteube commented 8 years ago

I'll revert the change on dive.rule regarding the sed issue. But I can't yet merge the new PR because it doesn't take into account that oclHashcat does not support all the functions hashcat (CPU) supports.

jfoug commented 8 years ago

Are some of the dupes there due to these issues? That can be handled by changing the code to remove the duplicate which does not work on oclHashcat instead of simply removing the '2nd seen' duplicate.

jfoug commented 8 years ago

Can you point me to documentation on which rules are not supported by oclHashcat?

jfoug commented 8 years ago

One thing I could do is simply provide the script files. There is nothing I am hiding in there.

I do think that, no matter what, we need to have separate commits for each type of duplication found.

This way, it is much easier to validate that everything being removed fits the rule properly, and that ALL are really duplicates.

jsteube commented 8 years ago

WOW, this is getting complicated :) You've removed the PR, which is fine. Are you going to create a new one?

jfoug commented 8 years ago

Yes. But I will do this step by step, since things are MUCH easier to see if only one type of item is done at a time.

By far the WORST offender is the errant-space duplicate problem in dive.rule. That one alone is huge. The others do exist, but are not nearly as bad; however, for some of those it is much more complex to see that the removals are right.

jsteube commented 8 years ago

dive.rule was created automatically

jfoug commented 8 years ago

It is full of good rules, but is at least 15% slower (due to duplicates) than it should be.

jfoug commented 8 years ago

PR #308 contains 2 commits.

The first is 100% 'pure' duplicate lines (i.e. lines that are already 100% the same as some prior line).

The 2nd is duplicates found by removal of superfluous white space (space characters). This was done by first tokenizing the rules into proper commands of the right length, then removing all extra white space. This logic replaces the faulty cat xxx.rule | sed 's/ //g' > xxx_missing_ws.rule logic, which removed about 40 rules which were not duplicates.

Within comments in the PR is all of the scripting code used to find these duplicates and auto-remove them from the rule files.

jfoug commented 8 years ago

I will make no further PR about dupes, until these have been fully accepted. I do not foresee any problem here, since these really are 100% 'pure' duplicate rules.

Once everything is happy here, I will start removing other duplicates, but only 1 'rule type' at a time, so that it is easier to validate only correct duplicate rule lines are removed.

jsteube commented 8 years ago

OK, merged

jsteube commented 8 years ago

What's the next step?

jfoug commented 8 years ago

I am going to work on it. But I want to do it a step at a time, so we can more easily validate that the changes ARE valid. There have been some missteps, and finding them is not always easy when there are multiple types of duplication that are much more complex than what the diff program can find ;)

jfoug commented 8 years ago

The next thing I am looking for is 's' reductions. This can remove some duplicates, but will also change some rules.

The 'reductions' are like this:

sabsbc becomes sbc
sabsac becomes sab

This will expose some duplicates, and also reduces the number of operations in some rules. It has to be done before I start to reorder the s commands. That reordering finds many duplicates, but if these kinds of no-ops and cycles are present, then the reordering is not valid.

jfoug commented 8 years ago

Unfortunately these have to be dealt with by hand; I do not have an 'auto' fixer tool for these, though I can find them. These are the rules in dive.rule that I have found:

R0se5s5&  turned into R0se&
s.hx13sh<  turned into s.<x13
slOsOZ  turned into slZ
sk@s@{oB8  turned into sk{oB8
+8sPMsMW  turned into +8sPW
sYfsfut  turned into sYut
sTDo5OsT*  turned into sTDo5O
so!$=s!=  turned into so=$=
so!s!=$=  turned into so=$=
oA[s2dsdX  turned into oA[s2X
s9Cs9W  turned into s9C
sDJ*7AsD@  turned into sDJ*7A
s7ys7)  turned into s7y
sDk*20sDF  turned into sDk*20
s^6s6I  turned into s^I
siNsNk  turned into sik
sOusOG  turned into sOu
sfusue  turned into sfe
sZEsEd  turned into sZd
sXcs^Us^`  turned into sXcs^U
y2s|asaR  turned into y2s|R
sVusV*o1{  turned into sVuo1{
syMsyP  turned into syM
saGsGE  turned into saE

jsteube commented 8 years ago

I'm a bit confused, do you plan to send in another PR?

jfoug commented 8 years ago

Yes I do, but I sent you the big change off-list. I wanted to hear from you first, since this was a set of MUCH HARDER to spot duplication; some are trivial to see, some not so much. I just thought you might want to run some testing yourself before we accept this PR.

But I can get a PR done, no problem.

Btw, that PR will be the last for dive (unless we go through and look for and remove no-op commands from some rules to speed them up). Once that PR is done, there are NO more duplicates in dive; the method of running data through the rules flushes out all of them.

jsteube commented 8 years ago

While looking at some of the optimizations, here's one type that I find a bit suspicious:

sZEsEd turned into sZd

Now that looks fine at first glance, but if you have a word in your wordlist which contains 'E', the result of the optimized rule is different from the original one.

Anyway, I think this would be an edge case and I think I'll merge anyway. I'll think about it.

jfoug commented 8 years ago

I see your point, you are 100% correct.

ZERO sZEsEd == ddRO
ZERO sZd    == dERO

That certainly is different. That 'chain' optimization was one I 'saw' by hand and then coded for. We certainly can undo them. There were some, but not a huge amount. They should have all been in one of the earlier commits, so they should be easy to undo, and I certainly can get them back in.

You are not seeing these in the current dup set? I would think not... With the data provided, I would find it hard to see how anything like that would not be seen as different.

jfoug commented 8 years ago

One thing I 'should' have done with the last set of duplicates is to search each block that was all the same and choose the rule with the smallest number of operations. It would have removed more of the 'non-optimal' rules. A lot of them early on were optimal (some just 1 command), but I am pretty sure that some of them were not the best of their group. The program I wrote simply took the first rule of the group (i.e. the earliest line number at which that 'type' was found).

If you have not merged yet, I 'could' redo the commit and change the logic to pick the rule with the fewest commands.

jfoug commented 8 years ago

I see you have not merged that yet, so if you want me to, I can certainly change the PR to behave that way. It would make better, slightly quicker rules.

jsteube commented 8 years ago

OK, please undo that specific undupe logic.

Do you plan to apply the fewest-function-calls-first optimization to all rules or just dive? If it's just dive, go for it. For the other ones, especially the hand-written ones, it's maybe not ideal, because the author may have had a specific reason to order them the way they are.

jfoug commented 8 years ago

I fully agree there. If something is hand-written, unless it is wrong (like unforeseen duplicates), it pretty much should stay intact.

On this last PR, why don't you reject it, and I will put some more time in on it. I will do the 'least commands' change, and will also spend some time looking for things that are not duplicates. I did find some.

duplicate rules:
(115)
(53541)  'V
(100438)  'Z
(105211)  'W

These are not duplicates, but they passed my random input data due to the lengths of the test words. I thought I had long enough input, but I guess not.

duplicate rules: (non-optimal)
(185)  }[$e
(186)  }$e[
(410)  ]$e
duplicate rules: (non-optimal)
(225)  }o05
(95882)  ]^5
duplicate rules: (non-optimal)
(226)  }o04
(90399)  ^4]
(95884)  ]^4

The min-command logic would have caught the 2nd block. But for blocks 3 and 4, the optimal item for speed is probably the last one, since ] and ^ are probably much faster than the } operator.

jfoug commented 8 years ago

I will get the non-duplicate s-chain rules back in where they were removed; that will be quicker. Then I will work a bit on the de-dupe of dive.rule. I am traveling for the next few days, so I may have downtime to do this, and I may not; hard to say.

jsteube commented 8 years ago

I'll reject the PR as suggested. Please send in a new one when you have time.

jfoug commented 8 years ago

Which of these would you rather see in the rules:

 $6$6$6
 $6Z2
 $6$6$6$6
 $6Z3

In the 2nd set, I can really see using $6Z3: 2 commands vs 4. But in the other one, I am not sure how much time savings there is vs the ease of reading the rule by non-techies, OR does that matter at all?

I actually like the john rule of Az'666' or Az|6666|, but that is not the question ;) It gets much easier for prepending: A0'123' vs ^3^2^1. But at this time this is not in HC, due to lack of handling of the z (or l, n, m) length values, or the A command.

But there are quite a few of these rules that append/prepend the same character multiple times. It may be best to use z or Z for any that repeat the same character more than 2 times. The pick-least-commands duplicate logic should catch a large percentage of these, since I bet many of them have been duplicated.

jsteube commented 8 years ago

definitely $6$6$6 and $6$6$6$6

all about easy to read :)

jfoug commented 8 years ago

all about easy to read :)

Seriously? I thought it was about the fastest way to crack ;)

But I can certainly do $6$6$6....$6 if you want it that way, and leave these readable but non-optimal.

NOTE: some of the reductions of command operations WILL make rules that are more obscure and harder to read, for sure. Many times they become 'better', but sometimes they have fewer commands yet are more obscure (like $6Z3 vs $6$6$6$6). And I know that some commands may be much slower 'in code' than others; ] has to be pretty damn fast vs other commands that may be a bit slower. I would also think that $ and ^ would be faster than using i, but that may not be the case.

Are there any rules which are NOT supported by the GPU that I should NOT be using for some of these rulesets, or be on the lookout for in the rulesets? I have looked in the wiki and other places, and I cannot seem to find much about what is NOT in the GPU, other than the 256-command limit (which is HUGE, btw).

jfoug commented 8 years ago

PR #323 adds back those rules that were NOT duplicates in all cases. Note, I did not remove the stripped-down rules (just in case they also randomly crack something, lol).

I will spend some time on the submission for the duplicates found in dive. It may end up being more than 1 PR, I do not know. It is huge, and NOT a simple 'rule' I found, but 100's if not 1000's of little things that can cause duplication (like [ or D0 or O01 or {] or ^#[[ etc). These can be trickier to make sure they are dupes (as I mentioned above, I HAVE found some that are not). Also, getting an auto-written script/program to figure out which of a set of duplicates should be kept AS THE rule is also not easy. So instead of jamming something in, I will spend a bit of time and try to get it right, or, if it is still too hard, break it up into more manageable sizes. But even if we say 1000 rule changes is 'manageable', that means 10 PR's, and I think that in itself is a bit much.

jsteube commented 8 years ago

I fully agree, such a script is not easy. Anyway, thanks for your effort. Please continue with it.

jfoug commented 8 years ago

I am glad we put off the prior work. There were problems with the script. I added a lot of validity checking, making sure that everything matches perfectly. First off, there were 6 comment lines in the file I was not accounting for, so I have to 'start' the line-offset adjustment at 7. Then, about a third of the way through removing data, the matching line is only at an offset of 6. Somewhere there must be a rule line that is confusing my dupe finder, but for the life of me I cannot find it. Now that I have the checking logic, though, that one line really should not matter. (It is one line turning into 2 somehow, but I do not know which one, or why.)

It's amazing what can be done when your flight arrives at 8:55 pm and the layover departure is at 7:00 am ;) You find all sorts of things to try to make the time pass.....

jfoug commented 8 years ago

I've been working on this a bit over the weekend. I think it's now pretty solid and safe. I do have it picking the smallest rule: it chooses the lowest command count, and if 2 rules have the same count, it chooses the shorter (character-wise) of the 2. The exception is for things like $x$x$x..$x or ^x^x^x...; those are kept intact since you wanted them that way, instead of picking the shorter $xZ3 or whatever it is. BUT using the shorter length at the same command count gets you things like ^a^b^c instead of the i0ai0bi0c type ones, and I find the ^ easier to remember.
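
A hypothetical sketch of that selection logic (the names and the pre-tokenized input format are assumptions, not the actual script): prefer the rule with the fewest commands, break ties on character length, but keep a pure $x$x$x... / ^x^x^x... run as-is rather than swap in a shorter equivalent.

# Each candidate: { text => 'rule as written', cmds => [ list of its commands ] }.
sub pick_keeper {
    my @group = @_;
    my $best;
    for my $r (@group) {
        # exception: a readable run of the same append/prepend command is always kept
        return $r->{text} if is_char_run($r->{cmds});
        if (!defined $best
            || @{ $r->{cmds} } < @{ $best->{cmds} }
            || (@{ $r->{cmds} } == @{ $best->{cmds} }
                && length($r->{text}) < length($best->{text}))) {
            $best = $r;
        }
    }
    return $best->{text};
}

sub is_char_run {
    my ($cmds) = @_;
    return 0 if @$cmds < 3;                   # only care about runs of 3 or more
    my $first = $cmds->[0];
    return 0 unless $first =~ /^[\$^].$/s;    # must be a $X or ^X command
    return scalar(grep { $_ eq $first } @$cmds) == scalar(@$cmds);
}

For a group containing ^a^b^c and i0ai0bi0c (both 3 commands), the shorter ^a^b^c wins; for $6$6$6$6 vs $6Z3, the run exception keeps $6$6$6$6.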

I have dug down to a couple thousand lines, and all spot checks are 100% duplicates. There are no more 'hidden' non-duplicates that I have seen at all. There are certainly a lot of duplicates which would have been about 100% impossible to spot without software to simply run the rules against wordlists made to flush out the duplication: rules that do not look ANYTHING close to the same, but when you dig down and look, they certainly are. There are many that simply have no-op stuff in them, such as i0x$a[. That is simply $a, with everything else just being busy work doing nothing towards the end result. There's quite a bit of that.

But now, when this all gets done, there will be a little over 98k rules in dive. I think that's about 25% down from where it started. Still the same ruleset, a good one; it will just work that much faster now. Also, I think it will work better, with the 25k x commands now being O commands, which is the 'proper' logic that originally cracked 'something' to get added to the ruleset. You might put a notice out that people may actually want to re-run that ruleset if they have used it since the logic change of x converting into O, or at least grep the O rules out and run them again.

I will certainly finish up dive this week, and will probably post it as a single PR, and send you off-list the list of work done and the completed rule set so that you can look at it offline. I did re-run the entire ruleset before and after against a 6000-word generated list through john, capturing all output words, then uniquing and sorting them. After the processing, sorting, and uniquing, there were about 100 lines not in the new data, but they all had somewhat the same pattern and the same binary garbage in them, and I think it is an issue within john's rule processing for some encoding-aware case changing. I am not re-running it, since it's about 4 hours of processing, and I have to have the laptop plugged in, since it chews up the battery under that load. I 'know' the rule files are equivalent (certainly for HC); it is just some nuanced issue in john that HC does not deal with (broken unicode letters in the random garbage I generated to find the duplicates). Magnum did a lot of the encoding stuff, and there are times I would like to re-do some of it to better handle arbitrary binary data. The encoding logic too easily screws up malformed input data, where I think we should simply leave it as original. But 99.99% of the time, it does not matter at all.

Well, once I get dive done, I will run my dupe-finding code on some of the other rule sets to see how clean they are. The 2 generated ones might have some, as might d3ad. The PWP one likely has some in it; their other rule file seems very clean (hand done). It usually is the randomly generated stuff that is bad about duplicates, because, just like writing code, there are thousands of ways to say foo-bar, lol.

jsteube commented 8 years ago

There are many that simply have no op stuff in them, such as i0x$a[ ...

I found the example interesting. Are you going to remove the useless work here?

... and completed rule set so that you can look at it offline.

How would I check through 25% of removed rules manually? I could do only a few random picks, and I'd need to have something to compare, so I'd need at least 2 rules. However, you already said it's 100% correct above; I'll simply trust that and am going to merge.

But now, when this all gets done, there will be a little over 98k rules in dive. I think that's about 25% down from where it started. Still the same ruleset.

Great work, really

jfoug commented 8 years ago

There are many that simply have no op stuff in them, such as i0x$a[ ...

I found the example interesting. Are you going to remove the useless work here?

A lot (most?) of this was removed with the last update. However, if a rule was unique in its logic but still had no-op work in it (offsetting crap, or conditions that can never happen), then it would have been left in. However, the method I used in the script to choose the rule to keep out of each block of duplicate rules would pretty much assure that all no-work commands were removed (most of the time). The 'keep' rules were:

  1. If 2 rules had a differing number of commands, keep the rule with fewer commands.
  2. The exception to rule 1 was $x$x$x... or ^y^y^y...; those were kept over shorter-count rules such as $xZ3.
  3. If 2 rules had the same number of commands but different lengths (errant spaces removed for length checks), the shorter rule was kept. This keeps things such as ^3^2^1 over i01i12i23.

So because of the above 'keeper' rules, most rules that had offsetting or no-op work were scrubbed. So things like i0x$a[ would always get scrubbed, since there was a duplicate $a in the data set. However, if there was i0x$x[i3z$m and there was not a $x$mi3z rule, then the rule that had that extra no-op pair of commands would not have been removed.

Are there any left? I would say there almost certainly are. However, there are likely few of them: if a rule was randomly generated and DID crack something (to get added to the ruleset at some point), then it is highly likely that a more canonical representation of the rule would also be generated at some later time, find something, and also get added to the ruleset. This is how almost all of this duplication grew over time. A good 'random' rule like sa4se3so0 was created and cracked a bunch, so it got added. Later, so0sa4se3 got randomly generated, and it also cracked a bunch, so it also got added. However, when run together, only 1 of those rules will ever crack something; the other never will.

So, to sum up your original question: no, I have not specifically written code to look for these 'can never happen' commands (like usa4) or the offsetting work commands (like ^0[ ). But a large percentage HAVE been found and removed as a side effect of the duplication-removal project.

jfoug commented 8 years ago

Here is an interesting block of duplicate rules found in d3ad0ne ;)

duplicate rules:
(5687)  *63*63
(8215)
(14308)  ^>O01
(14681)  ^QD0
(21265)  kk
(26422)  rr
(29907)  tt
(34360)  z3O13
(34733)  z5O15
(35307)  }{

Lots of cycles spinning on those, lol.

An interesting experiment would be to figure out how many 'make work' rules could be made that end up doing no work at all. However, I think there are infinite ways to use the rules engine to make no net change to a word, so I do not think that 'experiment' would get very far.

jfoug commented 8 years ago

It may not be a bad idea to start to 'make' some canonical forms for some of these things.

Things like: all work on the front of the word should come first, then all work on the rear of the word.

So ^1$1 would be canonical, but $1^1 would not.
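
A minimal sketch of that convention, assuming Perl and assuming a rule built only from ^X and $Y commands (mixed rules need more care): a stable partition that keeps the relative order within each group preserves the result, since prepends never touch the tail and appends never touch the head.

# e.g. canonicalize_ends('$1', '^1') returns ('^1', '$1')
sub canonicalize_ends {
    my @cmds  = @_;
    my @front = grep { /^\^/ } @cmds;   # all prepend commands, original order kept
    my @rear  = grep { /^\$/ } @cmds;   # all append commands, original order kept
    return (@front, @rear);
}

In a de-dupe pass, the re-arranged rule would then be checked against the original (for example with a simulator run over test words) to make SURE it is still equivalent.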

Or, it may be better to say that commands should be in ASCII-sorted order whenever there are commands which can be switched around. (I like that definition better.)

Then things like:

sa&se3so0 would be in canonical order, but se3so0sa& would not
l$1 would not be canonical, but $1l would

This would not be hard to do within a program. A rule could be re-arranged, and then checked to make SURE it is still a duplicate.

Other things: all O01 should be D0, i0x should be ^x, etc. Actually, all Ox1 should be replaced with Dx. Also, DxDyDz, if x, y, z are contiguous, should be replaced with Ox3 (or Oy3, etc., whatever is compatible). There are more of these 'multiple ways to write it, but one way is better' command blocks.
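
A tiny sketch of those synonym rewrites, operating on pre-tokenized commands (hypothetical, not an existing script):

# Rewrite single-character forms into their shorter synonyms.
sub canonical_synonyms {
    my @cmds = @_;
    for (@cmds) {
        $_ = "D$1" if /^O(.)1$/s;   # ON1 (omit one char at N) is the same as DN
        $_ = "^$1" if /^i0(.)$/s;   # i0X (insert X at position 0) is the same as ^X
    }
    return @cmds;
}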

I am not sure that the ROI is really worth it in this case. Once we have reduced a rule set so that all rules are unique, whether they are in a canonical format or not may not matter all that much. Now, if the rule set is still being built (especially with random rules), that would be a different story. But I do not think there are any ongoing random-rule accumulation projects; if there are, then it might be smart to address them in some canonical manner, as that will help keep some of the duplication out of the mix. Also, within your random-rule generation code, it may be possible to build a canonical 'filter' so that when new random rules are generated, at least they are 'standardized' in some manner.

jsteube commented 8 years ago

@jfoug What's the status of this issue? I think you've corrected all the "generated" rulesets, right?

magnumripper commented 8 years ago

Jim is on a road trip all over the US. I think he's back next week.