Pyrit does not scale well for multiple GPUs

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

Two video cards installed, everything installed and running. Here is a benchmark:

#1: 'CAL++ Device #1 'ATI CYPRESS'': 82426.3 PMKs/s (RTT 2.4)
#2: 'CAL++ Device #2 'ATI JUNIPER'': 41805.7 PMKs/s (RTT 2.6)
#3: 'CPU-Core (SSE2)': 655.1 PMKs/s (RTT 3.0)
#4: 'CPU-Core (SSE2)': 691.0 PMKs/s (RTT 2.9)
#5: 'Network-Clients': 0.0 PMKs/s (RTT 0.0)

When I run a real test on 10 million passwords with:

localhost:~# time pyrit -e test -r wpa.cap -i list.txt attack_passthrough
Parsing file 'wpa.cap' (1/1)...
Parsed 5 packets (5 802.11-packets), got 1 AP(s)

Picked AccessPoint 00:0d:93:eb:b0:8c automatically...
Tried 10000000 PMKs so far; 87027 PMKs per second.

Password was not found.

real    2m9.549s
user    5m33.769s
sys     0m30.366s

The PC needs 129 sec to complete 10 million passwords, which means about 77500 PMKs/s.

I also did another test: create an ESSID, fill the database with passwords and run 
batch. Here is the result:

localhost:~# time pyrit batch

Connecting to storage at 'file://'...  connected.
Working on ESSID 'TEST'
Processed all workunits for ESSID 'TEST'; 104442 PMKs per second.d.

Batchprocessing done.

real    1m51.233s
user    4m29.477s
sys     0m34.406s

The PC needs 111 sec to complete 10 million passwords, which means about 90090 PMKs/s.
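
For reference, the two averages above can be recomputed from the exact `real` times printed by `time`. This is only a small sketch; the helper name `rate` is made up for illustration. These wall-clock averages presumably include start-up and file handling, which would explain why they are lower than the rates Pyrit prints itself.

```python
def rate(passwords, minutes, seconds):
    """Average passwords per second over the wall-clock ('real') time."""
    return passwords / (minutes * 60 + seconds)


passthrough = rate(10_000_000, 2, 9.549)   # attack_passthrough run (2m9.549s)
batch = rate(10_000_000, 1, 51.233)        # batch run (1m51.233s)
print(round(passthrough), round(batch))
```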

What is the expected output? What do you see instead?
The expected output is a rate of at least 120000 PMKs/s in all cases.

What version of the product are you using? On what operating system?
pyrit 276
linux 2.6.32-5-amd64

Please provide any additional information below.
calpp 0.87 (latest available)
ATI drivers 10.7
ATI stream 2.1

Note that with the previous installation with

-pyrit r250
-linux 2.6.26-2-amd64
-ATI driver 10.2 

the PMK rate was always >= 120000

Original issue reported on code.google.com by pyrit.lo...@gmail.com on 1 Aug 2010 at 2:17

Blocking: #261

GoogleCodeExporter commented 9 years ago

> What needs to be done is to implement a kind of triple-buffering between CPU 
and GPU. In this approach, the thread that steers the GPU runs for almost all 
of its lifetime without ever needing to acquire the GIL. 

The problem can't be solved by buffering: at the moment the CAL++ core already 
supports n-buffering ( without needing to acquire the GIL for buffered data ). And 
triple buffering was tested during v3 development ( as well as quadruple and more :) ).

The problem is that the main pyrit classes can't "produce" data fast enough. This 
isn't a latency issue. Multiple cards aren't even required to hit this CPU 
bottleneck. A 2.5 GHz Pentium Dual-Core CPU can't feed a single 5850 card during 
the benchmark.

The bottleneck is located in the "feed queue - queue management - taking data 
from queue" code. All of this processing is done in Python and in different 
threads. So if you want to stay with the current "design decisions", I really doubt 
it can be solved.

> So, there is a solution for those top 10% users of Pyrit with high-end 
hardware already in my mind :-)
In a year or two it won't be 10% but 90%. So in the future it might be a deal breaker 
for pyrit. And I really doubt that we have a solution.

Original comment by hazema...@gmail.com on 19 Oct 2010 at 12:12

GoogleCodeExporter commented 9 years ago

Can someone explain to me what GIL means? Thanks.

Original comment by pyrit.lo...@gmail.com on 19 Oct 2010 at 8:21

GoogleCodeExporter commented 9 years ago

Python is an interpreted language. That means the code written by the 
programmer is translated into an intermediate state and then interpreted under 
the oversight of an almighty interpreter-overlord. The overlord is omniscient, 
decides when an object's lifetime has ended, allocates and frees memory etc.
Things become more complicated when threads are involved. Threads may cause all 
sorts of problems as they can modify objects concurrently (e.g. thread 1 adding 
an object to a list of objects which thread 2 is currently deleting).
The programmers of CPython (the main implementation of Python, written in C) 
had to make sure that the interpreter-overlord always stays consistent. This is 
especially true because threads in CPython are real OS-threads which are almost 
not managed any further by the interpreter. The way to solve this was to 
introduce the Global Interpreter Lock (GIL): It ensures that only one thread at 
a time can run interpreted code or access the CPython-API. That way multiple 
threads can't get in each other's way and corrupt each other's state. 
Everyone has to wait to get in line.
The downside is that code written in Python can't execute CPU-bound code in 
multiple threads in parallel. Code written in C and called from Python however 
can - as long as it does not touch the CPython-API.

The GIL has been the source of a lot of controversy and statements like "you 
can't do threads in python". However the GIL is a very, very strong solution to 
a very complicated problem. On the other hand it is not free from drawbacks and 
unwanted side-effects. The way the lock (basically managed by the OS/glibc), 
threads (OS/glibc), signals (OS/glibc) and CPython itself work together can 
cause problems like priority-inversion.
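
The effect described above can be observed with a small experiment. The sketch below is not Pyrit code; `count_down`, `run_sequential` and `run_threaded` are names made up for illustration. It runs the same CPU-bound pure-Python function first sequentially and then in two threads. On a standard CPython build with the GIL, the threaded run is usually not faster; exact timings depend on the machine and interpreter build, so no numbers are claimed here.

```python
import threading
import time


def count_down(n):
    """CPU-bound work done entirely in interpreted Python code."""
    while n > 0:
        n -= 1
    return n


def run_sequential(n, parts):
    """Run `parts` chunks of work one after another in the current thread."""
    start = time.perf_counter()
    for _ in range(parts):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, parts):
    """Run `parts` chunks of work in separate OS threads."""
    threads = [threading.Thread(target=count_down, args=(n,))
               for _ in range(parts)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


sequential = run_sequential(200_000, 2)
threaded = run_threaded(200_000, 2)
print("sequential: %.3fs, two threads: %.3fs" % (sequential, threaded))
```

Replacing `count_down` with a call into C code that releases the GIL would be expected to change the picture, which is the distinction made above.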

Original comment by lukas.l...@gmail.com on 19 Oct 2010 at 9:08

GoogleCodeExporter commented 9 years ago

Sorry, I have now read the previous post (#45) carefully and I know what GIL means.

Original comment by pyrit.lo...@gmail.com on 19 Oct 2010 at 9:12

GoogleCodeExporter commented 9 years ago

I don't have deep knowledge of thread programming, so most probably I will say 
something wrong, but...

What about forcing (don't ask me why) pyrit to run as a single task, "one CPU, one 
GPU"?
I mean, I have a 4-core CPU and 2 single-GPU video cards: OK, I would run pyrit 
with some parameters such as:

pyrit --cpu=0 --gpu=0 -e test (and so on) &
pyrit --cpu=1 --gpu=1 -e test (and so on) &
pyrit --cpu=2 --gpu=none -e test (and so on) &
pyrit --cpu=3 --gpu=none -e test (and so on) &

This way I would force 4 instances of pyrit to run, and each of them should run at 
100% without disturbing (or being disturbed by) the other tasks.

This would be a workaround. Of course, the downside is that I have to 
split my dataset so that different tasks don't work on the same data. Moreover, 
with 4 HD5970 cards there would be 8 GPUs but only 4 CPU cores, so this trick would 
use only 50% of the available hardware power... but hey, it is just an idea :)
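
The manual split mentioned above could be done with a few lines of Python. This is only a sketch: `split_wordlist` is a hypothetical helper, not part of Pyrit, and the `--cpu`/`--gpu` options in the example above are a proposal, not existing Pyrit flags. The helper deals the passwords out round-robin so that each instance gets a disjoint share.

```python
import io


def split_wordlist(lines, parts):
    """Distribute passwords round-robin into `parts` disjoint lists."""
    chunks = [[] for _ in range(parts)]
    for index, line in enumerate(lines):
        password = line.rstrip("\n")
        if password:  # skip empty lines
            chunks[index % parts].append(password)
    return chunks


# Example with an in-memory word list; a real list would be read with open().
wordlist = io.StringIO("alpha\nbravo\ncharlie\ndelta\necho\n")
for number, chunk in enumerate(split_wordlist(wordlist, 2)):
    print(number, chunk)
```

Each chunk could then be written to its own file and passed to a separate instance with `-i`.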

Original comment by pyrit.lo...@gmail.com on 19 Oct 2010 at 9:50

GoogleCodeExporter commented 9 years ago

@comment 55 - It is a kind of solution, but it won't help in cases where the CPU 
can't feed even 1 GPU ( like for me now ). And it's possible that for a new GPU 
generation ( say, 2x faster ) no CPU will be fast enough to feed it. So some 
fundamental changes are required.

Original comment by hazema...@gmail.com on 19 Oct 2010 at 3:44

GoogleCodeExporter commented 9 years ago

I agree completely. Moreover, it is very common for people to update their GPUs 
more often than their CPUs. As a result someone may have an "old" (multi-core) CPU and 
a "top class" GPU. So it is very plausible that pyrit would need two cores for 
one GPU in such a case. hazeman11, can't you "add" some kind of benchmark cal++ 
plugin and put it in a separate binary? Then we could measure (in PMK/s) the 
difference between pyrit's results and a "possible C/C++ implementation".

Original comment by mmajchro...@gmail.com on 19 Oct 2010 at 4:41

GoogleCodeExporter commented 9 years ago

Issue 185 has been merged into this issue.

Original comment by lukas.l...@gmail.com on 19 Oct 2010 at 4:59

GoogleCodeExporter commented 9 years ago

#57, good suggestion. It may bring new ideas to Pyrit's main code-tree.

Original comment by lukas.l...@gmail.com on 19 Oct 2010 at 5:24

GoogleCodeExporter commented 9 years ago

> hazeman11 can't you "add" some kind of benchmark cal++ plugin and put it in a 
separate binary?
I've been thinking about it for 2-3 weeks now :). But I haven't had much time to do it. 

I'm also thinking about making a "null computing core" ( with infinite speed :) ) for 
pyrit. This would be good for estimating the max performance of pyrit's preprocessing 
part.

But I'll try to do something in a week or two :).

Original comment by hazema...@gmail.com on 19 Oct 2010 at 5:44

GoogleCodeExporter commented 9 years ago

#60, the "null computing core" is already there: cpyrit_null

It does exactly what you want it to do (nothing) and can easily be extended to 
simulate real work (e.g. yielding in its solve()-function to allow multiple 
instances of it running at a defined "speed" per instance, putting stress on 
the GIL).

You'll have to take out some safety-locks in CPyrit in order to initialize it 
as it corrupts your database right away :-)

Original comment by lukas.l...@gmail.com on 19 Oct 2010 at 6:22

GoogleCodeExporter commented 9 years ago

@comment 56:
At the moment, the fastest single-GPU card for pyrit/calpp is the HD5870. As I 
reported in the past, the HD5870 (1600 shader processors @ 850 MHz) has double the 
power of the HD5770 (800 shader processors @ 850 MHz), so the x2 PMK/s gain is 
linear with x2 SP: in other words, for now pyrit is still able to do its work 
(when it runs as a single task on a single GPU).
As far as I know, ATI has no plans to create a 3200 SP @ 850 MHz GPU, at least 
not before they move to 28nm technology (which means at least 12-18 months).
I don't mean "pyrit is perfect, no change required", but I want to say: "there 
is time to patch pyrit as a temporary solution and to rethink the whole 
structure of pyrit from the ground up for future x2 power GPUs".

Of course, it is up to lukas to decide which way to go.

Original comment by pyrit.lo...@gmail.com on 20 Oct 2010 at 8:25

GoogleCodeExporter commented 9 years ago

ATI 69xx cards are supposed to be available in <2 months. At the moment the 6870 
with ~1100 shaders is faster than the 5850 with 1440 shaders. So I'm not sure we 
have 12-18 months before a 2x speed-up.

Original comment by hazema...@gmail.com on 20 Oct 2010 at 12:05

GoogleCodeExporter commented 9 years ago

I agree with hazeman11, but let's assume that pyrit.lover is right. It would mean 
that for MONOCORE GPUs pyrit's architecture doesn't have to be changed in order 
to use their computational power. As a result we would have a program that 
supports multiple GPUs and network clients but works well only in single-GPU 
configurations... I thought that pyrit was developed to use the full computational 
power of the system (CPUs, GPUs) to calculate PMKs... Moreover, I believe this is 
just the beginning. More and more people who use pyrit will notice a low (or, as 
in our example, no) increase in brute-force speed after buying additional 
GPUs. There will be a lot of complaining, a lack of understanding of pyrit's/Python's 
nature and so on. In my opinion (if hazeman11's benchmark confirms it, of course) 
this is the last moment to change the current architecture. Lukas, isn't it possible 
to leave pyrit as it is and move only the "PMK management" to some C module?

Original comment by mmajchro...@gmail.com on 20 Oct 2010 at 3:19

GoogleCodeExporter commented 9 years ago

Working on an HD4850 is not the best way to see the structural limits of 
pyrit. I think the time has come for lukas to open a PayPal account, so we can 
donate money to let him buy a couple of high-end ATI cards. He deserves them: 
a donation of 10 euros will not kill any of us....

Original comment by pyrit.lo...@gmail.com on 20 Oct 2010 at 3:29

GoogleCodeExporter commented 9 years ago

I am willing to perform tests on your hardware. I know lukas was interested but 
we didn't quite discuss any details yet. Anyway I am willing to help :)

Original comment by mmajchro...@gmail.com on 20 Oct 2010 at 3:32

GoogleCodeExporter commented 9 years ago

Sorry I mean "our hardware" not your ;) Typo :)

Original comment by mmajchro...@gmail.com on 20 Oct 2010 at 3:33

GoogleCodeExporter commented 9 years ago

@63: I have learned that the value of "if", "maybe", "supposed" and so on is less 
than zero.
Have you tested pyrit on these 6870s, or did you just read some web 
site? I read some web sites' reports, but I don't trust them at all (I still remember 
all the fucking hype they did for FERMI...)

Original comment by pyrit.lo...@gmail.com on 20 Oct 2010 at 4:11

GoogleCodeExporter commented 9 years ago

There should not be any dogmas involved in free software and Pyrit is no 
different. For any open-source project of a certain size the time comes when it 
grows out of the reach of its original developer to oversee all aspects of 
design and implementation. For any contributor, just as for myself, this may 
however involve developing "into the blue" and work which may or may not end up 
as a solution in Pyrit's source-tree. This is no different from any other 
non-trivial open-source project.

The unquestionable core about what I created and called "Pyrit" are Python, 
"free as in freedom", an (aspired) quality of the code and constraints in the 
conflict between de-facto being a "hacking tool" and a technological project 
(see the second clause from the bottom on the main page). Within these 
boundaries, I'm perfectly willing to discuss and accept changes and new 
developments.

The bottom line is: Pyrit needs your suggestions; it is an open project. But we 
also need horsepower on the road with people being able to outline designs and 
writing actual code. I neither can do all the "thinking" nor all the coding on 
my own. This is especially true because I'm perfectly able to be proven just 
wrong about things!

@pyrit.lover: Accepting donations is a difficult topic for free (as in freedom) 
open-source projects. During my time with Pyrit, I've already turned down 
several offers of donations or paid, specific work on it. Money is a game 
changer that - once involved - gives a completely different taste to everything 
here. Right now I'm perfectly able to accept, turn down or just ignore all 
contributions made to the project. Accepting money (or anything of value) would 
change that while I'd like to keep it as it is (especially the "ignore" part 
:-)).

Original comment by lukas.l...@gmail.com on 20 Oct 2010 at 7:22

GoogleCodeExporter commented 9 years ago

Well, pyrit is more or less reaching its final limit, because of shortcomings in 
Python or because it is an interpreted language and so on. As reported in these 
posts, hardware keeps growing and pyrit will not be able to keep up with it. Moving 
to pure C does not seem to be the path to follow.... so, what else? What is the 
plan? I ask because I am worried this software will not be able to grow... I am not 
complaining; I feel like an uncle who worries because his nephew does not study 
enough at school, but who hopes the nephew will win the Nobel Prize for Medicine in 
the future.

Original comment by pyrit.lo...@gmail.com on 27 Oct 2010 at 4:50

GoogleCodeExporter commented 9 years ago

Issue 208 has been merged into this issue.

Original comment by hazema...@gmail.com on 20 Nov 2010 at 1:50

GoogleCodeExporter commented 9 years ago

Wouldn't it be a temporary solution for people having twice the amount of 
physical cpu cores than gpu cores to split the computing/preparation task? In 
this case, 4 cpu cores would be used to handle 2 highend gpu cores...

Original comment by kopierschnitte@googlemail.com on 20 Nov 2010 at 8:50

GoogleCodeExporter commented 9 years ago

Can we rule out every kind of file I/O bottleneck in this issue? I know this 
has been questioned before, but I have done another set of tests while watching 
iostat and I've noticed over 5000 tps (with peaks above 7000). I don't 
think regular SATA HDDs could handle this heavy load...
In addition, there's a high iowait% while running the batch command.

Original comment by kopierschnitte@googlemail.com on 28 Nov 2010 at 7:09

GoogleCodeExporter commented 9 years ago

@73 Just a suggestion: did you try putting the source data in /dev/shm? It is a 
RAM disk, and nothing could be faster than that.

Original comment by pyrit.lo...@gmail.com on 29 Nov 2010 at 1:40

GoogleCodeExporter commented 9 years ago

About speeding up pyrit: I see there is
http://morepypy.blogspot.com/2010/11/pypy-14-ouroboros-in-practice.html
Maybe we can test pyrit with it to see if it is possible to get better 
performance than with python2.6.

Original comment by pyrit.lo...@gmail.com on 29 Nov 2010 at 1:42

GoogleCodeExporter commented 9 years ago

@74: Ok, done that ... No improvements spotted :-(

Original comment by kopierschnitte@googlemail.com on 1 Dec 2010 at 5:22

GoogleCodeExporter commented 9 years ago

I don't know how many times I have to repeat myself. The problem is with 
pyrit's core (lack of multi-core support)... That's the main reason for all the 
performance problems. Of course you may minimize the impact by "speeding up" 
other parts of pyrit, but that's not the solution. In our test environment we 
have checked different configurations (hardware and software) and the "scaling 
problem" is the main issue.

Original comment by mmajchro...@gmail.com on 2 Dec 2010 at 12:15

GoogleCodeExporter commented 9 years ago

I've always read and understood your comments and yes, I'm also aware of the 
performance issues caused by Python. I was only reporting about a high IO-load 
when using pyrit with a "real" database instead of synthetic benchmarks. Under 
different circumstances I would really suspect that a 20% iowait state would 
slow things down dramatically but this wasn't the cause in this case.

Original comment by kopierschnitte@googlemail.com on 2 Dec 2010 at 4:02

GoogleCodeExporter commented 9 years ago

Has anybody run any recent tests? Any improvements with the new pyrit 0.3.0, kernel 
2.6.33-35 and ati drivers 10.11?

Original comment by elec...@gmail.com on 9 Dec 2010 at 3:06

GoogleCodeExporter commented 9 years ago

Tested it with kernel 2.6.35-26 with ati 10.11 and pyrit 0.4.0 svn ... no 
improvements found (as expected). I am still getting 140k PMKs during benchmark 
and 70k under "real" conditions. But, if you follow the entire thread, neither 
the kernel nor the ati drivers will fix this problem.

Original comment by kopierschnitte@googlemail.com on 9 Dec 2010 at 5:45

GoogleCodeExporter commented 9 years ago

Out of interest, what is the performance drop when using cowpatty passthrough?

Original comment by james0p0...@googlemail.com on 13 Dec 2010 at 5:07

GoogleCodeExporter commented 9 years ago

As I'm using passthrough (but without cowpatty), the performance drop is 
exactly 50%. For me, it doesn't matter if I use passthrough or attack_db.

Original comment by kopierschnitte@googlemail.com on 13 Dec 2010 at 5:25

GoogleCodeExporter commented 9 years ago

@kopierschnitte
Try setting limit_ncpus to the number of GPUs you have.
I mean, if you have a 4-core CPU and an HD5970 (which has 2 GPUs),
then set limit_ncpus = 2.
This should keep pyrit from running 4 threads for the 4 CPU cores you have.
Please try it and report whether you get better results.

Original comment by pyrit.lo...@gmail.com on 10 Jan 2011 at 11:21

GoogleCodeExporter commented 9 years ago

Yes, I have done this a few months ago and the "real" results (not benchmark) 
are a little bit better when limiting the number of cpu cores to 2 (or even 
1!). Currently, I get ~70k PMK/s without limiting and ~85k PMK/s with limiting 
the number of cores to 1. When using limit_ncpus = 1 I get a few thousand PMKs 
less.

But at the moment, I feel like the speed varies from day to day :-(

Original comment by kopierschnitte@googlemail.com on 10 Jan 2011 at 6:57

GoogleCodeExporter commented 9 years ago


Hi. How could I improve my benchmark? On evga680i + intel q6600 + hd5970 I get:

Computed 64974.07 PMKs/s total.
#1: 'CAL++ Device #1 'ATI CYPRESS'': 33718.4 PMKs/s (RTT 3.4)
#2: 'CAL++ Device #2 'ATI CYPRESS'': 36587.5 PMKs/s (RTT 2.5)
#3: 'CPU-Core (SSE2)': 359.0 PMKs/s (RTT 3.4)
#4: 'CPU-Core (SSE2)': 393.4 PMKs/s (RTT 3.3)

Can someone write some tips, like: run without a graphical interface, run on x64 linux, overclock the CPU or GPU, or... limit CPUs? By the way, where can I set limit_ncpus? I don't want to break a record :) I just feel I can get more from my PC.
To set up pyrit I used this tutorial:
http://www.backtrack-linux.org/forums/backtrack-howtos/33227-ati-driver-|-stream-sdk-2-2-opencl-1-1-|-cal-|-cpyrit_calpp-|.html

Thanks for your time.

Original comment by minde.pi...@gmail.com on 25 Jan 2011 at 3:03

GoogleCodeExporter commented 9 years ago

Some time ago I noticed that a bigger buffer size means slower 
execution. Most probably Python has problems with a huge number of memory 
allocations. I think it could be possible to speed up pyrit by looking 
carefully at the queue management code. Maybe different data structures would 
reduce the burden on the CPU. But the Python code is Lukas's baby, so I won't do 
anything there.
For now I'm letting you know that the calpp core now uses a smaller buffer size by 
default. So don't be surprised to see RTT values around 1.0. Also, the new version 
has small improvements which should give some speedup.

Original comment by hazema...@gmail.com on 4 Feb 2011 at 3:53

GoogleCodeExporter commented 9 years ago

My idea is to completely change the way we do scheduling to the following, a 
generalization of how the CALPP-plugin operates:

- The main scheduler class is a thread of its own. The class takes passwords 
into a queue and actively sends portions of them to the devices by calling an 
enqueue-function on the device-class. Notice that the current design is a 
pull-layout where the devices call the scheduler; the new design is a 
push-layout where the devices get actively called.

- The device-class has an internal queue that is protected by its own lock. 
The enqueue-function prepares a set of passwords by doing the first round of 
HMAC on the CPU and transferring the result to the device-memory. These prepared 
workunits are kept in the device-queue.

- Every device has its own thread that lets go of the GIL, locks on the 
queue, gets workunits and can immediately execute the kernel. The device-thread 
does not have to re-acquire the GIL as long as its queue is filled.

- The scheduler polls the device from time to time (e.g. every 100ms) to get 
back finished workunits, reconstructs the original workunit-layout (e.g. one 
block of a thousand passwords may have been spread over several devices) and 
returns the result on a call to its dequeue-function. The polling does in fact 
introduce some lag; we can handle this however as the device keeps executing 
during that time.

This approach solves the two problems that I think the current design has: 
First, the preparation of a workunit and transfer from and to the device can be 
truly coalesced with execution time on the GPU. Second, acquiring the GIL 
introduces an uncertain amount of lag until it becomes available. The new 
layout allows the device-thread to let go of the GIL and basically only 
re-acquire it when it runs out of work and needs to return to the python 
interpreter when shutting down. The time it takes to cycle execution of the 
kernel is basically zero.
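
The push-layout above can be sketched in plain Python. This is only an illustration under several assumptions, not Pyrit code: the class names `Device` and `Scheduler`, the `solve` method and the SHA-1 "kernel" are all made up, the scheduler is not a thread of its own here, and the device thread still runs under the GIL (in the design above it would be C code that lets go of it). The sketch only shows the data flow: the scheduler pushes slices to per-device queues, the device threads work through them, and the scheduler polls for results and restores the original order.

```python
import hashlib
import queue
import threading


class Device(threading.Thread):
    """Stand-in for a GPU: owns its queues and a worker thread."""

    def __init__(self, name):
        super().__init__(name=name, daemon=True)
        self.pending = queue.Queue()    # prepared workunits (Queue locks itself)
        self.finished = queue.Queue()   # results waiting to be polled

    def enqueue(self, unit_id, passwords):
        # Placeholder for the CPU-side preparation (first HMAC round, upload).
        prepared = [pw.encode("utf-8") for pw in passwords]
        self.pending.put((unit_id, prepared))

    def run(self):
        while True:
            item = self.pending.get()   # blocks until work arrives
            if item is None:            # shutdown marker
                break
            unit_id, prepared = item
            # Placeholder for the GPU kernel.
            results = [hashlib.sha1(pw).hexdigest() for pw in prepared]
            self.finished.put((unit_id, results))


class Scheduler:
    """Pushes slices of a password block to devices and reassembles results."""

    def __init__(self, devices):
        self.devices = devices
        for device in devices:
            device.start()

    def solve(self, passwords, slice_size=2):
        slices = [passwords[i:i + slice_size]
                  for i in range(0, len(passwords), slice_size)]
        for unit_id, chunk in enumerate(slices):
            # Push model: the scheduler calls the device, not the reverse.
            self.devices[unit_id % len(self.devices)].enqueue(unit_id, chunk)
        collected = {}
        while len(collected) < len(slices):
            for device in self.devices:
                try:
                    # Poll with a timeout, like the 100ms mentioned above.
                    unit_id, results = device.finished.get(timeout=0.1)
                except queue.Empty:
                    continue
                collected[unit_id] = results
        # Restore the original order of the block.
        return [r for unit_id in sorted(collected) for r in collected[unit_id]]

    def shutdown(self):
        for device in self.devices:
            device.pending.put(None)
        for device in self.devices:
            device.join()


scheduler = Scheduler([Device("gpu0"), Device("gpu1")])
words = ["alpha", "bravo", "charlie", "delta", "echo"]
hashes = scheduler.solve(words)
scheduler.shutdown()
print(len(hashes))
```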

Original comment by lukas.l...@gmail.com on 4 Feb 2011 at 8:08

GoogleCodeExporter commented 9 years ago

I was checking oclHashcat on our machine. If I understand correctly, to compute 
a single PMK we have to perform 4096 SHA1 computations? oclHashcat was able to 
compute from 8,000M SHA1/s to 17,000M SHA1/s. Even if computing a single PMK 
requires 8*4096 SHA calculations, I should still be able to get around 244K PMK/s.

Original comment by mmajchro...@gmail.com on 9 Feb 2011 at 8:17

GoogleCodeExporter commented 9 years ago

It is actually 16,384 rounds of SHA1 per key. You are very welcome to supply a 
better SHA1-kernel and solve issue 66.
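
One way to arrive at the 16,384 figure, offered as a plausible accounting rather than a statement about Pyrit's kernel: the PMK is PBKDF2-HMAC-SHA1 over the passphrase with the ESSID as salt, 4096 iterations and 32 bytes of output. SHA-1 yields 20 bytes, so two PBKDF2 blocks are needed, and each HMAC costs two SHA-1 calls once the key pads are precomputed: 4096 * 2 * 2 = 16,384. The sketch below uses the standard library's `hashlib.pbkdf2_hmac`; the names `compute_pmk` and `pmk_rate` are made up, and the rate estimate assumes the oclHashcat numbers count one SHA-1 call each and ignores all overhead.

```python
import hashlib


def compute_pmk(passphrase, essid):
    """Derive the 32-byte PMK with PBKDF2-HMAC-SHA1 (4096 iterations)."""
    return hashlib.pbkdf2_hmac("sha1", passphrase.encode("utf-8"),
                               essid.encode("utf-8"), 4096, 32)


ITERATIONS = 4096
PBKDF2_BLOCKS = 2      # 32 bytes of output need two 20-byte SHA-1 blocks
SHA1_PER_HMAC = 2      # inner and outer hash, key pads precomputed
SHA1_PER_PMK = ITERATIONS * PBKDF2_BLOCKS * SHA1_PER_HMAC


def pmk_rate(sha1_per_second):
    """Rough upper bound on PMKs/s for a given raw SHA-1 rate."""
    return sha1_per_second / SHA1_PER_PMK


pmk = compute_pmk("password", "test")
print(len(pmk), SHA1_PER_PMK)
print(round(pmk_rate(8_000e6)), round(pmk_rate(17_000e6)))
```

Under these assumptions the quoted 8,000M to 17,000M SHA1/s would correspond to very roughly 490K to 1,040K PMK/s.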

Original comment by lukas.l...@gmail.com on 9 Feb 2011 at 8:27

GoogleCodeExporter commented 9 years ago

What is the point of optimizing the kernel source code if on my machine pyrit is 
unable to fully utilize the current one? I have no actual way of testing it :)

Original comment by mmajchro...@gmail.com on 12 Feb 2011 at 9:32

GoogleCodeExporter commented 9 years ago

@comment 88
I'm not saying that the changes you want to make aren't good. It's probably a better 
design for the task pyrit is now facing. But I don't think it will solve the 
problem. All your changes are designed mostly to solve scalability issues on 
multi-core systems.

The really big issue is that the queue management code ( or the main 
scheduler class in the new design ) is simply too slow. And the problem exists 
even on a 1-core system.

I've done some tests with a pseudo-null core ( a stripped calpp core ).

fully null core ( just taking data from the python structures )
On a 2.5GHz Pentium pyrit achieved 440K pmk/s.

partial null core ( taking data + initial data preparation done on the CPU )
On a 2.5GHz Pentium pyrit achieved 150K pmk/s.

Changing the number of null cores doesn't change anything - pyrit always achieves 
the same speed. The number of CPU cores used is also always the same ( only 1 core 
is used - the other cpu cores are idle ) - this shows that the GIL is an issue for 
pyrit.

So in the ideal case ( no data preprocessing on the CPU ) pyrit can do 440K pmk/s - 
as a password can have a max length of 64B, this gives <28MB/s. This is a really bad 
value. I think that the queue handling code and data structures need a big redesign.

So back to the changes you proposed. If you just move the current queue handling 
code into its own independent thread ( without GIL issues ), you can expect a 
~3x speedup. I don't think that's enough to make pyrit future-proof.

Original comment by hazema...@gmail.com on 12 Feb 2011 at 10:17

GoogleCodeExporter commented 9 years ago

We can probably do much better than that using better data structures. There is 
too much cut/paste and casting going on; also, we don't need every password to 
live as its own object but can use a more optimized container object. All of 
this, however, is a secondary task.

Also remember that the time on the gpu and time on the cpu are truly 
independent with this design. If the gpu has three seconds of time before it 
needs the cpu again, we can actually handle 440*3 (to stay with your example).

For going beyond that, I think Pyrit's database / processing design needs a 
completely different layout.

Original comment by lukas.l...@gmail.com on 12 Feb 2011 at 10:58

GoogleCodeExporter commented 9 years ago

> we can probably do much better than that using better data structures. There 
is too much cut/paste and casting going on; also, we don't need every password 
to live as its own object but can use a more optimized container object. All 
of this, however, is a secondary task.

For me it's not a secondary task. It's the primary one. For me the new architecture 
must be able to sustain the new GPUs that will be available quite soon. It should 
also allow squeezing all the performance out of current GPUs. The changes you 
propose won't cut it.

> Also remember that the time on the gpu and time on the cpu are truly 
independent with this design. If the gpu has three seconds of time before it 
needs the cpu again, we can actually handle 440*3 (to stay with your example).

First of all, they aren't. GPU drivers require quite a lot of CPU cycles ( at 
least ATI drivers ). So the final performance is much lower than that. Besides, 
the ~3x I'm talking about isn't 440*3 but 150*3. If you do some analysis 
you will see that the new design can achieve only as much as the current design 
with a fully null core. And 440K pmk/s as a limit for pyrit isn't much. Also, if 
you include other losses ( due to drivers, etc. ) it's quite obvious that the new 
design will achieve much less than 440K. And like I said, imho it's really not 
enough to make pyrit future-proof.

Original comment by hazema...@gmail.com on 13 Feb 2011 at 12:34

GoogleCodeExporter commented 9 years ago

I'm testing some changes to the preprocessing loop in computing cores. At the 
moment I see ~180% of CPU usage with 2 computing cores. This "fix" also allows 
some ( not optimal ) scaling with increasing number of CPU cores.

The current preprocessing loop in all computing cores looks like this:

start of core solve function/block all python threads
while data available do
  take data from python 
  openssl computations ( really time consuming )
done
unblock threads
start gpu/cpu computations
end of solve

The version I'm testing now is

start of core solve function/block all python threads
while data available do
  take N data from python to C array
  unblock python threads
  do N openssl computations from C array 
  block python threads
done
unblock threads
start gpu/cpu computations
end of solve

The problem for now is the selection of N. I'm achieving good results with N>10000 
( acquiring the GIL is obviously time consuming ). But I'm not sure such a big value 
will fit all CPUs. 

This change also solves the "problem" of big buffers. In the current solution, 
preprocessing of huge buffers for fast GPUs blocks the queue management/data 
gathering thread for too long - causing some strange interaction which 
translates into reduced performance.
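
The batching idea can be imitated in plain Python to show its shape. This is only an analogy with assumed names, not the calpp code: an ordinary `threading.Lock` called `interpreter_lock` stands in for the GIL, `preprocess` stands in for the OpenSSL step, and `solve_batched` is a made-up function. In the real core this would be done in C, releasing and re-acquiring the GIL around the computation.

```python
import hashlib
import threading

interpreter_lock = threading.Lock()   # stand-in for the GIL in this sketch


def preprocess(password):
    """Placeholder for the expensive OpenSSL step."""
    return hashlib.sha1(password).digest()


def solve_batched(source, batch_size):
    """Hold the lock only while copying a batch, not while hashing it."""
    prepared = []
    position = 0
    while position < len(source):
        with interpreter_lock:
            # "take N data from python to C array"
            batch = list(source[position:position + batch_size])
        position += len(batch)
        # Lock released: other threads may run while this batch is hashed.
        prepared.extend(preprocess(pw) for pw in batch)
    return prepared


passwords = [("password%d" % i).encode("ascii") for i in range(10)]
result = solve_batched(passwords, batch_size=4)
print(len(result))
```

A larger `batch_size` means fewer lock acquisitions per password, which is the trade-off behind the choice of N discussed above.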

Original comment by hazema...@gmail.com on 13 Feb 2011 at 11:35

GoogleCodeExporter commented 9 years ago

Have you finished your pure C/C++ cal benchmark? We were discussing it some 
time ago. In that way we would be able to just check on few machines how much 
faster pyrit needs to be...

Original comment by mmajchro...@gmail.com on 13 Feb 2011 at 12:21

GoogleCodeExporter commented 9 years ago

Hi.
I have prepared two versions of SHA1. Both are based on pyrit's one. I was 
testing them using my benchmark (calculating 633328 hashes 5 times). Kernel 
bak_sha1_normal.cl was 25% faster on my machine than the original one, whereas 
bak_sha1_int4.cl was 33% faster. The kernels are modified for benchmarking 
purposes, so they will not work out of the box with pyrit. Anyway, I wanted to show 
you my ideas. Maybe someone will make them even faster :)

Original comment by mmajchro...@gmail.com on 21 Apr 2011 at 9:01

GoogleCodeExporter commented 9 years ago

Guys we MUST fix this issue to make pyrit a powerful tool in the future. Forget 
about legacy devices and the like. People who do pyrit will most likely get 
themselves a nice 6990 and go from there. 

I also suggest leaving donations for Lukas so he can work with this expensive 
hardware too. 

PERFORMANCE MUST SCALE ON HIGH END HARDWARE !
DONATIONS MUST BE ACCEPTED ! 

Thank you very much !

Original comment by jukanma...@gmail.com on 10 May 2011 at 12:46

GoogleCodeExporter commented 9 years ago

Guys, is there any news?
This problem is becoming serious and pyrit has not been updated in the last 6 months: 
maybe a massive rewrite of the code is ongoing? Please inform us.

Original comment by pyrit.lo...@gmail.com on 4 Nov 2011 at 2:58

pouliot / pyrit

Pyrit does not scale well for multiple GPUs #173