klmr92 / uguu

Automatically exported from code.google.com/p/uguu

Scanning algorithm optimization #19

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
If we ever want to scan multiple hosts in parallel, we first need some quick mechanism to check whether a database update is needed at all, because right now slowgres is the bottleneck. Updating the database is three times slower than scanning over the network and keeps the CPU at 100% the whole time. And the database is still very small!

For example, this could be done by storing the last scanner output for every share and splitting the scanning process into two stages: first the low-level scanner is invoked and its output is saved to a file, then the new file is compared with the previous one, and only if they differ is the database update performed in the usual way.
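A minimal sketch of that two-stage idea, assuming a Python scanner driver; the scanner command, dump directory, and update_database helper are placeholders for illustration, not the actual uguu code:

```python
import os
import subprocess

DUMP_DIR = "/var/lib/uguu/dumps"   # assumed location for saved scanner outputs

def scan_share(share_addr):
    # Stage 1: run the low-level scanner and capture its raw output.
    output = subprocess.run(["smbscan", share_addr],   # placeholder scanner command
                            capture_output=True, check=True).stdout
    dump_path = os.path.join(DUMP_DIR, share_addr.replace("/", "_"))

    # Stage 2: compare with the previous output and skip the slow database
    # update when nothing has changed.
    if os.path.exists(dump_path):
        with open(dump_path, "rb") as f:
            if f.read() == output:
                return False               # share unchanged, no database work
    with open(dump_path, "wb") as f:
        f.write(output)
    update_database(share_addr, output)    # existing (slow) update path, assumed name
    return True
```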

Original issue reported on code.google.com by ruslan.savchenko on 9 Feb 2010 at 1:32

GoogleCodeExporter commented 9 years ago
If we are not going to compute a complete diff of the scanner output (that is not a simple task, and not every diff can be applied to the database easily anyway), then saving the entire scan is not required: we only need an MD5 sum of it.

In which stage is most of the CPU used: while the INSERT queries are executing, or only when committing? The latter would mean we don't need to save anything at all. I'd prefer to avoid saving scan dumps before parsing just to calculate the MD5 sum: it can be computed during the parse.
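A minimal sketch, assuming line-oriented scanner output and a hypothetical parse_line helper, of computing the MD5 sum during the parse rather than from a saved dump:

```python
import hashlib

def parse_scan(lines, previous_md5):
    digest = hashlib.md5()
    entries = []
    for line in lines:
        digest.update(line.encode("utf-8"))
        entries.append(parse_line(line))   # parse_line stands in for the existing parser
    new_md5 = digest.hexdigest()
    # The parsed entries are buffered so the whole database update can be
    # skipped when the hash matches the previous scan.
    if new_md5 == previous_md5:
        return None, previous_md5
    return entries, new_md5
```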

By the way, we could also significantly increase performance by:
1. joining several SQL queries into a single one (i.e. batching queries up to a total length of >= 64K)
2. using prepared queries (if psycopg2 allows them)
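For the batching point, a hedged sketch with psycopg2 (table and column names are made up for illustration). psycopg2 does not expose server-side prepared statements directly, but executemany() or psycopg2.extras.execute_values() (available in newer versions) folds many rows into far fewer statements and round trips:

```python
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=uguu")    # placeholder DSN
cur = conn.cursor()

rows = [(1, "dir/a.txt", 123), (1, "dir/b.txt", 456)]   # (share_id, path, size)

# Folds the rows into a single multi-row INSERT, close to the
# "join queries up to 64K" idea above.
execute_values(
    cur,
    "INSERT INTO files (share_id, path, size) VALUES %s",
    rows,
    page_size=1000)
conn.commit()
```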

Original comment by radist...@gmail.com on 9 Feb 2010 at 1:50

GoogleCodeExporter commented 9 years ago
Thanks for the hashsum idea - my thoughts were carried away thinking about how diffing trees could be done.

As for the other ideas, do you really believe that connecting to the database and query parsing matter when each filename requires a lookup in the entire filenames table (through a hash-type index), each filename insertion requires a tsvector calculation, each file insertion requires a UNIQUE check, and each path insertion requires a unique check and a tsvector calculation?
Do you really believe that?
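For context, the per-file work being described might look roughly like this (hypothetical schema and column names, not the actual uguu tables); each call pays for a filenames lookup, possibly a to_tsvector() computation, and UNIQUE constraint checks:

```python
def insert_file(cur, share_id, path, name, size):
    # Filename lookup: a probe of the filenames table through its hash index.
    cur.execute("SELECT id FROM filenames WHERE name = %s", (name,))
    row = cur.fetchone()
    if row:
        filename_id = row[0]
    else:
        # New filename: pays for a tsvector calculation on insert.
        cur.execute(
            "INSERT INTO filenames (name, tsname) "
            "VALUES (%s, to_tsvector(%s)) RETURNING id",
            (name, name))
        filename_id = cur.fetchone()[0]
    # The file row itself pays for a UNIQUE constraint check.
    cur.execute(
        "INSERT INTO files (share_id, path, filename_id, size) "
        "VALUES (%s, %s, %s, %s)",
        (share_id, path, filename_id, size))
```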

Original comment by ruslan.savchenko on 9 Feb 2010 at 3:02

GoogleCodeExporter commented 9 years ago
>Do you really believe that?

It appears you were right. Now the database update time is close enough to the host scanning time.

And we use a hash heuristic as well.

Close.

Original comment by ruslan.savchenko on 10 Feb 2010 at 2:02