Closed GoogleCodeExporter closed 9 years ago
If we are not going to compute a complete diff of the scanner output (it's not a
simple task, and anyway not every diff could simply be passed to the database),
then saving the entire scan is not required: we only need its MD5 sum.
At which stage is most CPU used: while the insert queries are executing, or
together with committing? The latter would mean that we need to save nothing.
I'd prefer to avoid saving scan dumps before parsing just to calculate the MD5
sum: that could be done during parsing.
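A minimal sketch of hashing during the parse, so that no dump has to be written first. The names parse_scan and handle_line are hypothetical, and the line-oriented input is an assumption about the scanner output:

```python
import hashlib
import io


def parse_scan(stream, handle_line):
    """Parse scanner output line by line, hashing the raw bytes on the way.

    stream      -- any iterable of bytes lines (a pipe, a socket file, ...)
    handle_line -- hypothetical callback that does the real parsing work
    """
    md5 = hashlib.md5()
    for raw in stream:
        md5.update(raw)  # hash exactly the bytes that were read
        handle_line(raw.decode("utf-8", "replace").rstrip("\n"))
    return md5.hexdigest()


if __name__ == "__main__":
    data = io.BytesIO(b"pub/a.txt\npub/b.txt\n")
    print(parse_scan(data, print))
```

The returned digest can then be compared with the one stored for the previous scan of the same host, and the database update skipped if they match.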
By the way, we could also significantly increase performance by:
1. joining several SQL queries into a single one (i.e. joining queries up to a
total length >= 64K)
2. using prepared queries (if psycopg2 allows them)
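A rough sketch of point 1, grouping rows into multi-row INSERT statements of roughly bounded length. The table files and its columns path and size are made up for illustration, and the length is only an estimate made before parameter substitution:

```python
def batched_inserts(rows, max_len=64 * 1024):
    """Yield (sql, params) pairs, each one multi-row INSERT.

    The table and column names are hypothetical. Each statement is cut
    when its estimated length would exceed max_len characters.
    """
    head = "INSERT INTO files (path, size) VALUES "
    row_sql = "(%s, %s)"
    parts, params, length = [], [], len(head)
    for row in rows:
        # Estimate: placeholder group, separator, and the values themselves.
        cost = len(row_sql) + 2 + sum(len(str(v)) for v in row)
        if parts and length + cost > max_len:
            yield head + ", ".join(parts), params
            parts, params, length = [], [], len(head)
        parts.append(row_sql)
        params.extend(row)
        length += cost
    if parts:
        yield head + ", ".join(parts), params
```

With psycopg2 each pair could be passed to cursor.execute(sql, params), since %s is the placeholder style it uses. Whether this helps depends on where the time really goes.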
Original comment by radist...@gmail.com
on 9 Feb 2010 at 1:50
Thanks for the hash sum idea. My thoughts were carried away thinking about how
diffing trees could be done.
As for the other ideas, do you really believe that the database connection and
query parsing matter while each filename requires the entire filenames table to
be scanned (with a hash index), each filename insertion requires a tsvector
calculation, each file insertion requires a UNIQUE check, and each path
insertion requires a unique check and a tsvector calculation?
Do you really believe that?
Original comment by ruslan.savchenko
on 9 Feb 2010 at 3:02
>Do you really believe that?
It appears you were right. Now the database update time is close enough to the
host scanning time.
And we use a hash heuristic as well.
Close.
Original comment by ruslan.savchenko
on 10 Feb 2010 at 2:02
Original issue reported on code.google.com by
ruslan.savchenko
on 9 Feb 2010 at 1:32