savrus / uguu

Automatically exported from code.google.com/p/uguu

Shares with the same tree should be scanned consecutively #47

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When shares with the same file tree are detected, their next_scan times should be
adjusted so that these shares are next scanned at almost the same time. This may
increase the probability of their tree_id staying the same.

Trees with size = 0 should be excluded from this case.
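For illustration, a rough sketch of what this alignment could look like against the
shares table (table and column names follow the queries discussed later in this
thread; the connection parameters and the example tree_id are assumptions, not uguu
code):
{{{
# Hedged illustration only, not the uguu implementation.
import psycopg2

conn = psycopg2.connect("dbname=uguu")       # assumed connection parameters
cursor = conn.cursor()
tree_id = 42                                 # example tree to align

# Give every non-empty share of the tree the earliest next_scan found among
# them, so they all come up for rescanning at almost the same time.
cursor.execute("""
    UPDATE shares
       SET next_scan = sub.min_scan
      FROM (SELECT min(next_scan) AS min_scan
              FROM shares
             WHERE tree_id = %(t)s AND size > 0) AS sub
     WHERE tree_id = %(t)s AND size > 0
    """, {'t': tree_id})
conn.commit()
}}}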

Original issue reported on code.google.com by ruslan.savchenko on 13 Apr 2010 at 7:28

GoogleCodeExporter commented 9 years ago
I think we should just reschedule next_scan on hash change. It could be done in the
except block at spider.py, line 211.
The only problem is that this deviates from the next_scan displayed in the web
interface, so it should be documented on the FAQ page (Q: I have SMB and FTP servers
with the same content. Why does one of them sometimes undergo rescanning before the
time declared on the share page? A: Shares with the same content are rescanned out
of schedule if one of them has changed).

Original comment by radist...@gmail.com on 13 Apr 2010 at 9:45

GoogleCodeExporter commented 9 years ago
> I think we should just reschedule next_scan on hash change.

Then write the exact SQL command. I've already tried, but the result was
unsatisfying. Keep in mind that trees can converge and diverge, and the scanning
time is unknown a priori.

Original comment by ruslan.savchenko on 13 Apr 2010 at 10:03

GoogleCodeExporter commented 9 years ago
"UPDATE shares SET next_scan=now() WHERE tree_id=%s AND size>0" % oldtree_id,
but this query requires additional update for next_scan at spider.py, lines 
127, 181
and 199. The latter could be done with additional query "UPDATE shares SET
next_scan=now()+interval %(i)s WHERE share_id=%(s)s AND next_scan<now()" to 
prevent
double rewriting for next_scan, but I don't think that setting next_scan after
successful scan (and optional update) is wrong.
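For reference, a hedged sketch of the two statements above as psycopg2 calls,
reusing the thread's `cursor`; the variable names and the one-day interval are
assumptions, not uguu code:
{{{
from datetime import timedelta

# Push every non-empty share of the old tree back to the front of the queue.
cursor.execute("UPDATE shares SET next_scan = now() "
               "WHERE tree_id = %(t)s AND size > 0",
               {'t': oldtree_id})

# Variant of the post-scan update (spider.py lines 127, 181, 199) that only
# touches next_scan while it is still in the past, preventing the double
# rewriting mentioned above.
cursor.execute("UPDATE shares SET next_scan = now() + %(i)s "
               "WHERE share_id = %(s)s AND next_scan < now()",
               {'i': timedelta(days=1), 's': share_id})
}}}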

Original comment by radist...@gmail.com on 13 Apr 2010 at 10:17

GoogleCodeExporter commented 9 years ago
> "UPDATE shares SET next_scan=now() WHERE tree_id=%s AND size>0" % oldtree_id
It doesn't work when 3 spiders is running at the same time though.

Original comment by radist...@gmail.com on 13 Apr 2010 at 10:21

GoogleCodeExporter commented 9 years ago
Maybe we need some configurable time limit for the low-level scanners?

Original comment by radist...@gmail.com on 13 Apr 2010 at 10:23

GoogleCodeExporter commented 9 years ago
Well, maybe you're right, and all we need is a modified update query for the
existing-tree case:
{{{
        if size is not None:
            if size > 0:
                # Another share may already hold this tree; reuse its next_scan
                # so that shares with the same tree stay on the same schedule.
                cursor.execute("SELECT next_scan FROM shares WHERE tree_id=%(t)s LIMIT 1",
                               {'t': tree_id})
            if size > 0 and cursor.rowcount > 0:
                cursor.execute("""
                    UPDATE shares SET tree_id = %(t)s, size = %(sz)s,
                        last_scan = now(), next_scan = %(n)s
                    WHERE share_id = %(s)s;
                    """, {'s': share_id, 't': tree_id, 'sz': size,
                          'n': cursor.fetchone()[0]})
            else:
                # No other share with this tree (or an empty tree): keep the
                # default next_scan behaviour.
                cursor.execute("""
                    UPDATE shares SET tree_id = %(t)s, size = %(sz)s, last_scan = now()
                    WHERE share_id = %(s)s;
                    """, {'s': share_id, 't': tree_id, 'sz': size})
}}}

Original comment by radist...@gmail.com on 13 Apr 2010 at 10:55

GoogleCodeExporter commented 9 years ago
> Maybe we need some configurable time limit for the low-level scanners?
I don't like this idea.

> "UPDATE shares SET next_scan=now() WHERE tree_id=%s AND size>0" % oldtree_id
There may be many shares with next_scan < now(). This is not a big deal and can be
fixed.

> It doesn't work when 3 spiders are running at the same time, though.
Maybe quantizing shares by tree_id in the main cycle would help us?
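A minimal sketch of what that quantizing could mean, reusing the thread's `cursor`
(the batch loop and the rescan() helper are assumptions, not existing uguu code):
{{{
# Hedged sketch: fetch every due share of one tree and rescan them together,
# so their next_scan values move in lockstep.
cursor.execute("SELECT share_id FROM shares "
               "WHERE tree_id = %(t)s AND next_scan <= now()",
               {'t': tree_id})
for (share_id,) in cursor.fetchall():
    rescan(share_id)     # hypothetical per-share scan step
}}}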

Original comment by ruslan.savchenko on 13 Apr 2010 at 11:00

GoogleCodeExporter commented 9 years ago
> Maybe quantizing shares by tree_id in the main cycle would help us?
Only one share is selected in each loop cycle so that several spider instances have
a chance to run. We shouldn't make assumptions about the states of other shares
during a scan. For example, lookup.py could delete some of them.
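For contrast, a rough sketch of the one-share-per-cycle loop described above
(assumed shape, not the actual spider.py code):
{{{
# Hedged sketch: each iteration claims a single due share, so several spider
# processes can run side by side without batching assumptions.
while True:
    cursor.execute("SELECT share_id FROM shares "
                   "WHERE next_scan <= now() "
                   "ORDER BY next_scan LIMIT 1")
    row = cursor.fetchone()
    if row is None:
        break
    rescan(row[0])       # hypothetical scan step that also advances next_scan
}}}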

> There may be many shares with next_scan < now(). This is not a big deal and can be
fixed.
It's not a problem; typically all of them will be rescanned soon. Anyway, I don't
like that query anymore.

Original comment by radist...@gmail.com on 13 Apr 2010 at 11:11

GoogleCodeExporter commented 9 years ago
> We shouldn't make assumptions about the states of other shares during a scan.
If patching is implemented, the spider will have to deal with a pack of shares.
Otherwise diverged shares would be patched from one to another without diverging
them in the database. Discarding this problem won't do any good, because this issue
is a step toward patching.

Original comment by ruslan.savchenko on 13 Apr 2010 at 11:59

GoogleCodeExporter commented 9 years ago
Forget it. With patching we have even more trouble. I'll leave this issue open
because this is really a problem, but patching comes first.

Original comment by ruslan.savchenko on 13 Apr 2010 at 12:09

GoogleCodeExporter commented 9 years ago
close

Original comment by ruslan.savchenko on 24 Apr 2010 at 12:16