Comments from Rakesh on the email thread, with feedback from John and Marty:
"I was reviewing the changes related to the batch size.
I am not inclined to remove the (2*batchHint) cap used by the SharePoint
connector.
Reasons:
• The SharePoint connector tries to fetch bacthHint # of docs from
each
SharePoint List till the total no of docs >= (2*batchHint). This is important
as the
connector follows a greedy approach of discovering docs from as many Lists in
SharePoint as possible in a single batch traversal rather than being sequential
(one
List at a time). If the cap of (2*batchHint) is removed, then, the connector
needs to
be changed to fetch something like (0.5 * batchHint) docs from a given
SharePoint
List so that the connector still looks for docs from multiple Lists rather than
from
a single list. This is important during the initial crawl as the chances that
the
batchHint # of docs being discovered from every list is high.
• Even though the connector returns (2*batchHint) no of docs, the
BatchSize.maximum ensures that host load is being honored. In fact these
changes make
sense with the SharePoint connector’s implementation since the chances of all
(2*batchHint) docs making it through to the GSA in the same batch traversal are
high
as compared to earlier CM versions.
As Marty has pointed out that “a good citizen would follow the 'hint' and not
the
'max' “, looking at the above benefits I am inclined to retain the
(2*batchHint) cap
in the SharePoint connector.
"
Comments from John L:
"
The BatchSize maximum does not ensure the host load is being honored. If the
SharePoint connector really returns 2 * batchHint documents in a DocumentList
per
minute, then the traversal will run at twice the host load.
(There are details there. It will run at twice the host load up to a host load
of
500, and then linearly decline to match the host load at 1000. If it takes more
than
a minute then the extra time decreases the host load overrun.)
We seem to all agree that a good citizen will aim for the batch hint, not the
batch
maximum, but the SharePoint connector is aimed at the batch maximum.
Perhaps a different change would make sense, so that the DocumentList returned
to the
CM does not consistently have 2 * batchHint documents in it, even if internally
the
SharePoint connector tries to get 2 * batchHint candidates. Or we can leave it
as-is.
That's not my first choice, but it's not the end of the world.
I did forget to say, that if we implement Eric's suggestion for a delay when
the
connector exceeds the host load, then the SharePoint connector would not
consistently
exceed the host load over time, even if it did for every batch.
John L
"
Comments from Marty:
"Sounds like SP is not going to change it's behavior so when there's a lot of
new
content it will be exceeding the host load. Not great but on the other hand
it's
only an issue during the initial crawl and after massive updates - both times
load on
the SP servers should be expected."
Original comment by rakeshs101981@gmail.com
on 4 Nov 2009 at 1:51
Final approach:
The current thinking is to always query batchHint docs from each list and return whenever the number of docs in the current batch traversal is >= batchHint.
Even with this, the connector can still send nearly (2*batchHint) docs, but only in an extreme case: for batchHint=100, if the first 99 lists have one doc each and the 100th list has 100 docs, the total returned is 199. This will be rare.
Original comment by rakeshs101981@gmail.com
on 4 Nov 2009 at 1:52
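A sketch of the revised loop under the final approach, reusing the placeholder Document and SPList types from the earlier sketch: the connector still requests up to batchHint documents per list, but stops as soon as the running total reaches batchHint, so the worst case is (2*batchHint - 1) documents, matching the 199-document example above.

    import java.util.ArrayList;
    import java.util.List;

    // Revised traversal per the final approach; Document and SPList are
    // the same placeholder types as in the earlier sketch.
    class RevisedTraversalSketch {
        List<Document> traverse(List<SPList> lists, int batchHint) {
            List<Document> batch = new ArrayList<>();
            for (SPList list : lists) {
                batch.addAll(list.fetchDocuments(batchHint));
                if (batch.size() >= batchHint) {
                    break;  // stop at the hint instead of the 2x cap
                }
            }
            // Worst case: batchHint - 1 docs accumulated, then one list
            // adds batchHint more, giving 2 * batchHint - 1 in total
            // (199 for batchHint=100).
            return batch;
        }
    }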
Fix details:
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=420
Original comment by rakeshs101981@gmail.com
on 4 Nov 2009 at 1:53
Original comment by rakeshs101981@gmail.com
on 6 Nov 2009 at 11:34
Verified in 2.4 Release
Original comment by ashwinip...@gmail.com
on 14 Dec 2009 at 7:00
Original issue reported on code.google.com by
rakeshs101981@gmail.com
on 4 Nov 2009 at 1:29