Sending metadata to GSA when doc size > 30M

GoogleCodeExporter commented 9 years ago

Can the connector discover the document size before retrieving the whole document? If it can, can it send only the metadata? The TraversalContext should be used to read the document size limit.

Original issue reported on code.google.com by jeffreyl...@gmail.com on 20 Aug 2009 at 7:44

GoogleCodeExporter commented 9 years ago

Original comment by rakeshs101981@gmail.com on 27 Aug 2009 at 5:14

Added labels: Milestone-2.4

GoogleCodeExporter commented 9 years ago

Initial investigation suggests that the file size is returned as a metadata attribute:
ows_FileSizeDisplay

The following link suggests it is safe to use:
http://blogs.msdn.com/karthick/archive/2006/04/07/570398.aspx

Needs more investigation to confirm that it can be handled seamlessly.

Original comment by rakeshs101981@gmail.com on 3 Sep 2009 at 11:10

GoogleCodeExporter commented 9 years ago

Will use the ows_FileSizeDisplay attribute to determine the file size. This will also require that SharePointTraversalManager implement com.google.enterprise.connector.spi.TraversalContextAware.

The main methods of interest are:
1. maxDocumentSize() -- Should be checked before opening a stream for the document.
2. mimeTypeSupportLevel(String mimeType) -- Right now there is no direct way of determining the mimetype. The ows_ContentType meta-attribute does not return the mimetype. See http://blogs.msdn.com/tejasr/default.aspx and http://social.msdn.microsoft.com/Forums/en-US/sharepointdevelopment/thread/2bea3746-843f-4836-b35e-7b537d6b0a75
3. traversalTimeLimitSeconds() -- This should be used to return from batch traversal before the traversal thread times out. Needs a bit more analysis as to where this check should be applied. It cannot be done in startTraversal() and resumeTraversal(). Is SharePointClient.updateGlobalState or SharePointClient.updateWebStateFromSite the right place?
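
A minimal sketch of the size check from point 1, to make the idea concrete. It is not the connector's code: `SizeLimitSource` is a hypothetical stand-in for the single TraversalContext method used here, `DocumentSizeGate` is an invented helper, and the sketch assumes that ows_FileSizeDisplay holds the size in bytes as a decimal string, which still needs to be confirmed.

```java
import java.util.OptionalLong;

/** Hypothetical stand-in for the one TraversalContext method this sketch needs. */
interface SizeLimitSource {
    long maxDocumentSize();
}

/** Hypothetical helper: decides whether to send metadata only for a document. */
final class DocumentSizeGate {
    private DocumentSizeGate() {}

    /** Parses a raw ows_FileSizeDisplay value; empty if missing, negative, or not a number. */
    static OptionalLong parseSize(String rawFileSizeDisplay) {
        if (rawFileSizeDisplay == null) {
            return OptionalLong.empty();
        }
        try {
            long size = Long.parseLong(rawFileSizeDisplay.trim());
            return size >= 0 ? OptionalLong.of(size) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Returns true when the content should not be fetched and only metadata sent. */
    static boolean sendMetadataOnly(String rawFileSizeDisplay, SizeLimitSource context) {
        OptionalLong size = parseSize(rawFileSizeDisplay);
        // Assumption of this sketch: an unknown size falls back to fetching the content.
        return size.isPresent() && size.getAsLong() > context.maxDocumentSize();
    }
}
```

The check runs on metadata alone, so no stream has to be opened for an oversized document.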

Original comment by rakeshs101981@gmail.com on 12 Sep 2009 at 1:23

GoogleCodeExporter commented 9 years ago

For the mimetype, I think we'll have to use the HttpClient call. This will not be extra work for the connector because it's already being done in content feed mode. Do we need to consider this in the case of M&U, as this will make the connector fully TraversalContext aware?

For the third point, we first need to decide on the atomicity during the crawl. The atomicity can be defined at either the site level or the list level. I would not recommend a document-level interrupt, as that would introduce a lot of complexity and probably bugs.

Considering a list as the atomic crawl unit may be appropriate. The traversal interrupt can then be initiated in the SharePointClient.updateWebStateFromSite method.
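
A rough sketch of what a list-level interrupt could look like, under stated assumptions. `ListLevelTraversal`, the injected clock, and the `crawlOneList` callback are all hypothetical names for illustration; the time limit is assumed to come from traversalTimeLimitSeconds(). The deadline is checked only between lists, so a list that has been started is always finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical sketch: treats each list as an atomic crawl unit. */
final class ListLevelTraversal {
    private ListLevelTraversal() {}

    /**
     * Crawls lists in order and stops before starting a new list once the
     * time limit has elapsed. Returns the names of the lists fully crawled.
     */
    static List<String> crawlLists(List<String> listNames,
                                   long timeLimitSeconds,
                                   LongSupplier clockMillis,
                                   Consumer<String> crawlOneList) {
        long deadline = clockMillis.getAsLong() + timeLimitSeconds * 1000L;
        List<String> completed = new ArrayList<>();
        for (String name : listNames) {
            // The check sits between lists only; a list in progress is never interrupted.
            if (clockMillis.getAsLong() >= deadline) {
                break;
            }
            crawlOneList.accept(name);
            completed.add(name);
        }
        return completed;
    }
}
```

Because a started list always runs to completion, a real implementation would probably need a safety margin below the actual limit; how large is an open question.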

Original comment by th.nitendra on 14 Sep 2009 at 4:03

GoogleCodeExporter commented 9 years ago

As per Connector Manager Issue 143 (http://code.google.com/p/google-enterprise-connector-manager/issues/detail?id=143):

A new group of 'ignored' mimetypes has been added. If the document's mimetype is in this list, the document should be skipped entirely. For this purpose, a new exception class, SkippedDocumentException, has been added to the SPI. The connector should throw this exception for such documents. More details:

http://code.google.com/p/google-enterprise-connector-manager/source/detail?r=2319
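
A hedged sketch of how the connector might act on the support level. `SkippedDocumentStandIn` is a local stand-in for the SPI's SkippedDocumentException, not the real class, and `MimeTypeGate` is a hypothetical helper. The sketch assumes that a negative value from mimeTypeSupportLevel(String) marks an ignored mimetype and that zero means unsupported content; both should be checked against the SPI documentation.

```java
import java.util.function.ToIntFunction;

/** Local stand-in for the SPI's SkippedDocumentException (hypothetical shape). */
class SkippedDocumentStandIn extends Exception {
    SkippedDocumentStandIn(String message) {
        super(message);
    }
}

/** Hypothetical helper: maps a mimetype support level to a feed decision. */
final class MimeTypeGate {
    enum Action { SEND_CONTENT, SEND_METADATA_ONLY }

    private MimeTypeGate() {}

    static Action decide(String mimeType, ToIntFunction<String> supportLevel)
            throws SkippedDocumentStandIn {
        int level = supportLevel.applyAsInt(mimeType);
        // Assumption: a negative level means the mimetype is in the 'ignored' group.
        if (level < 0) {
            throw new SkippedDocumentStandIn("Ignored mimetype: " + mimeType);
        }
        // Assumption: zero means the content is unsupported, so only metadata is sent.
        return level > 0 ? Action.SEND_CONTENT : Action.SEND_METADATA_ONLY;
    }
}
```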

Original comment by rakeshs101981@gmail.com on 5 Nov 2009 at 10:34

GoogleCodeExporter commented 9 years ago

Fix details:
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=430
http://code.google.com/p/google-enterprise-connector-sharepoint/source/detail?r=429

Original comment by rakeshs101981@gmail.com on 5 Nov 2009 at 4:39

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Verified in 2.4 Release

Original comment by ashwinip...@gmail.com on 14 Dec 2009 at 7:03

Changed state: Verified

superdevo / google-enterprise-connector-sharepoint

Sending metadata to GSA when doc size > 30M #101