googlegsa / manager.v3

Google Search Appliance Connector Manager
Apache License 2.0

TraversalContextAware interface is not used #62

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Read the source.

What is the expected output? What do you see instead?

The TraversalContextAware interface is supposed to be implemented by
TraversalManager implementations that want to support restrictions by size
or content type. There is a ProductionTraversalContext class in the
connector manager, but it is essentially empty. The file size and content
type limits are not implemented. The setTraversalContext method of the
TraversalContextAware interface is never called.

As an aside, the ProductionTraversalContext is apparently relying on bean
initialization, which is not being done. If the existing instance were
passed to setTraversalContext, the result would probably be a
NullPointerException, because both the fileSizeLimitInfo and mimeTypeMap
fields are null by default.
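
A minimal sketch of the wiring described above, for illustration only. The
interfaces here are trimmed-down, hypothetical stand-ins for the SPI types
named in this report, and SketchTraversalManager, shouldFeed, and
injectContext are invented names, not Connector Manager code.

```java
// Hypothetical stand-ins for the SPI types discussed in this issue.
interface TraversalContext {
    long maxDocumentSize();
    int mimeTypeSupportLevel(String mimeType);
}

interface TraversalContextAware {
    void setTraversalContext(TraversalContext traversalContext);
}

interface TraversalManager { }

// A connector-side traversal manager that opts in to size and type limits.
class SketchTraversalManager implements TraversalManager, TraversalContextAware {
    private TraversalContext context;

    @Override
    public void setTraversalContext(TraversalContext traversalContext) {
        this.context = traversalContext;
    }

    // Returns true if a document of this size and type should be fed.
    boolean shouldFeed(long sizeInBytes, String mimeType) {
        if (context == null) {
            return true; // No context was injected, so apply no restrictions.
        }
        return sizeInBytes <= context.maxDocumentSize()
            && context.mimeTypeSupportLevel(mimeType) > 0;
    }
}

final class TraversalWiringSketch {
    // The call this issue reports as missing: hand the context to any
    // traversal manager that declares it wants one.
    static void injectContext(TraversalManager manager, TraversalContext context) {
        if (manager instanceof TraversalContextAware) {
            ((TraversalContextAware) manager).setTraversalContext(context);
        }
    }
}
```

The null check in shouldFeed is a design choice of this sketch: a connector
that never receives a context behaves as if no limits were configured.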

What version of the product are you using? On what operating system?

Connector Manager 1.0.2.

Please provide any additional information below.

Original issue reported on code.google.com by jl1615@gmail.com on 5 Nov 2007 at 10:47

GoogleCodeExporter commented 9 years ago
Needs research on how the GSA-side interactions should work.

Original comment by donald.z...@gmail.com on 18 Apr 2008 at 10:31

GoogleCodeExporter commented 9 years ago
The connector manager should expose the restrictions on:
  - document size
  - extension exclusion
  - URL patterns

The connectors should be restricted by these.

Original comment by jeffreyl...@gmail.com on 24 May 2008 at 1:15

GoogleCodeExporter commented 9 years ago
Suggested solution: Add configuration settings to
applicationContext.properties for the file size limit and the list of
mime types.

Original comment by jeffreyl...@gmail.com on 18 Jul 2008 at 8:35

GoogleCodeExporter commented 9 years ago

Original comment by Brett.Mi...@gmail.com on 25 Aug 2008 at 4:18

GoogleCodeExporter commented 9 years ago
Description:

This set of changes addresses Connector Manager Issue 62 - Implement TraversalContext

The constraints outlined in the TraversalContext do not match up well
to those configured by the GSA.  For instance, the TraversalContext
specifies a max document size, whereas the GSA's maximum document size
depends upon the document type.  The TraversalContext allows a weighted
map of mime types to support (or not support), whereas the GSA uses a
list of filename patterns (mostly just extensions).  This usually is
not too much of a problem, but there are a couple of annoying gotchas:
e.g. the GSA does not support any compressed file types, except
compressed PostScript files (.ps.Z or .ps.gz).

I enhanced the TraversalContext SPI slightly, offering a couple of
convenience methods that allow the connector to: 
  - determine the most preferred mime type from a supplied set.
    (The Livelink connector could use this to select from various
    Renditions.)
  - determine the most preferred mime type to supply, given
    a filename extension.  The filename extension can be simple,
    like .doc, or compound, like .ps.Z (see annoying gotcha above).
Both of these return null if all of the candidate mimetypes have
non-positive support levels (that is, they are unsupported).
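
As a rough illustration of the first convenience method and the null-return
behaviour described in this comment, here is a minimal Java sketch. It is
not the committed code: the class name, the constructor, and the numeric
levels passed in are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: pick the most preferred mime type from a set.
final class PreferredMimeTypeSketch {
    private final Map<String, Integer> supportLevels;
    private final int unknownLevel;

    PreferredMimeTypeSketch(Map<String, Integer> supportLevels, int unknownLevel) {
        this.supportLevels = new HashMap<>(supportLevels);
        this.unknownLevel = unknownLevel;
    }

    // Declared level if present, otherwise the configured "unknown" level.
    int supportLevel(String mimeType) {
        Integer level = supportLevels.get(mimeType);
        return level != null ? level : unknownLevel;
    }

    // Returns the candidate with the highest positive support level,
    // or null when every candidate has a non-positive level.
    String preferredMimeType(Set<String> mimeTypes) {
        String best = null;
        int bestLevel = 0; // only strictly positive levels qualify
        for (String candidate : mimeTypes) {
            int level = supportLevel(candidate);
            if (level > bestLevel) {
                best = candidate;
                bestLevel = level;
            }
        }
        return best;
    }
}
```

A connector holding several renditions of one document could pass their
mime types to preferredMimeType and skip the document when it gets null.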

I added a table of filename extensions to commonly associated
mime types.  This table is a superset of information I gathered
from various sources and is provided as a plain text Resource
that could be edited by the customer, if needed.
License Note: Two of the sources I used were explicitly Apache 2.0
licensed projects (Apache2 and MimeUtils).  The other two were
web sites (one at Duke University).  I sent email asking if it was
OK to use information taken off the sites, but have not heard back.
UPDATE: I heard back from Duke University and got the OK.  I was
unable to contact the other web site, so I removed its set of mappings
and replaced it with those from the Apache Tomcat web.xml file.

The TraversalContext is constructed via Spring using a bean definition
in applicationContext.xml.  The bean definition also allows the user 
to modify the TraversalContext constraints to suit their needs.  The
maximum document size, unknown mimetype support level, and the sets
of mimetypes to prefer, support, or not support can all be configured
here.

The mimetype support level code understands the concept of content
type classes - for instance, all audio, image, and video types
can be easily excluded.  This is done by specifying a content type,
sans subtype, in one of the lists.  These content type classes are not
exclusive - if a contenttype/subtype entry appears elsewhere, that
more explicit entry takes precedence.  For instance, you could specify:
  ... supported mime types ...
    image/x-asciiart
    ...
  ... unsupported mime types ...
    image

which means that ascii art images are supported, while all other
image formats are not supported.

TODO: The interaction between these content type classes and the
unknownMimeTypeSupport level needs to be closely examined.
TODO: The location of the content type classes (especially 'application')
in the supported vs unsupported sets needs to be examined.

Change Log:
----------
M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/ProductionTraversalContext.java
   - Implement preferredMimeType(Set mimeTypes) as a call to
     MimeTypeMap.preferredMimeType(Set mimeTypes).
   - Implement preferredMimeTypeForExtension(String extension) as a call to
     MimeTypeMap.preferredMimeTypeForExtension(String extension).
   - Initialize FileSizeLimitInfo and MimeTypeMap to defaults to
     avoid null-pointer exception if they are not initialized by
     applicationContext.xml.

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/QueryTraverser.java
   - Set the TraversalContext on the TraversalManager (if it is
     TraversalContextAware).

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/FileSizeLimitInfo.java
   - Initialize the default maxDocumentSize to 30MB.

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/MimeTypeMap.java
   - Added setPreferredMimeTypes(Set mimeTypes).  This complements
     setSupportedMimeTypes() and setUnsupportedMimeTypes().  Preferred
     mimetypes are typically plain text or html that require little or
     no format conversion when indexing, whereas supported mimetypes may
     require significant format conversion and text extraction when indexing.
   - Added preferredMimeType(Set mimeTypes) to the interface.  This returns
     the most preferred mimetype from the supplied set.
   - Added preferredMimeTypeForExtension(String extension) to the interface.
     This returns the most preferred mimetype from the set of mime types
     associated with the supplied filename extension.
   - Added mime type preferences even within the same gross support level.
     For instance, IANA-registered '/vnd.*'  subtypes are preferred
     over experimental '/x-*' subtypes.

M  projects/connector-manager/source/java/com/google/enterprise/connector/spi/TraversalContext.java
   - Added preferredMimeType(Set mimeTypes) to the interface.
   - Added preferredMimeTypeForExtension(String extension) to the interface.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/MimeTypeMapTest.java
   - Fix tests.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/SpringBasedProductionTraversalContextTest.java
   - Fix tests.

M  projects/connector-manager/etc/applicationContext.xml
   - Added bean instantiation for the ProductionTraversalContext, including
     child beans for MimeTypeMap, and FileSizeLimitInfo.

A  projects/connector-manager/etc/ext2mimetype.txt
   - A table of filename extensions to mime types. It is not a unique map -
     an extension may have several mime types associated with it.

Original comment by Brett.Mi...@gmail.com on 12 Sep 2008 at 9:15

GoogleCodeExporter commented 9 years ago
Perhaps I need to clarify my original concerns.

One of my goals in this task was to map the
default supported and unsupported file types
from the GSA Crawl and Index page (specified
as a list of [commented-out or not] filename
patterns/extensions) to a list of supported and
unsupported mimetypes in the connector manager.

For instance, the GSA Crawl and Index page says:
# The following are popular filetype extensions -
# uncomment the lines to disable crawling them
# Microsoft Word
#.doc$
# Microsoft Excel
#.xls$
#.xlw$
# Microsoft Powerpoint
#.ppt$
# Microsoft Access
.mdb$
...

Note that the Word, Excel, and Powerpoint filename
extensions are commented out, meaning they will
be crawled; but the Access filename extension is not
commented out, meaning that Access databases will
not be crawled.  This corresponds to the following
entries in the Connector Manager's MimeTypeMap:

<property name="supportedMimeTypes">
<set>
 <value>application/msword</value>
 <value>application/excel</value>
 <value>application/powerpoint</value>
 ...
</set>
</property>
...
<property name="unSupportedMimeTypes">
<set>
 <value>application/x-msaccess</value>
 ...
</set>
</property>

I ask you to look at the proposed full MimeTypeMap
near the end of the CM applicationContext.xml here:
http://tinyurl.com/5jfgcu

Mohit is correct.  An entry of "application" is a
catch-all for "application/*" content subtypes not
explicitly specified elsewhere.  For instance,
"application/pdf" is explicitly listed in the
supportedMimeTypes, so it is supported.  But
"application/x-foobar" is not explicitly mentioned
in either the supported or unsupported lists, so
it would be matched by the "application" content
type class catch-all.

The use of content type catch-alls for "image",
"audio", "video", etc is an obvious advantage.
Since we don't support these media types, we
don't need to detail every single existing
(and future) "image/subtype" content type in
the unsupported list.

My original questions relate to how these catch-all
content types relate to the existing "unknown
mimetype support level" mechanism.  The MimeTypeMap
in the above link contains catch-all content entries
for all of the standard content type classes, so it is
very unlikely that anything would end up as "unknown".

Additionally, should the "application" catch-all be
in the unsupported list, or the supported list,
or not in any list?  Leaving off an "application"
catch-all would cause application subtypes not
explicitly mentioned to fall through to the "unknown
mimetype" support level.  I am starting to lean toward
that configuration.  [The same question could be asked
of the "text" content type class catch-all.]

The Connector Manager MimeTypeMap does not reflect
end-user edits to the GSA Crawl and Index page,
so if the user wishes to explicitly include or
exclude certain file types, edits will have to
be made in both places.  The ext2mimetype.txt
table can help them map filename extensions used
in the GSA Crawl and Index page to the mime types
specified in applicationContext.xml.

Original comment by Brett.Mi...@gmail.com on 12 Sep 2008 at 9:18

GoogleCodeExporter commented 9 years ago
Let me summarize to make sure we're all clear, because Brett's recent
comment:

Mohit is correct.  An entry of "application" is a
catch-all for "application/*" content subtypes not
explicitly specified elsewhere.

was not entirely true in the code I originally reviewed (r937).  In the
recent snapshot (r941) the MimeTypeMap.mimeTypeSupportLevel() method has
been altered so that it behaves entirely as the catch-all described above.

This summary is based on r941 of the change branch which is the most recent.

1. There are 3 sections of 'declared' mime types (read in from the
applicationContext.xml file) in the MimeTypeMap: 1) Preferred Mime Types,
2) Supported Mime Types, and 3) Unsupported Mime Types.

2. In addition, if a given mime type is not found among the 'declared'
types, the MimeTypeMap also supports a 4th level which is Unknown Mime Type
Support Level.  The actual value for this level can be set.

3. When searching for support level: preferred types > supported types >
unknown types > unsupported types (which are less than 0).

4. When searching for support level given a mime type, there is no
precedence involved in checking the sets of 'declared' types - they are all
contained in one big hash.  There is, however, a precedence involved in
checking for precision - exact matches first and then, if there is no match
and a subtype was given, trying to match the root media-type without the
given subtype.  [Again for clarity a Mime Type declaration as I have defined
it is <media-type>/<subtype> so in application/msword, 'application' is the
media-type and 'msword' is the subtype.]

5. The appearance of just the media-type without a subtype in the set of
'declared' mime types will act as a catch-all for all subtypes of that
media-type that do not appear in any of the other declarations.  So
'application' as a declared mime type will match 'application/*' iff there
wasn't an exact match for 'application/<subtype>' in any of the other
declarations.

The latest version of the code has a unit test for this but for example:

Declare Supported = ["foo/baz", "bar/baz"]
Declare Unsupported = ["foo", "bar/cat"]

SupportLevel("foo/baz") -> Supported
SupportLevel("foo/rat") -> Unsupported (catch all)

SupportLevel("bar/baz") -> Supported
SupportLevel("bar/zoo") -> Unknown
SupportLevel("bar/cat") -> Unsupported
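
The lookup rules in points 4 and 5 and the example above can be sketched in
Java as follows. This is an illustration only, not the MimeTypeMap code from
r941: the class name, the declare method, and the numeric level constants
are hypothetical, chosen so that supported > unknown > unsupported and
unsupported is below 0.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of exact-match-then-catch-all lookup.
final class CatchAllLookupSketch {
    static final int SUPPORTED = 2;    // assumed value
    static final int UNKNOWN = 1;      // assumed value
    static final int UNSUPPORTED = -1; // assumed value, below 0

    // All declared types live in one map, whatever their level.
    private final Map<String, Integer> declared = new HashMap<>();

    void declare(int level, String... mimeTypes) {
        for (String mimeType : mimeTypes) {
            declared.put(mimeType, level);
        }
    }

    int supportLevel(String mimeType) {
        // 1. An exact media-type/subtype match wins.
        Integer level = declared.get(mimeType);
        // 2. Otherwise fall back to a bare media-type catch-all, if declared.
        int slash = mimeType.indexOf('/');
        if (level == null && slash > 0) {
            level = declared.get(mimeType.substring(0, slash));
        }
        // 3. Nothing declared at all: use the unknown level.
        return level != null ? level : UNKNOWN;
    }
}
```

Declaring SUPPORTED for "foo/baz" and "bar/baz" and UNSUPPORTED for "foo"
and "bar/cat" reproduces the five results listed in the example.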

Now for the resolution - it seems Mohit and I both agree that as currently
written (r941) we're good with leaving the catch-all declarations in the
'Unsupported' section.

Unless anyone has an objection, let's remove that <??? > comment from the
applicationContext.xml file.

    Marty

Original comment by Brett.Mi...@gmail.com on 12 Sep 2008 at 9:19

GoogleCodeExporter commented 9 years ago
The information in the ext2mimetypes.txt file was collected from various
sources.  All but one were explicitly Apache 2 licensed (Apache 2.0,
Tomcat 5.5, MimeUtils).  One source of information was a Duke University
website.  I wrote asking permission to use the information and was granted
it.  (See attached)

Original comment by Brett.Mi...@gmail.com on 12 Sep 2008 at 9:31

Attachments:

GoogleCodeExporter commented 9 years ago
Fixed in revision r944
Additional Documentation is forthcoming.

Original comment by Brett.Mi...@gmail.com on 12 Sep 2008 at 9:45

GoogleCodeExporter commented 9 years ago
r962 | Brett.Michael.Johnson | 2008-10-01 15:42:49 -0700 (Wed, 01 Oct 2008) | 38 lines

This is a slight reworking of the changes for Issue 62.
When implementing doc example code, I ran into a minor flaw.

However, when fixing the flaw, I had extensive discussion
with John L (the flawed code was at his request), and 
decided that the filename extension to mime type code
should not be in here at this time.  The task is more
suitably done by third party tools such as
 Mime-Util  http://sourceforge.net/projects/mime-util
or
 MagicMimeTypeIdentifier  http://aperture.sourceforge.net/doc/javadoc/org/semanticdesktop/aperture/mime/identifier/magic/MagicMimeTypeIdentifier.html

See http://fredeaker.blogspot.com/2006/12/file-type-mime-detection.html

Change Log:
----------

M  projects/connector-manager/etc/applicationContext.xml
   - Improved comments in MimeTypeMap configuration.
   - Dropped setting ext2mimetype.txt property.

M  projects/connector-manager/source/java/com/google/enterprise/connector/spi/TraversalContext.java
   - Improved comments.
   - Dropped preferredMimeTypeForExtension() method.
   - preferredMimeType() no longer returns null for unsupported types.

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/ProductionTraversalContext.java
   - Dropped preferredMimeTypeForExtension() method.
   - preferredMimeType() no longer returns null for unsupported types.

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/MimeTypeMap.java
   - preferredMimeType() no longer returns null for unsupported types.
   - Dropped preferredMimeTypeForExtension() method.
   - Dropped configuring filename extension to mime type table.

D  projects/connector-manager/etc/ext2mimetype.txt
   - Removed.

Original comment by Brett.Mi...@gmail.com on 7 Nov 2008 at 12:41

GoogleCodeExporter commented 9 years ago
r1070 | Brett.Michael.Johnson | 2008-11-03 17:36:00 -0800 (Mon, 03 Nov 2008) | 32 lines

This is a minor follow-on modification to the changes for
Connector Manager Issue 62 - TraversalContext

After discussing the documentation implications with George,
I decided to make a couple of minor adjustments to the 
mime type code:

- Restore the default value of unknownMimeTypeSupportLevel to 1.
  I had changed it from 1 to 2 early in development for a reason
  that eventually didn't pan out.  Currently there is no behavioral
  difference between the two values.  In reality, only values of
  0 or 1 make sense for unknownMimeTypeSupportLevel.

- Rank content types sans subtypes below content types with subtypes.
  For instance, supportLevel("text") < supportLevel("text/x-foo").
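
The second adjustment, combined with the '/vnd.*' over '/x-*' preference
from the earlier MimeTypeMap change log, could look roughly like the
tie-breaker below. This is a sketch under assumptions, not the committed
code: the class and method names are hypothetical, and ranking every
non-experimental subtype equally is a simplification made for the example.

```java
// Hypothetical tie-breaker applied between mime types that share the same
// gross support level. Higher values are more preferred.
final class MimeTypeRankSketch {
    static int tieBreaker(String mimeType) {
        int slash = mimeType.indexOf('/');
        if (slash < 0) {
            return 0; // bare content type class such as "text" ranks lowest
        }
        String subtype = mimeType.substring(slash + 1);
        if (subtype.startsWith("x-")) {
            return 1; // experimental subtypes rank below registered ones
        }
        return 2; // "vnd." and all other subtypes (assumed to rank equally)
    }
}
```

With this ordering, tieBreaker("text") is less than tieBreaker("text/x-foo"),
which matches the supportLevel comparison given above.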

Change Log:
M  projects/connector-manager/etc/applicationContext.xml
   - Restore default unknownMimeTypeSupportLevel value to 1
     (This doesn't really change any behaviour.)

M  projects/connector-manager/source/java/com/google/enterprise/connector/traversal/MimeTypeMap.java
   - Restore default unknownMimeTypeSupportLevel value to 1
   - Rank content types sans subtypes below any content types
     with subtypes for a given support level.

M  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/MimeTypeMapTest.java
M  projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/SpringBasedProductionTraversalContextTest.java
   - Fix up expected unknownMimeTypeSupportLevel in tests.

Original comment by Brett.Mi...@gmail.com on 7 Nov 2008 at 12:45

GoogleCodeExporter commented 9 years ago

Original comment by jl1615@gmail.com on 12 Jan 2009 at 3:14