rajgithub123 / google-enterprise-connector-sharepoint

Automatically exported from code.google.com/p/google-enterprise-connector-sharepoint

Metadata values not fully-indexed (values originating from the ows_MetaInfo state-bag field) #65

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
It appears that support for document and list item metadata indexing
is not fully baked in rel 1.3. The connector does not index a list
item / document field unless that field is part of the default view of
the list containing the item. While this behaviour is by design in the
Lists.asmx service used by the connector, the service makes it
possible to work around the limitation by pulling metadata values out
of a serialized state-bag field, which it returns as ows_MetaInfo
(MOSS 2007 only). The fields are delimited by the newline sequence
(\r\n) within a single attribute value, e.g.
ows_MetaInfo="Field1:SW|Value1\r\nField2:SW|Value2"

Support for culling data from this field is implemented in connector
version 1.3, but it does not work as designed because the Axis
MessageElement::getAttribute(string) method seems to replace every
newline with a space. This throws off the connector code, which
currently tries to break the values out by running the metaInfo value
through String.split("\r|\r\n"). As Axis returns no newlines, the
split yields a single array element and every field but the first is
dropped.
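
A minimal sketch of the failure described above. It assumes that each
\r\n arrives from Axis as a single space; the class name and sample
field names are hypothetical and not part of the connector.

```java
final class SplitFailureDemo {

    // Mirrors the split the connector is described as using.
    static int countParts(String metaInfo) {
        return metaInfo.split("\r|\r\n").length;
    }

    public static void main(String[] args) {
        // Value as SharePoint sends it on the wire.
        String asSent = "Field1:SW|Value1\r\nField2:SW|Value2";
        // Assumption: Axis hands the same value over with spaces instead.
        String asReceived = asSent.replace("\r\n", " ");

        System.out.println(countParts(asSent));     // prints 2
        System.out.println(countParts(asReceived)); // prints 1
    }
}
```

Note that even on the unmodified input the second element starts with
a stray \n, because the alternation matches the lone \r before it ever
tries \r\n.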

Below is a proposed alternative implementation of the
ListsWS::setDocLibMetadata() method, based on a regex that treats the
string as either newline-delimited or space-delimited (please note
that the method signature changes from accepting an array of strings
to accepting a single string). I am hoping the maintainers will
incorporate it into the project codebase. Hopefully, this will save
hours of frustration for somebody else in a similar situation.

On a side note, the purpose of the white list implementation is
somewhat perplexing. From my review of the sources, it seems to be
100% equivalent to the black list behaviour, i.e. in both cases a
field that shows up in either the white or the black list gets skipped
from the GSA import... I am guessing in this particular case (Black ==
White) is true ? :)

--- begin code snip for ListsWS.java ---

// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
private void setDocLibMetadata(SPDocument doc, String metaInfo) {
    String sFunctionName = "setDocLibMetadata(SPDocument doc, String metaInfo)";
    LOGGER.entering(className, sFunctionName);
    if ((metaInfo != null) && (doc != null)) {

        // Matches a state-bag key header such as "Field1:SW|".
        // No commas inside the character classes: there they would be
        // matched as literal characters.
        Pattern regex = Pattern.compile("(\\S+):[SLEV][RW]\\|");
        Matcher match = regex.matcher(metaInfo);

        boolean found = match.find();
        while (found) {
            String key = match.group(1).trim();

            // The value runs from the end of this header to the start
            // of the next header, or to the end of the string.
            int start = match.end();

            found = match.find();
            int end = found ? match.start() : metaInfo.length();
            // trim() drops the delimiter whether it is a space or \r\n.
            String value = metaInfo.substring(start, end).trim();

            if (!listMatches(blackList, key)) {
                if (value.length() > 0) {
                    LOGGER.log(Level.INFO, "Adding statebag item: "
                            + key + ":" + value);
                    doc.setAttribute("wss:" + key, value);
                } else {
                    LOGGER.log(Level.INFO, "Skipping an empty statebag item: "
                            + key);
                }
            } else {
                LOGGER.log(Level.INFO, "Skipping statebag item due to it being "
                        + "blacklisted: " + key + ":" + value);
            }
        }
    }
    LOGGER.exiting(className, sFunctionName);
}

--- end code snip for ListsWS.java ---
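
For anyone who wants to try the approach outside the connector, here is
a self-contained sketch of the same regex technique. It is only an
illustration: the class name MetaInfoParser, the Map return type, and
the sample field names are hypothetical, and it assumes the type codes
(S, L, E, V) and flags (R, W) used in the pattern above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaInfoParser {

    // A key header looks like "Field1:SW|": name, type code, R/W flag, pipe.
    private static final Pattern KEY_HEADER =
            Pattern.compile("(\\S+):[SLEV][RW]\\|");

    // Returns the fields in the order they appear in the state bag.
    static Map<String, String> parse(String metaInfo) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = KEY_HEADER.matcher(metaInfo);
        boolean found = m.find();
        while (found) {
            String key = m.group(1);
            int start = m.end();
            // Look ahead to the next header to find where this value ends.
            found = m.find();
            int end = found ? m.start() : metaInfo.length();
            // trim() removes the delimiter, be it a space or \r\n.
            fields.put(key, metaInfo.substring(start, end).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Both forms print {Field1=Value1, Field2=Value2}
        System.out.println(parse("Field1:SW|Value1\r\nField2:SW|Value2"));
        System.out.println(parse("Field1:SW|Value1 Field2:SW|Value2"));
    }
}
```

Because a value ends only where the next key header begins, values that
contain spaces of their own are kept whole.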

Original issue reported on code.google.com by valorek...@gmail.com on 2 Mar 2009 at 9:29

GoogleCodeExporter commented 9 years ago
Attaching a feed file in which the inline value of the ows_MetaInfo
field, <meta name="ows_MetaInfo" content="1;#"/>, is the result of bad
parsing.

Original comment by shashank...@gmail.com on 4 Mar 2009 at 12:28

GoogleCodeExporter commented 9 years ago
Please clarify your comment.
Thanks, 
Val

Original comment by valorek...@gmail.com on 4 Mar 2009 at 6:34

GoogleCodeExporter commented 9 years ago
The connector discovers the metadata but fails while parsing the
content. This happens only for content retrieved under the
ows_metainfo field, not for other attributes. By the way, for every
custom metadata field, the SharePoint web service also sends a
corresponding metadata attribute in addition to the one it sends in
ows_metainfo.

Any document-specific custom metadata is also returned as part of the
ows_metainfo field. The format used to separate the custom metadata
entries within ows_metainfo is not straightforward, and hence the
connector fails while parsing its content.

Original comment by rakesh.s...@gmail.com on 4 Mar 2009 at 6:57

GoogleCodeExporter commented 9 years ago
to rakesh.shete:

-- By The Way, for every custom metadata, SharePoint web service also
-- sends a corresponding metadata attribute apart from the one it sends
-- as in ows_metainfo.

This is not precisely the case. The Lists WS returns the custom fields
requested in the viewFields argument or, if none are specified, the
fields belonging to a given view. The Connector code specifies neither
viewFields nor viewName when calling the Lists WS, so the WS defaults
to the fields available in the default view (see
http://msdn.microsoft.com/en-us/library/lists.lists.getlistitems.aspx).
If the default view is configured not to include a certain custom
field X, X will not be returned by the WS among the ows_-prefixed
fields and will only be returned as part of the ows_MetaInfo statebag
field.
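
For illustration, the viewFields argument is a small XML fragment of
FieldRef elements. The sketch below only builds such a fragment; the
class name and the field names are hypothetical, and how the fragment
would be handed to the Axis-generated stub is left out because that
depends on the generated classes.

```java
import java.util.Arrays;
import java.util.List;

final class ViewFieldsBuilder {

    // Builds <ViewFields><FieldRef Name="..."/>...</ViewFields>
    // from a list of internal field names.
    static String build(List<String> internalFieldNames) {
        StringBuilder xml = new StringBuilder("<ViewFields>");
        for (String name : internalFieldNames) {
            xml.append("<FieldRef Name=\"")
               .append(escapeAttribute(name))
               .append("\"/>");
        }
        return xml.append("</ViewFields>").toString();
    }

    // Minimal escaping for a double-quoted XML attribute value.
    static String escapeAttribute(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // Hypothetical field names, used only as sample input.
        System.out.println(build(Arrays.asList("Title", "Project_x0020_Code")));
    }
}
```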

-- The format used to distinguish between various custom metadata
-- included in ows_metainfo field is not straightforward and hence
-- connector fails while parsing the content of ows_metainfo

The format of the ows_metainfo field is _always_ newline-delimited as
returned by SharePoint. I've confirmed this through packet capture
analysis of the data returned by the Lists WS. The inconsistency
arises when Axis parses the data before the Connector code receives
it. Please see the original post for the proposed regex-based
solution, which parses the data regardless of the bug.

Val

Original comment by valorek...@gmail.com on 7 Mar 2009 at 3:47

GoogleCodeExporter commented 9 years ago
The connector now sends every metadata item discovered in the
ows_MetaInfo property bag as a separate, distinct attribute. As a
result, users will no longer see an attribute named ows_MetaInfo.

Apart from this, the connector also makes a few small changes to the
metadata names and values returned by the Web Service:
1. Leading ows_ is removed from the attribute names
2. Leading vti_ is removed from the attribute names
3. _x0020_ is replaced by <space> in the attribute names
4. Leading <ITEM_ID>;# is removed from the attribute values

The purpose of the above changes is to make the metadata names and
values more meaningful to end users.
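
The four rules above can be sketched roughly as follows. This is not
the connector's actual code: the class and method names are
hypothetical, <ITEM_ID> is assumed to be a run of digits (as in the
"1;#" value seen earlier in this thread), and rules 1 and 2 are assumed
to be applied in that order.

```java
import java.util.regex.Pattern;

final class MetadataNormalizer {

    // Assumption: the item ID prefix is numeric, e.g. "1;#".
    private static final Pattern ITEM_ID_PREFIX = Pattern.compile("^\\d+;#");

    // Rules 1 to 3: clean up an attribute name.
    static String normalizeName(String name) {
        String result = name;
        if (result.startsWith("ows_")) {
            result = result.substring("ows_".length());
        }
        if (result.startsWith("vti_")) {
            result = result.substring("vti_".length());
        }
        return result.replace("_x0020_", " ");
    }

    // Rule 4: strip a leading "<ITEM_ID>;#" from an attribute value.
    static String normalizeValue(String value) {
        return ITEM_ID_PREFIX.matcher(value).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(normalizeName("ows_Project_x0020_Code")); // prints Project Code
        System.out.println(normalizeName("vti_title"));              // prints title
        System.out.println(normalizeValue("1;#Quarterly report"));   // prints Quarterly report
    }
}
```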

Original comment by th.nitendra on 24 Jun 2009 at 12:06

GoogleCodeExporter commented 9 years ago
Verified in 2.0.0 and works fine.

Original comment by shashank...@gmail.com on 25 Jun 2009 at 12:10