google-cloudsearch / norconex-committer-plugin

Google Cloud Search Norconex HTTP Collector Indexer Plugin
Apache License 2.0

Committer plugin installation causes meta tag indexing issues #11

Closed SaschaHeyer closed 5 years ago

SaschaHeyer commented 5 years ago

We found an issue related to the dependencies that the committer plugin installs into the lib folder. The issue causes meta tags not to be extracted properly.

The behavior is only reproducible if the body of the page contains a small amount of content. On pages with large content the issue is not reproducible.

To reproduce the behavior please use the following files:

The HTML files contain a meta tag for testing:

<meta name="test" content="test" />
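As a point of comparison, the sketch below shows the result one would expect for this tag regardless of how much body content the page has. It uses Python's standard `html.parser`, not the Norconex importer, and the page string is a hypothetical stand-in for the empty-body test file.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            self.meta[name] = attributes.get("content", "")


def extract_meta(html):
    """Return a dict of meta tag names to their content values."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.meta


# Hypothetical stand-in for a page with an empty body (not the real test file).
empty_body_page = (
    '<html><head><meta name="test" content="test" /></head>'
    "<body></body></html>"
)
print(extract_meta(empty_body_page))
```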

After some testing we found out that this issue occurs as soon as we install the Cloud Search Norconex HTTP Collector committer plugin.

Steps to reproduce working case

These steps verify that meta tag extraction works properly in the default Norconex setup.

  1. Install Norconex HTTP Collector (without Cloud Search Norconex HTTP Collector committer plugin)
  2. Add the start URLs:
    <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
    <url>https://storage.googleapis.com/sascha-issue-reproduction/emptyBody.html</url>
    <url>https://storage.googleapis.com/sascha-issue-reproduction/smallBody.html</url>
    <url>https://storage.googleapis.com/sascha-issue-reproduction/largeBody.html</url>
    </startURLs>
  3. Use the FileSystemCommitter as the committer
  4. Start the crawler
  5. Open the .meta files for all 3 pages
  6. Search for the meta tag
    test = test
  7. The meta tag can be found in all 3 .meta files
  8. Everything works as expected 👍
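Checking the .meta files by hand can be automated. The following is a minimal sketch that assumes the .meta files contain simple `key = value` lines, as the `test = test` entry suggests; it does not handle every detail of the Java properties format (escapes, line continuations). The function names are made up for this illustration.

```python
from pathlib import Path


def read_meta_fields(path):
    """Parse 'key = value' lines into a dict (a simplification, see above)."""
    fields = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines and properties-style comments.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def files_missing_field(root, key="test", expected="test"):
    """Return all *.meta files under root where key is absent or differs."""
    missing = []
    for meta_file in sorted(Path(root).rglob("*.meta")):
        if read_meta_fields(meta_file).get(key) != expected:
            missing.append(meta_file)
    return missing
```

Calling `files_missing_field("crawledFiles")` (the folder name is an assumption about the committer's output directory) should return an empty list in the working case.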

Steps to reproduce failure case

These steps reproduce the error case.

  1. Take the Norconex installation from the previous steps and install the Google Cloud Search Norconex committer plugin.
  2. Delete all previously crawled files from the /crawledFiles folder
  3. Start the crawler
  4. Open the .meta files for all 3 pages
  5. Search for the meta tag
    test = test
  6. The meta tag can be found only in the file with large content
  7. For the other files the meta tag is not extracted 👎

Versions

Keep in mind

If you reproduce the behavior with the Norconex example configuration, remember to remove the following tagger, which would otherwise discard the test field:

<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>title,keywords,description,document.reference</fields>
</tagger>

Best regards Sascha

joelvoss commented 5 years ago

To give a little more insight into the duplicate *.jar files, I ran the find-dup-jars.sh script shipped with the norconex-http-collector, which produces the following output:

xml-apis:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/xml-apis-1.4.01.jar [2019-04-15T11:39:02]
   /norconex-collector-http/lib/xml-apis-1.4.01.jar [2017-12-12T22:57:44]

commons-lang:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-lang-2.6.jar [2018-12-17T09:17:38]
   /norconex-collector-http/lib/commons-lang-2.6.jar [2017-12-12T22:54:20]

org.eclipse.wst.xml.xpath2.processor:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar [2019-04-15T11:39:02]
   /norconex-collector-http/lib/org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar [2017-12-12T22:57:44]

httpcore:
 * /norconex-collector-http/lib/httpcore-4.4.6.jar [2017-12-12T23:06:00]
   /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/httpcore-4.4.5.jar [2019-04-15T11:43:08]

jackson-core:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/jackson-core-2.9.6.jar [2019-02-14T08:48:40]
   /norconex-collector-http/lib/jackson-core-2.8.1.jar [2017-12-12T23:15:32]

commons-codec:
 * /norconex-collector-http/lib/commons-codec-1.10.jar [2017-12-12T22:54:38]
   /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-codec-1.9.jar [2018-12-17T09:17:16]

xercesImpl-xsd11:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/xercesImpl-xsd11-2.12-beta-r1667115.jar [2019-04-15T11:39:02]
   /norconex-collector-http/lib/xercesImpl-xsd11-2.12-beta-r1667115.jar [2017-12-12T22:57:44]

commons-io:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-io-2.5.jar [2018-12-17T09:17:16]
   /norconex-collector-http/lib/commons-io-2.5.jar [2017-12-12T22:54:34]

norconex-commons-lang:
 * /norconex-collector-http/lib/norconex-commons-lang-1.15.1-SNAPSHOT.jar [2019-03-30T23:00:46]
   /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/norconex-commons-lang-1.13.1.jar [2019-04-15T11:39:20]

commons-collections4:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-collections4-4.1.jar [2019-04-15T11:39:06]
   /norconex-collector-http/lib/commons-collections4-4.1.jar [2017-12-12T22:54:38]

commons-logging:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-logging-1.2.jar [2018-12-17T09:17:18]
   /norconex-collector-http/lib/commons-logging-1.2.jar [2017-12-12T22:54:44]

commons-collections:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-collections-3.2.2.jar [2019-03-27T15:11:44]
   /norconex-collector-http/lib/commons-collections-3.2.2.jar [2017-12-12T22:57:40]

commons-configuration:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-configuration-1.10.jar [2019-04-15T11:39:02]
   /norconex-collector-http/lib/commons-configuration-1.10.jar [2017-12-12T22:54:36]

log4j:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/log4j-1.2.17.jar [2019-02-14T08:48:40]
   /norconex-collector-http/lib/log4j-1.2.17.jar [2017-12-12T22:54:38]

velocity:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/velocity-1.7.jar [2019-03-22T08:57:38]
   /norconex-collector-http/lib/velocity-1.7.jar [2017-12-12T22:54:36]

commons-lang3:
 * /norconex-collector-http/lib/commons-lang3-3.6.jar [2017-12-12T23:06:30]
   /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-lang3-3.5.jar [2019-04-15T11:39:02]

guava:
 * /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/guava-17.0.jar [2019-05-15T15:34:52]
   /norconex-collector-http/lib/guava-17.0.jar [2017-12-12T23:15:26]

httpclient:
 * /norconex-collector-http/lib/httpclient-4.5.3.jar [2017-12-12T23:05:56]
   /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/httpclient-4.5.2.jar [2018-12-17T09:17:18]

@donghanmiao @TanmayVartak, would it be possible to shade those dependencies inside the norconex-committer-plugin?
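A listing like the one above can be approximated with a short script. This is only a sketch of the idea, not the logic of find-dup-jars.sh: it assumes jar file names follow the `<artifact>-<version>.jar` pattern, where the version starts with a digit, and groups jars from several lib folders by artifact name.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: the version is everything after the first hyphen that is
# directly followed by a digit.
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<version>\d.*)\.jar$")


def split_jar_name(filename):
    """Split 'commons-lang3-3.6.jar' into ('commons-lang3', '3.6')."""
    match = JAR_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("version")


def find_duplicate_jars(*lib_dirs):
    """Map each artifact found more than once to its (version, path) entries."""
    found = defaultdict(list)
    for lib_dir in lib_dirs:
        for jar in sorted(Path(lib_dir).glob("*.jar")):
            parsed = split_jar_name(jar.name)
            if parsed is not None:
                name, version = parsed
                found[name].append((version, jar))
    return {name: entries for name, entries in found.items() if len(entries) > 1}
```

Entries whose versions differ, such as httpcore 4.4.6 and 4.4.5 in the output above, are the ones most likely to matter when both folders end up on the same classpath.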

joelvoss commented 5 years ago

@donghanmiao @TanmayVartak, any updates on this? I'd like to hear your opinion on the subject before trying to find another workaround.

donghanmiao commented 5 years ago

Thank you for reporting this, I've filed a bug internally to track this. I will provide an update asap.

donghanmiao commented 5 years ago

We were able to reproduce the issue with versions 2.8.0 and 2.8.1. I will post an update when we have a solution.

donghanmiao commented 5 years ago

To avoid the issue, please use option 3 (Do not copy source Jar (leave target Jar as is)) when copying the unzipped jars. We will make it the recommended option going forward in the developer documentation.

We will update the old dependencies used in the committer, which may have caused this issue. We expect to include this update in our next release coming in the following weeks.
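The effect of option 3 (leave the target jar as is) can be illustrated with a small sketch. This is not the installer script shipped with Norconex. It assumes jar names follow the `<artifact>-<version>.jar` pattern and copies a plugin jar only when the target lib folder does not already contain that artifact in any version. The function names and folder arguments are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumption: the version starts at the first hyphen followed by a digit.
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-\d.*\.jar$")


def artifact_name(filename):
    """Return the artifact part of a jar name, or the full name if unversioned."""
    match = JAR_PATTERN.match(filename)
    return match.group("name") if match else filename


def copy_missing_jars(source_lib, target_lib):
    """Copy jars whose artifact is absent from target_lib; never overwrite."""
    target = Path(target_lib)
    existing = {artifact_name(jar.name) for jar in target.glob("*.jar")}
    copied, skipped = [], []
    for jar in sorted(Path(source_lib).glob("*.jar")):
        name = artifact_name(jar.name)
        if name in existing:
            # The target already has this artifact: leave it untouched.
            skipped.append(jar.name)
        else:
            shutil.copy2(jar, target / jar.name)
            existing.add(name)
            copied.append(jar.name)
    return copied, skipped
```

With this approach the collector keeps the dependency versions it shipped with, and only jars that are unique to the plugin are added.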

joelvoss commented 5 years ago

Hello @donghanmiao, thank you for your support. I'm looking forward to the next release.

SaschaHeyer commented 5 years ago

@donghanmiao great thanks

joelvoss commented 5 years ago

Any indication on when a new release will be published?

donghanmiao commented 5 years ago

Hopefully this week. We are resolving some issues with our internal open-source process, which took longer than expected.

donghanmiao commented 5 years ago

We've put together the new release in https://github.com/google-cloudsearch/norconex-committer-plugin/pull/12. However, our testing backend is not working properly, and we are working on a fix. In case you need the latest version, please use this pull request.

SaschaHeyer commented 5 years ago

Hi @donghanmiao any ETA when we can expect the official release?

Best regards Sascha

donghanmiao commented 5 years ago

Released.

SaschaHeyer commented 5 years ago

Hi @donghanmiao,

we updated the committer to the latest release v1-0.0.5 and the issue still exists. Please see attached the meta files of the crawled pages.

Only the page with the large body contains the test meta tag; the files for the pages with a small or empty body are still missing it.

1563864842933000000-add.meta.txt 1563864842937000000-add.meta.txt 1563864843365000000-add.meta.txt

Best regards Sascha

donghanmiao commented 5 years ago

Hi Sascha, did you use a clean installation? What option did you use when installing the jars?

SaschaHeyer commented 5 years ago

@donghanmiao Confirmed issue resolved 👍

donghanmiao commented 5 years ago

Great thanks
