Closed SaschaHeyer closed 5 years ago
To give a bit more insight into the duplicate *.jars,
I ran the find-dup-jars.sh script
shipped with the norconex-http-collector, which produces the following output:
xml-apis:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/xml-apis-1.4.01.jar [2019-04-15T11:39:02]
/norconex-collector-http/lib/xml-apis-1.4.01.jar [2017-12-12T22:57:44]
commons-lang:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-lang-2.6.jar [2018-12-17T09:17:38]
/norconex-collector-http/lib/commons-lang-2.6.jar [2017-12-12T22:54:20]
org.eclipse.wst.xml.xpath2.processor:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar [2019-04-15T11:39:02]
/norconex-collector-http/lib/org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar [2017-12-12T22:57:44]
httpcore:
* /norconex-collector-http/lib/httpcore-4.4.6.jar [2017-12-12T23:06:00]
/google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/httpcore-4.4.5.jar [2019-04-15T11:43:08]
jackson-core:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/jackson-core-2.9.6.jar [2019-02-14T08:48:40]
/norconex-collector-http/lib/jackson-core-2.8.1.jar [2017-12-12T23:15:32]
commons-codec:
* /norconex-collector-http/lib/commons-codec-1.10.jar [2017-12-12T22:54:38]
/google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-codec-1.9.jar [2018-12-17T09:17:16]
xercesImpl-xsd11:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/xercesImpl-xsd11-2.12-beta-r1667115.jar [2019-04-15T11:39:02]
/norconex-collector-http/lib/xercesImpl-xsd11-2.12-beta-r1667115.jar [2017-12-12T22:57:44]
commons-io:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-io-2.5.jar [2018-12-17T09:17:16]
/norconex-collector-http/lib/commons-io-2.5.jar [2017-12-12T22:54:34]
norconex-commons-lang:
* /norconex-collector-http/lib/norconex-commons-lang-1.15.1-SNAPSHOT.jar [2019-03-30T23:00:46]
/google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/norconex-commons-lang-1.13.1.jar [2019-04-15T11:39:20]
commons-collections4:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-collections4-4.1.jar [2019-04-15T11:39:06]
/norconex-collector-http/lib/commons-collections4-4.1.jar [2017-12-12T22:54:38]
commons-logging:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-logging-1.2.jar [2018-12-17T09:17:18]
/norconex-collector-http/lib/commons-logging-1.2.jar [2017-12-12T22:54:44]
commons-collections:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-collections-3.2.2.jar [2019-03-27T15:11:44]
/norconex-collector-http/lib/commons-collections-3.2.2.jar [2017-12-12T22:57:40]
commons-configuration:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-configuration-1.10.jar [2019-04-15T11:39:02]
/norconex-collector-http/lib/commons-configuration-1.10.jar [2017-12-12T22:54:36]
log4j:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/log4j-1.2.17.jar [2019-02-14T08:48:40]
/norconex-collector-http/lib/log4j-1.2.17.jar [2017-12-12T22:54:38]
velocity:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/velocity-1.7.jar [2019-03-22T08:57:38]
/norconex-collector-http/lib/velocity-1.7.jar [2017-12-12T22:54:36]
commons-lang3:
* /norconex-collector-http/lib/commons-lang3-3.6.jar [2017-12-12T23:06:30]
/google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-lang3-3.5.jar [2019-04-15T11:39:02]
guava:
* /google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/guava-17.0.jar [2019-05-15T15:34:52]
/norconex-collector-http/lib/guava-17.0.jar [2017-12-12T23:15:26]
httpclient:
* /norconex-collector-http/lib/httpclient-4.5.3.jar [2017-12-12T23:05:56]
/google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/httpclient-4.5.2.jar [2018-12-17T09:17:18]
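Most entries in the listing above are identical copies. Six artifacts differ in version between the two lib folders: httpcore, jackson-core, commons-codec, norconex-commons-lang, commons-lang3 and httpclient. A minimal Python sketch for picking out such mismatches from a list of jar paths follows. It is not the find-dup-jars.sh script. The function names and the file-name pattern are assumptions made for illustration, and the pattern only covers `name-version.jar` style names.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed naming scheme: "<artifact>-<version>.jar", where the version
# starts with a digit (e.g. "commons-lang3-3.6.jar").
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def split_jar_name(filename):
    """Return (artifact, version), or None if the name does not match."""
    match = JAR_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("version")


def version_conflicts(jar_paths):
    """Map each artifact seen with more than one version to its versions."""
    versions = defaultdict(set)
    for path in jar_paths:
        parsed = split_jar_name(PurePosixPath(path).name)
        if parsed:
            name, version = parsed
            versions[name].add(version)
    # Plain string sort: good enough for display, not a real version ordering.
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


paths = [
    "/norconex-collector-http/lib/httpcore-4.4.6.jar",
    "/google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/httpcore-4.4.5.jar",
    "/norconex-collector-http/lib/commons-io-2.5.jar",
    "/google-cloudsearch-norconex-committer-plugin-v1-0.0.4/lib/commons-io-2.5.jar",
]
print(version_conflicts(paths))  # {'httpcore': ['4.4.5', '4.4.6']}
```

Identical duplicates such as commons-io are dropped from the result, so only the artifacts that can actually clash at class-loading time remain.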
@donghanmiao @TanmayVartak, would it be possible to shade those dependencies inside the norconex-committer-plugin?
@donghanmiao @TanmayVartak, any updates on this? I'd like to hear your opinion on the subject before trying to find another workaround.
Thank you for reporting this, I've filed a bug internally to track this. I will provide an update asap.
We were able to reproduce the issue with versions 2.8.0 and 2.8.1. I will post an update when we have a solution.
To avoid the issue, please use option 3 (Do not copy source Jar (leave target Jar as is)) when copying the unzipped jars. We will make it the recommended option going forward in the developer documentation.
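As a rough illustration of what "leave target Jar as is" amounts to, the sketch below copies jars from a plugin lib folder into the collector lib folder and skips every artifact the target already has, whatever its version. This is not the Norconex install script. `copy_missing_jars`, `artifact_name` and the file-name pattern are hypothetical, and the real installer may decide differently.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme: "<artifact>-<version>.jar", version starts with a digit.
JAR_PATTERN = re.compile(r"^(?P<name>.+?)-\d[\w.\-]*\.jar$")


def artifact_name(jar):
    """Strip the version suffix; fall back to the bare file stem."""
    match = JAR_PATTERN.match(jar.name)
    return match.group("name") if match else jar.stem


def copy_missing_jars(source_lib, target_lib):
    """Copy only jars whose artifact is absent from target_lib.

    Jars already present in the target are never overwritten or
    duplicated, even when the source carries a different version.
    """
    source_lib, target_lib = Path(source_lib), Path(target_lib)
    present = {artifact_name(jar) for jar in target_lib.glob("*.jar")}
    copied = []
    for jar in sorted(source_lib.glob("*.jar")):
        if artifact_name(jar) in present:
            continue  # keep the target's copy untouched
        shutil.copy2(jar, target_lib / jar.name)
        copied.append(jar.name)
    return copied
```

Called with the plugin's lib folder as `source_lib` and the collector's lib folder as `target_lib`, it returns the names of the jars it actually copied.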
We will update the old dependencies used in the committer, which may have caused this issue. We expect to include this update in our next release coming in the following weeks.
Hello @donghanmiao, thank you for your support. I'm looking forward to the next release.
@donghanmiao great thanks
Any indication on when a new release will be published?
Hopefully this week. We are resolving some issues with our internal open-source process, which took longer than expected.
We've put together the new release in https://github.com/google-cloudsearch/norconex-committer-plugin/pull/12. However, our testing backend is not working properly and we are working on the fix. In case you need the latest version, please use this pull request.
Hi @donghanmiao, any ETA on when we can expect the official release?
Best regards Sascha
released.
Hi @donghanmiao,
we updated the committer to the latest release, v1-0.0.5, and the issue still exists. Please see the attached meta files of the crawled pages.
Only the page with the large body contains the test meta tag; the other files, with small and empty bodies, are still missing it.
1563864842933000000-add.meta.txt 1563864842937000000-add.meta.txt 1563864843365000000-add.meta.txt
Best regards Sascha
Hi Sascha, did you use a clean installation? What option did you use when installing the jars?
@donghanmiao Confirmed issue resolved 👍
Great thanks
We found an issue related to the dependencies that the committer plugin installs into the lib folder. The issue causes meta tags not to be extracted properly.
The behavior is only reproducible if the body of the page contains a small amount of content. On pages with a large amount of content, the issue is not reproducible.
To reproduce the behavior please use the following files:
https://storage.googleapis.com/sascha-issue-reproduction/emptyBody.html Contains an empty body, which obviously leads to an empty .cntnt file. But the existing meta tag is not extracted.
https://storage.googleapis.com/sascha-issue-reproduction/smallBody.html Contains a small amount of text in the body, but the .cntnt file is still empty and the meta tag is still not extracted.
https://storage.googleapis.com/sascha-issue-reproduction/largeBody.html After adding more content to the body, the content and the meta tag are extracted properly.
The html files contain a meta tag for testing
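To rule out the source pages themselves, it can help to confirm independently of the crawler that a page carries the meta tag and how much body text it has. Below is a minimal sketch using only Python's standard `html.parser`. The sample markup, the tag name `test-tag` and its value are placeholders, not the content of the files linked above.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs and the text inside <body>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.body_text = []
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and "name" in attributes:
            self.meta[attributes["name"]] = attributes.get("content", "")
        elif tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self.body_text.append(data)


def inspect_page(html):
    """Return (meta tags, stripped body text) for an HTML string."""
    parser = MetaTagCollector()
    parser.feed(html)
    parser.close()
    return parser.meta, "".join(parser.body_text).strip()


# Placeholder page with an empty body; the tag name and value are made up.
SAMPLE = (
    '<html><head><meta name="test-tag" content="test-value"></head>'
    "<body></body></html>"
)
meta, body = inspect_page(SAMPLE)
print(meta, len(body))
```

If a check like this finds the tag in the empty-body and small-body pages but the crawler output lacks it, the loss happens during crawling or committing and not in the HTML.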
After some testing we found out that this issue occurs as soon as we install the Cloud Search Norconex HTTP Collector committer plugin.
Steps to reproduce working case
This reproduction step can be used to verify that the meta tag extraction is working properly in the norconex default setup.
Steps to reproduce failure case
This reproduction step can be used to reproduce the error case.
Versions
Keep in mind
If you reproduce the behavior with the Norconex example configuration please keep in mind to remove:
Best regards Sascha