Follow-up discussions:
http://stackoverflow.com/questions/2430244/making-gwt-application-crawlable-by-a-search-engine
http://groups.google.com/group/google-web-toolkit/browse_thread/thread/15a922e701e9e2db?hl=en
Related branch:
http://code.google.com/p/google-web-toolkit/source/browse/branches/crawlability/
Related code review:
http://groups.google.com/group/google-web-toolkit-contributors/browse_thread/thread/88d4983324d328c5
Related issue in HtmlUnit:
http://sourceforge.net/tracker/index.php?func=detail&aid=2962074&group_id=47038&atid=448269#
Original comment by philippe.beaudoin
on 27 Mar 2010 at 5:45
Original comment by philippe.beaudoin
on 27 Mar 2010 at 5:50
Bumping to v0.2 pending progress on HtmlUnit.
Original comment by philippe.beaudoin
on 1 Apr 2010 at 6:23
This is now available in the trunk. See PuzzleBazar revision 151 for an example and for all the required jars. Briefly, here is what you need to do:
- Copy all the required jars to your /war/WEB-INF/lib folder.
- All the required jars are in htmlunit-package.zip from the download section.
- Alternatively, get htmlunit-r5662-gae.jar from the download section and see http://htmlunit.sourceforge.net/dependencies.html for the other dependencies.
- Rename your .html to a .jsp (otherwise filters are skipped).
- Make the corresponding change to .jsp in your web.xml.
- Make sure all your crawlable name tokens start with !, e.g. MyApp.jsp#!MainPage.
- In your configureServlets method, before the call to serve(), add (as sketched below):
  filter("*.jsp").through( CrawlFilter.class )
That's all!
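For reference, here is a minimal sketch of that wiring (the CrawlFilter package name is taken from the stack traces later in this thread; the module name and the serve() comment are illustrative assumptions):

  import com.google.inject.servlet.ServletModule;
  import com.philbeaudoin.gwtp.crawler.server.CrawlFilter;

  public class MyServletModule extends ServletModule {
    @Override
    protected void configureServlets() {
      // Route every .jsp request through the crawl filter before any
      // serve() bindings, so crawler requests are intercepted first.
      filter("*.jsp").through( CrawlFilter.class );
      // serve(...) bindings for your dispatch servlets go here.
    }
  }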
To test:
- Compile with the GWT compiler (it doesn't work in development mode)
- Run as... Web application
- Navigate to:
http://127.0.0.1:8888/YOUR_APP.jsp?_escaped_fragment_=DESIRED_NAME_TOKEN
You should get a static version of the page.
For more details on how to configure your application so that it is discovered by the spider, see:
http://code.google.com/web/ajaxcrawling/
Original comment by philippe.beaudoin
on 13 May 2010 at 9:26
Not quite done yet; it fails frequently on AppEngine.
Now you need to provide the WebClient in the ServletModule:
  @Provides
  @SessionScoped
  WebClient provideWebClient() {
    // WebClient and BrowserVersion come from com.gargoylesoftware.htmlunit.
    return new WebClient( BrowserVersion.FIREFOX_3 );
  }
You can also specify the desired timeout for HtmlUnit (in ms):
  bindConstant().annotatedWith( HtmlUnitTimeout.class ).to( 15000L );
Original comment by philippe.beaudoin
on 13 May 2010 at 9:14
I think I'm about to nail that one. From:
http://code.google.com/appengine/docs/java/urlfetch/overview.html#Requests
It looks like a servlet is prohibited from making a request to its own URI. There must be a way to work around that restriction.
Original comment by philippe.beaudoin
on 15 May 2010 at 9:30
More details...
The problem doesn't seem to be accessing the same URI, but rather the fact that the request is not served by a different thread. More details:
I request:
http://puzzlebazaar.appspot.com?_escaped_fragment_=main
In the App Engine log I see the following:
The request for http://puzzlebazaar.appspot.com?_escaped_fragment_=main starts, it goes through the CrawlFilter, and then tries to request the URL http://puzzlebazaar.appspot.com#!main, and then I get an IOException:
com.philbeaudoin.gwtp.crawler.server.CrawlFilter logStackTrace:
java.util.concurrent.ExecutionException: java.io.IOException: Timeout while fetching: http://puzzlebazaar.appspot.com#!main
This exception is caught, and the servlet continues and terminates normally.
THEN, the request for http://puzzlebazaar.appspot.com#!main starts, but from the timing it's already too late: the requesting servlet has terminated.
- - - - -
As a side note, if I request a static page (say http://puzzlebazaar.appspot.com/staticpage.html) then everything works.
So it really seems to me that the filters/servlets in charge of handling http://puzzlebazaar.appspot.com#!main do not start; my analysis is that the App Engine servlet container refuses to spawn a new thread for that new request.
From the logs, I see the following:
- www.
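For clarity, the failing pattern boils down to something like this sketch (an illustration of the problem, not the actual CrawlFilter code; on AppEngine, java.net.URL connections go through URLFetch):

  import java.io.IOException;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import javax.servlet.*;

  public class SelfFetchFilter implements Filter {
    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse resp,
        FilterChain chain) throws IOException, ServletException {
      // Fetch a page from this very same application. AppEngine appears to
      // queue the inner request until the outer one completes, so this call
      // times out instead of being served in parallel by another thread.
      URL self = new URL("http://puzzlebazaar.appspot.com/");
      HttpURLConnection connection = (HttpURLConnection) self.openConnection();
      connection.setConnectTimeout(15000);
      connection.setReadTimeout(15000);
      connection.getResponseCode(); // times out on AppEngine
      chain.doFilter(req, resp);
    }
  }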
Original comment by philippe.beaudoin
on 16 May 2010 at 10:58
Bumping up. This is blocked for release 0.2, at least on AppEngine. It should work on other servers, but I can't really test it.
Original comment by philippe.beaudoin
on 17 May 2010 at 5:25
Original comment by philippe.beaudoin
on 18 May 2010 at 6:31
Still no news on this one. Posted on AppEngine:
http://groups.google.com/group/google-appengine/browse_thread/thread/28a9f9737b1b26b5
Original comment by philippe.beaudoin
on 27 Jun 2010 at 5:49
HtmlUnit 2.8 supports App Engine, apparently.
Original comment by matt2224
on 18 Aug 2010 at 7:37
Excellent! I compiled HtmlUnit with a patch to support AppEngine before, but
encountered an internal problem in AppEngine. I'll take a look soon to see if
every problem resolved itself magically!
Original comment by philippe.beaudoin
on 18 Aug 2010 at 8:06
Great! Please keep us updated on your findings. Crawling is pretty darn important in today's online world.
Original comment by matt2224
on 19 Aug 2010 at 12:29
Chris Jacob proposed to do an AppEngine-based service for anybody who needs
HTMLunit to process their page. This strikes me as a brilliant idea that could
already be done even with the current problem discussed in this thread.
Somebody should do it!
http://stackoverflow.com/questions/3517944/making-ajax-applications-crawlable-how-to-build-a-simple-web-service-on-google-a
Original comment by philippe.beaudoin
on 19 Aug 2010 at 2:02
http://code.google.com/p/gwt-platform/issues/detail?id=1#c4
I've just tried out what you said here, but with HtmlUnit 2.8 on GAE, and I get
a server error.
Uncaught exception from servlet
java.lang.NoSuchMethodError:
com.gargoylesoftware.htmlunit.WebClient.pumpEventLoop(J)I
at com.gwtplatform.crawler.server.CrawlFilter.doFilter(CrawlFilter.java:124)
at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:129)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:59)
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:122)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:110)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at com.google.apphosting.utils.servlet.ParseBlobUploadFilter.doFilter(ParseBlobUploadFilter.java:97)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at com.google.apphosting.runtime.jetty.SaveSessionFilter.doFilter(SaveSessionFilter.java:35)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at com.google.apphosting.utils.servlet.TransactionCleanupFilter.doFilter(TransactionCleanupFilter.java:43)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
at com.google.apphosting.runtime.jetty.AppVersionHandlerMap.handle(AppVersionHandlerMap.java:238)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
at com.google.apphosting.runtime.jetty.RpcRequestParser.parseAvailable(RpcRequestParser.java:76)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at com.google.apphosting.runtime.jetty.JettyServletEngineAdapter.serviceRequest(JettyServletEngineAdapter.java:135)
at com.google.apphosting.runtime.JavaRuntime.handleRequest(JavaRuntime.java:251)
at com.google.apphosting.base.RuntimePb$EvaluationRuntime$6.handleBlockingRequest(RuntimePb.java:6784)
at com.google.apphosting.base.RuntimePb$EvaluationRuntime$6.handleBlockingRequest(RuntimePb.java:6782)
at com.google.net.rpc.impl.BlockingApplicationHandler.handleRequest(BlockingApplicationHandler.java:24)
at com.google.net.rpc.impl.RpcUtil.runRpcInApplication(RpcUtil.java:398)
at com.google.net.rpc.impl.Server$2.run(Server.java:852)
at com.google.tracing.LocalTraceSpanRunnable.run(LocalTraceSpanRunnable.java:56)
at com.google.tracing.LocalTraceSpanBuilder.internalContinueSpan(LocalTraceSpanBuilder.java:576)
at com.google.net.rpc.impl.Server.startRpc(Server.java:807)
at com.google.net.rpc.impl.Server.processRequest(Server.java:369)
at com.google.net.rpc.impl.ServerConnection.messageReceived(ServerConnection.java:442)
at com.google.net.rpc.impl.RpcConnection.parseMessages(RpcConnection.java:319)
at com.google.net.rpc.impl.RpcConnection.dataReceived(RpcConnection.java:290)
at com.google.net.async.Connection.handleReadEvent(Connection.java:474)
at com.google.net.async.EventDispatcher.processNetworkEvents(EventDispatcher.java:831)
at com.google.net.async.EventDispatcher.internalLoop(EventDispatcher.java:207)
at com.google.net.async.EventDispatcher.loop(EventDispatcher.java:103)
at com.google.net.rpc.RpcService.runUntilServerShutdown(RpcService.java:251)
at com.google.apphosting.runtime.JavaRuntime$RpcRunnable.run(JavaRuntime.java:418)
at java.lang.Thread.run(Unknown Source)
Original comment by matt2224
on 19 Aug 2010 at 1:39
[deleted comment]
Hmm, I've somewhat got it working, although I get the following error when trying to 'crawl' certain pages:
com.gargoylesoftware.htmlunit.ScriptException: Exception invoking jsxGet_cookie
Original comment by matt2224
on 19 Aug 2010 at 6:09
Meanwhile, I filed an AppEngine issue for my problem. Please star it:
http://code.google.com/p/googleappengine/issues/detail?id=3602
Original comment by philippe.beaudoin
on 19 Aug 2010 at 6:11
Re: Uncaught exception from servlet
An older version of HtmlUnit is included in the App Engine jar; IIRC, I got rid of this error by reordering the dependencies in my Build Path.
Original comment by philippe.beaudoin
on 19 Aug 2010 at 6:13
Yes, that gets rid of that error. :)
I'll star the issue, although http://ajax-crawler.appspot.com/ works absolutely fine. Do you know where I could get the source code for that app? I've emailed Amit about it; hopefully he'll reply. In all the cases where my app fails, his works, so maybe he's doing something slightly magical.
Original comment by matt2224
on 19 Aug 2010 at 6:17
Note: http://ajax-crawler.appspot.com/ works absolutely fine with the same URL.
Original comment by matt2224
on 19 Aug 2010 at 6:25
So does my app, as long as the page is served statically, which I'm ready to bet is the case for the ajax-crawler app. I've never been able to get Amit to confirm that his app works well with dynamically generated content, which is needed for any app that fetches most of its content via an XMLHttpRequest (i.e. a typical GWT app).
Original comment by philippe.beaudoin
on 19 Aug 2010 at 6:35
Scratch that, it looks like ajax-crawler is using GWT and XMLHttpRequest. I
really don't know where my problem comes from then. I will give it another shot
soon and try to diagnose better.
Original comment by philippe.beaudoin
on 19 Aug 2010 at 6:36
Ah, well I tried a few dynamic websites with ajax-crawler (try entering http://www.google.com/codesearch/p?hl=en#RNY0MQIrFHY/tcl/compat/zlib/examples/zpipe.c&q=exa&sa=N&cd=8&ct=rc), and it worked as long as they didn't exceed the 30-second limit. Whether it would work from the same URL, I don't know.
Original comment by matt2224
on 19 Aug 2010 at 6:47
This worked from my app too. My problem is that I got it down to a very minimal
app with only a couple of lines, fetching itself, and this is what failed.
Maybe it's only if the initial request is dynamic. (I was serving a .jsp)
Original comment by philippe.beaudoin
on 19 Aug 2010 at 7:00
Hmm, can you host the source for your crawler test app somewhere please? I'm
having trouble even getting that level of functionality. I keep getting
"com.gargoylesoftware.htmlunit.ScriptException: Exception invoking
jsxGet_cookie"
Thanks!
Original comment by matt2224
on 19 Aug 2010 at 7:03
I'm very excited to see you guys are trying to nut this one out! I've subscribed and will help wherever possible... If you make your tests open source that would help ;-)
Original comment by i.chris....@gmail.com
on 20 Aug 2010 at 7:48
Philippe? Any chance of getting the code you have?
Original comment by matt2224
on 22 Aug 2010 at 12:57
Hi Matt,
Sorry, I don't have much besides PuzzleBazar, which you can get at http://code.google.com/p/puzzlebazar/. The rest were short-lived experiments I never committed anywhere.
As soon as I have some time to work on this I'll make sure to let you know. I'll make investigating this a priority for release 0.5. If I still run into the previous problem I'll fall back on an AppEngine-based service similar to what you propose.
Original comment by philippe.beaudoin
on 22 Aug 2010 at 1:05
Ok. Thank you, I'll check out the PuzzleBazar code.
Original comment by matt2224
on 22 Aug 2010 at 1:11
Ideally I would love to see your HTMLunit work extracted out to a separate project: the bare basics to get HTMLunit functional on GAE to produce HTML snapshots... a basic form to submit a URL (to test the API) and return results as text/plain (?).
I don't have any contacts in the Java community... but it would be great to get others onboard if you guys know anyone.
This project could be hugely popular... particularly with Google boasting Ajax crawlability with HTMLunit (but not offering any useful instructions to get people started!).
Also, a separate app (service) would get around your current issues with same-domain queries failing.
FYI, I'm working on an alternative to Google's #! approach. Early days, but the project has started here: http://github.com/chrisjacob/headlessajax
My idea should make JavaScript content crawlable by all search engines and accessible to screen readers and people with JS disabled. Essentially, the server delivers the page containing anchor elements <a></a> whose hrefs carry GET "?" params; if JS is enabled, these URLs are replaced with hash "#" params on DOMready, for non-page-refreshing dynamic content (Ajax). E.g. an anchor tag with href www.example.com?page=home&tab=tabA is converted to www.example.com#page=home&tab=tabA (a toy sketch of this rewrite follows below). "?" URLs are indexable and followable by non-JS users. "#" URLs cause no refresh and make Ajax content deep-linkable for bookmarking, sharing and history (back/forward). Basically, we offload JS to the server side for visitors that don't have JS on the client side, using HTMLunit to generate static HTML snapshots for any GET requests. (There will be an option to prevent "?" to "#" conversion on an element, and also an option to prevent HTML snapshots being delivered for genuine GET requests.)
I'm pretty excited about this idea as I feel it will be an excellent alternative to Google's approach.
Setting up HTMLunit for free on GAE as a service means either approach could be easily adopted by the "average" web developer...
Wish I knew more Java right now so I could be more helpful on that front ;-)
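As a toy illustration of that rewrite (my own sketch, not code from the headlessajax project; the real conversion would run client-side on DOMready):

  // www.example.com?page=home&tab=tabA -> www.example.com#page=home&tab=tabA
  static String toHashUrl(String href) {
    int q = href.indexOf('?');
    if (q < 0 || href.indexOf('#') >= 0) {
      return href; // nothing to rewrite
    }
    return href.substring(0, q) + "#" + href.substring(q + 1);
  }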
Original comment by i.chris....@gmail.com
on 22 Aug 2010 at 2:55
Problem is, you want to avoid using HtmlUnit to generate snapshots as much as possible -- it's much too expensive CPU-wise.
Original comment by matt2224
on 22 Aug 2010 at 3:10
Yes, a caching mechanism is planned for this feature. After all, you don't expect your site to be crawled by the spider every 10 minutes.
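A minimal sketch of the kind of cache being contemplated (the names and the one-day lifetime are assumptions; the service that eventually shipped stores CachedPages in the AppEngine datastore, per the discussion near the end of this thread):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  public class SnapshotCache {
    private static final long LIFETIME_MS = 24L * 60 * 60 * 1000; // one day

    private static class Entry {
      final String html;
      final long fetchedAt;
      Entry(String html, long fetchedAt) {
        this.html = html;
        this.fetchedAt = fetchedAt;
      }
    }

    private final Map<String, Entry> cache =
        new ConcurrentHashMap<String, Entry>();

    // Returns the cached snapshot, or null if absent or expired.
    public String get(String url) {
      Entry e = cache.get(url);
      if (e == null
          || System.currentTimeMillis() - e.fetchedAt > LIFETIME_MS) {
        return null;
      }
      return e.html;
    }

    public void put(String url, String html) {
      cache.put(url, new Entry(html, System.currentTimeMillis()));
    }
  }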
Original comment by philippe.beaudoin
on 22 Aug 2010 at 5:14
When I was speaking about it being expensive, I was referring to Chris Jacob's
idea of using HtmlUnit for people who don't have JS enabled.
Original comment by matt2224
on 22 Aug 2010 at 6:29
Thanks Matt for the warning re: CPU usage. Caching is certainly important to help mitigate some of the load. Spiders and non-JS users should only make up a VERY small percentage of visitors... so it would be interesting to see how things perform with this in mind.
Also, the HTML snapshot generation is being offloaded to another server (for now GAE).
I'm thinking of distinguishing between crawlers and non-JS users... so I guess you could selectively enable the snapshots only for search engines and not for users. Or you could have a second GAE app for handling non-JS user snapshots to split the load. Some aggressive caching could be added if CPU becomes a big issue.
Either way, I still think it's exciting that the JS could be offloaded to the server like this :-)
Original comment by i.chris....@gmail.com
on 22 Aug 2010 at 11:21
No progress on this now. Let's try to make it a priority for 0.5.
Original comment by philippe.beaudoin
on 2 Sep 2010 at 10:19
Original comment by philippe.beaudoin
on 22 Sep 2010 at 1:24
After sorting through the various forums, I decided to write a blog post to put everything I did in one place. The biggest problem now appears to be startup times for your instance on App Engine, resulting in timeouts. The other thing to watch is to make sure that you don't use the redirects in an irresponsible way, leading to infinite loops.
The blog post is here:
http://www.ozdroid.com/#!BLOG/2010/10/12/How_to_Make_Google_AppEngine_Applications_Ajax_Crawlable
Geoff
Original comment by geoff.br...@gmail.com
on 14 Oct 2010 at 3:27
Thanks! A lot seems to be happening on that front...
Original comment by philippe.beaudoin
on 14 Oct 2010 at 3:48
Hi guys,
I have tested http://www.ozdroid.com/ with a simple Ajax Google bot and the server encounters an error. It seems your filter transforms the URL in the wrong way. The error is:
NOTE: This page will be indexed with the following URL: http://www.ozdroid.com/?#!BLOG
I guess the cause is the "?". It should be http://www.ozdroid.com/#!BLOG
Original comment by abart...@gmail.com
on 15 Oct 2010 at 7:13
Firstly, Abartkiv, please don't blame the good people on this project for my code. I am not on this project and only posted here as I thought they might be interested. They have certainly made good posts that interest me.
I'm not sure where you mean my mistake is, but rewriteQueryString() is passed the query string without the "?"; that is, it is passed
"_escaped_fragment_=BLOG"
which is transformed to
"#!BLOG"
Here's part of the server log:
I 10-15 02:02AM 07.525 com.ozdroid.website.server.ajax_crawling.CrawlServlet rewriteQueryString: Start: _escaped_fragment_=BLOG
I 10-15 02:02AM 07.525 com.ozdroid.website.server.ajax_crawling.CrawlServlet rewriteQueryString: End: #!BLOG
I hope that's what you are talking about. I have actually had GoogleBot successfully crawl the dynamic content, but the timeouts, mostly caused by URLFetch, have been causing a headache. Attached is what GoogleBot returned. The post is in the datastore, and if this lays out OK you can find it at the bottom.
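For reference, the rewrite the log shows could be implemented along these lines (a sketch inferred from the Start/End log output; rewriteQueryString is Geoff's method name, but the body here is an assumption):

  import java.io.UnsupportedEncodingException;
  import java.net.URLDecoder;

  static String rewriteQueryString(String queryString)
      throws UnsupportedEncodingException {
    final String marker = "_escaped_fragment_=";
    int i = queryString.indexOf(marker);
    if (i < 0) {
      return "?" + queryString; // not a crawler request; keep the query as-is
    }
    // "_escaped_fragment_=BLOG" -> "#!BLOG" (the value arrives URL-encoded)
    String fragment = queryString.substring(i + marker.length());
    return "#!" + URLDecoder.decode(fragment, "UTF-8");
  }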
Original comment by geoff.br...@gmail.com
on 15 Oct 2010 at 9:25
Hi,
I didn't want to blame you or anybody else :). I just figured out that issue and shared it because crawling on App Engine is important to me too. So far I have not succeeded, and I read that group quite often.
Original comment by abart...@gmail.com
on 15 Oct 2010 at 12:32
Bumping to 0.6, preparing release 0.5.
Original comment by philippe.beaudoin
on 25 Jan 2011 at 6:33
Finally managed to tackle this one!
GWTP now includes:
1) gwtp-crawler-service.jar, a simple service that can run on AppEngine and which uses the following simple API:
http://mycrawlerservice.appspot.com/?key=mykey&url=http://urlencoded.url.to.render
2) gwtp-crawler.jar, a simple filter that will intercept any URL containing an _escaped_fragment_ parameter and render it using the service described above.
Retry and caching strategies are used to get around AppEngine limitations (30s connection and 10s max URLFetch). Even then, it's highly recommended that you make both your app and your crawler service paid apps with Always On.
An example of a crawler service has been added to GWTP's examples. It is deployed on AppEngine; try:
http://crawlservice.appspot.com?key=123456&url=http://google.com
The gwtp-sample-hplace sample uses gwtp-crawler. It is also deployed on AppEngine; try:
http://hplacedemo.appspot.com/?_escaped_fragment_=homePage
The above examples may fail as they are not paid apps.
I'll leave the issue open for now as a reminder that documentation is needed in the wiki.
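Calling the service is a plain HTTP GET. A minimal client sketch using the demo key and service URL above (the class name is illustrative; note the target URL must be URL-encoded):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.net.URLEncoder;

  public class CrawlerServiceClient {
    public static void main(String[] args) throws Exception {
      String target = URLEncoder.encode("http://google.com", "UTF-8");
      URL service = new URL(
          "http://crawlservice.appspot.com/?key=123456&url=" + target);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(service.openStream(), "UTF-8"));
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // the statically rendered page
      }
      in.close();
    }
  }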
Original comment by philippe.beaudoin
on 6 May 2011 at 3:27
Yeah! Done! The doc is here:
http://code.google.com/p/gwt-platform/wiki/CrawlerSupport
Original comment by philippe.beaudoin
on 7 May 2011 at 1:29
Thanks a lot for fixing this long-pending issue :)
I have one quick question. The gwtp-crawler-service uses Google App Engine's datastore to store the CachedPages, so I guess I need to make changes to store the CachedPages in some other database if I want to deploy gwtp-crawler-service on some other servlet container (Tomcat etc.). Can you please confirm?
The reason I asked this question is that the gwt-platform wiki says the following about gwtp-crawler-service: "It is designed to run on AppEngine but can be called by your application even if this one runs on another servlet container."
Original comment by jchaga...@gmail.com
on 10 May 2011 at 4:44
Yeah, I probably need to make that clearer on the wiki. Basically, the service itself is tightly coupled with AppEngine, but the filter will work with any service that conforms to the ?key=xyz&url=http://something API. So you can rewrite your own service to run on another backend and use the filter with it. It shouldn't be too hard to do given how simple gwtp-crawler-service is. (However, if you do write it to run on another backend, I and many others would no doubt appreciate it if you considered open sourcing the result! :))
That being said, nothing prohibits you from using AppEngine for your crawler service but Tomcat for your application. You can even use
http://crawlservice.appspot.com/?key=123456
to test things out, so you only have to get the crawler filter working to start with.
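To make the contract concrete, a bare-bones service honoring that API on another servlet container might look like this sketch (an assumption of how such a service could look, not the actual gwtp-crawler-service code, which also adds the retry and datastore caching logic):

  import java.io.IOException;
  import javax.servlet.http.HttpServlet;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;
  import com.gargoylesoftware.htmlunit.BrowserVersion;
  import com.gargoylesoftware.htmlunit.WebClient;
  import com.gargoylesoftware.htmlunit.html.HtmlPage;

  public class SimpleCrawlerService extends HttpServlet {
    private static final String EXPECTED_KEY = "123456"; // shared secret

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws IOException {
      if (!EXPECTED_KEY.equals(req.getParameter("key"))) {
        resp.sendError(HttpServletResponse.SC_FORBIDDEN);
        return;
      }
      WebClient client = new WebClient(BrowserVersion.FIREFOX_3);
      try {
        HtmlPage page = client.getPage(req.getParameter("url"));
        // Give the application's JavaScript time to render the page.
        client.waitForBackgroundJavaScript(10000);
        resp.setContentType("text/html; charset=UTF-8");
        resp.getWriter().write(page.asXml());
      } finally {
        client.closeAllWindows();
      }
    }
  }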
Original comment by philippe.beaudoin
on 10 May 2011 at 6:58
Sure... I would be more than glad to open source the result to be used for other servlet containers. I will update you as soon as I am done.
Original comment by jchaga...@gmail.com
on 10 May 2011 at 9:58
Original issue reported on code.google.com by
philippe.beaudoin
on 27 Mar 2010 at 5:44