google-code-export / gwt-platform

Automatically exported from code.google.com/p/gwt-platform

Support search engine crawling #1

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Why not put this into the application?

http://code.google.com/web/ajaxcrawling/

It would make a good demo.

(Submitted by second.comet)

Link to original issue:
http://code.google.com/p/puzzlebazar/issues/detail?id=27

Original issue reported on code.google.com by philippe.beaudoin on 27 Mar 2010 at 5:44

GoogleCodeExporter commented 9 years ago
Follow-up discussions:
http://stackoverflow.com/questions/2430244/making-gwt-application-crawlable-by-a-search-engine
http://groups.google.com/group/google-web-toolkit/browse_thread/thread/15a922e701e9e2db?hl=en

Related branch:
http://code.google.com/p/google-web-toolkit/source/browse/branches/crawlability/

Related code-review:
http://groups.google.com/group/google-web-toolkit-contributors/browse_thread/thread/88d4983324d328c5

Related issue in HTML Unit:
http://sourceforge.net/tracker/index.php?func=detail&aid=2962074&group_id=47038&atid=448269#

Original comment by philippe.beaudoin on 27 Mar 2010 at 5:45

GoogleCodeExporter commented 9 years ago

Original comment by philippe.beaudoin on 27 Mar 2010 at 5:50

GoogleCodeExporter commented 9 years ago
Bumping to V0.2 pending progress on HTML Unit.

Original comment by philippe.beaudoin on 1 Apr 2010 at 6:23

GoogleCodeExporter commented 9 years ago
This is now available in the trunk. See PuzzleBazar revision 151 for an example
and for all the required jars. Briefly, here is what you need to do:

- Copy all the required jars to your /war/WEB-INF/lib folder.
  - All the required jars are in htmlunit-package.zip from the download section.
  - Alternatively, get htmlunit-r5662-gae.jar from the download section and see
    http://htmlunit.sourceforge.net/dependencies.html for the other dependencies.
- Rename your .html to a .jsp (otherwise filters are skipped).
- Make the corresponding change to .jsp in your web.xml.
- Make sure all your crawlable name tokens start with !, e.g.
  MyApp.jsp#!MainPage
- In your configureServlets method, before the call to serve(), add (see the
  sketch below):
    filter("*.jsp").through( CrawlFilter.class )

That's all!
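
In context, the wiring might look like the following. This is only a sketch:
CrawlFilter comes from the steps above, but the module name and the serve()
mapping are placeholders for whatever your application already binds.

    // Hypothetical ServletModule showing where the crawl filter goes.
    public class MyServletModule extends ServletModule {
      @Override
      protected void configureServlets() {
        // Register the crawl filter first so it can intercept
        // ?_escaped_fragment_= requests before normal serving.
        filter("*.jsp").through(CrawlFilter.class);
        serve("/myapp/dispatch").with(DispatchServlet.class); // placeholder
      }
    }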

To test:
- Compile with the GWT compiler (it doesn't work in development mode)
- Run as... Web application
- Navigate to:
    http://127.0.0.1:8888/YOUR_APP.jsp?_escaped_fragment_=DESIRED_NAME_TOKEN

You should get a static version of the page.

For more details on how to configure your application so that it is discovered
by the spider, see:
http://code.google.com/web/ajaxcrawling/

Original comment by philippe.beaudoin on 13 May 2010 at 9:26

GoogleCodeExporter commented 9 years ago
Not quite done yet; it fails frequently on AppEngine.

Now you need to provide the WebClient in the ServletModule:

  @Provides
  @SessionScoped
  WebClient provideWebClient() {
    // One HtmlUnit browser emulator per session.
    return new WebClient(BrowserVersion.FIREFOX_3);
  }

You can also specify the desired timeout for HtmlUnit (in ms):
    bindConstant().annotatedWith( HtmlUnitTimeout.class ).to( 15000L );
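
Pulled together, the module additions might look like this. A sketch only: the
imports are assumptions, and HtmlUnitTimeout's import is omitted since its
package depends on your GWTP version.

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.google.inject.Provides;
    import com.google.inject.servlet.ServletModule;
    import com.google.inject.servlet.SessionScoped;

    public class CrawlerModule extends ServletModule {
      @Override
      protected void configureServlets() {
        // Give HtmlUnit up to 15 seconds to render before giving up.
        bindConstant().annotatedWith(HtmlUnitTimeout.class).to(15000L);
      }

      @Provides
      @SessionScoped
      WebClient provideWebClient() {
        return new WebClient(BrowserVersion.FIREFOX_3);
      }
    }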

Original comment by philippe.beaudoin on 13 May 2010 at 9:14

GoogleCodeExporter commented 9 years ago
I think I'm about to nail that one. From:
http://code.google.com/appengine/docs/java/urlfetch/overview.html#Requests

It looks like a servlet is prohibited from making a request to its own URI.
There must be a way to work around that restriction.

Original comment by philippe.beaudoin on 15 May 2010 at 9:30

GoogleCodeExporter commented 9 years ago
More details...

The problem doesn't seem to be accessing the same URI, but rather the fact
that the request is not served by a different thread. More details:

I request:
http://puzzlebazaar.appspot.com?_escaped_fragment_=main

In App Engine log I see the following:

The request for http://puzzlebazaar.appspot.com?_escaped_fragment_=main
starts, it goes through the CrawlFilter, and then tries to request the URL
http://puzzlebazaar.appspot.com#!main, and then I get an IOException:

com.philbeaudoin.gwtp.crawler.server.CrawlFilter logStackTrace:
java.util.concurrent.ExecutionException: java.io.IOException: Timeout while
fetching: http://puzzlebazaar.appspot.com#!main

This exception is caught, and the servlet continues and terminates normally.

THEN, the request for http://puzzlebazaar.appspot.com#!main starts, but from
the timing it's already too late: the requesting servlet has terminated.

- - - - -

As a side note, if I request a static page (say
http://puzzlebazaar.appspot.com/staticpage.html) then everything works.

So it really seems to me that the filters/servlets in charge of handling
http://puzzlebazaar.appspot.com#!main do not start; my analysis is that the
App Engine servlet container refuses to spawn a new thread for that new
request.

From the logs, I see the following:
- www.

Original comment by philippe.beaudoin on 16 May 2010 at 10:58

GoogleCodeExporter commented 9 years ago
Bumping up. This is blocked for release 0.2, at least on AppEngine. It should
work on other servers, but I can't really test it.

Original comment by philippe.beaudoin on 17 May 2010 at 5:25

GoogleCodeExporter commented 9 years ago

Original comment by philippe.beaudoin on 18 May 2010 at 6:31

GoogleCodeExporter commented 9 years ago
Still no news on this one. Posted on the AppEngine group:
http://groups.google.com/group/google-appengine/browse_thread/thread/28a9f9737b1b26b5

Original comment by philippe.beaudoin on 27 Jun 2010 at 5:49

GoogleCodeExporter commented 9 years ago
HtmlUnit 2.8 supports App Engine, apparently.

Original comment by matt2224 on 18 Aug 2010 at 7:37

GoogleCodeExporter commented 9 years ago
Excellent! I compiled HtmlUnit with a patch to support AppEngine before, but
encountered an internal problem in AppEngine. I'll take a look soon to see if
every problem has resolved itself magically!

Original comment by philippe.beaudoin on 18 Aug 2010 at 8:06

GoogleCodeExporter commented 9 years ago
Great! Please keep us updated on your findings. Crawling is pretty darn
important in today's online world.

Original comment by matt2224 on 19 Aug 2010 at 12:29

GoogleCodeExporter commented 9 years ago
Chris Jacob proposed building an AppEngine-based service for anybody who needs
HtmlUnit to process their page. This strikes me as a brilliant idea that could
be done even with the current problem discussed in this thread. Somebody
should do it!

http://stackoverflow.com/questions/3517944/making-ajax-applications-crawlable-how-to-build-a-simple-web-service-on-google-a

Original comment by philippe.beaudoin on 19 Aug 2010 at 2:02

GoogleCodeExporter commented 9 years ago
http://code.google.com/p/gwt-platform/issues/detail?id=1#c4

I've just tried out what you said here, but with HtmlUnit 2.8 on GAE, and I get 
a server error.

Uncaught exception from servlet
java.lang.NoSuchMethodError: com.gargoylesoftware.htmlunit.WebClient.pumpEventLoop(J)I
    at com.gwtplatform.crawler.server.CrawlFilter.doFilter(CrawlFilter.java:124)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:129)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:59)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:122)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:110)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at com.google.apphosting.utils.servlet.ParseBlobUploadFilter.doFilter(ParseBlobUploadFilter.java:97)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at com.google.apphosting.runtime.jetty.SaveSessionFilter.doFilter(SaveSessionFilter.java:35)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at com.google.apphosting.utils.servlet.TransactionCleanupFilter.doFilter(TransactionCleanupFilter.java:43)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
    at com.google.apphosting.runtime.jetty.AppVersionHandlerMap.handle(AppVersionHandlerMap.java:238)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
    at com.google.apphosting.runtime.jetty.RpcRequestParser.parseAvailable(RpcRequestParser.java:76)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at com.google.apphosting.runtime.jetty.JettyServletEngineAdapter.serviceRequest(JettyServletEngineAdapter.java:135)
    at com.google.apphosting.runtime.JavaRuntime.handleRequest(JavaRuntime.java:251)
    at com.google.apphosting.base.RuntimePb$EvaluationRuntime$6.handleBlockingRequest(RuntimePb.java:6784)
    at com.google.apphosting.base.RuntimePb$EvaluationRuntime$6.handleBlockingRequest(RuntimePb.java:6782)
    at com.google.net.rpc.impl.BlockingApplicationHandler.handleRequest(BlockingApplicationHandler.java:24)
    at com.google.net.rpc.impl.RpcUtil.runRpcInApplication(RpcUtil.java:398)
    at com.google.net.rpc.impl.Server$2.run(Server.java:852)
    at com.google.tracing.LocalTraceSpanRunnable.run(LocalTraceSpanRunnable.java:56)
    at com.google.tracing.LocalTraceSpanBuilder.internalContinueSpan(LocalTraceSpanBuilder.java:576)
    at com.google.net.rpc.impl.Server.startRpc(Server.java:807)
    at com.google.net.rpc.impl.Server.processRequest(Server.java:369)
    at com.google.net.rpc.impl.ServerConnection.messageReceived(ServerConnection.java:442)
    at com.google.net.rpc.impl.RpcConnection.parseMessages(RpcConnection.java:319)
    at com.google.net.rpc.impl.RpcConnection.dataReceived(RpcConnection.java:290)
    at com.google.net.async.Connection.handleReadEvent(Connection.java:474)
    at com.google.net.async.EventDispatcher.processNetworkEvents(EventDispatcher.java:831)
    at com.google.net.async.EventDispatcher.internalLoop(EventDispatcher.java:207)
    at com.google.net.async.EventDispatcher.loop(EventDispatcher.java:103)
    at com.google.net.rpc.RpcService.runUntilServerShutdown(RpcService.java:251)
    at com.google.apphosting.runtime.JavaRuntime$RpcRunnable.run(JavaRuntime.java:418)
    at java.lang.Thread.run(Unknown Source)

Original comment by matt2224 on 19 Aug 2010 at 1:39

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hmm, I've somewhat got it working, although I get the following error when
trying to 'crawl' certain pages:

com.gargoylesoftware.htmlunit.ScriptException: Exception invoking jsxGet_cookie

Original comment by matt2224 on 19 Aug 2010 at 6:09

GoogleCodeExporter commented 9 years ago
Meanwhile, I filed an AppEngine issue for my problem. Please star it:
http://code.google.com/p/googleappengine/issues/detail?id=3602

Original comment by philippe.beaudoin on 19 Aug 2010 at 6:11

GoogleCodeExporter commented 9 years ago
Re: Uncaught exception from servlet

An older version of HtmlUnit is included in the App Engine jar; IIRC, I got
rid of this error by reordering the dependencies in my build path.

Original comment by philippe.beaudoin on 19 Aug 2010 at 6:13

GoogleCodeExporter commented 9 years ago
Yes, that gets rid of that error. :)

I'll star the issue, although http://ajax-crawler.appspot.com/ works
absolutely fine. Do you know where I could get the source code for that app?
I've emailed Amit about it; hopefully he'll reply. In all the cases where my
app fails, his works, so maybe he's doing something slightly magical.

Original comment by matt2224 on 19 Aug 2010 at 6:17

GoogleCodeExporter commented 9 years ago
Note: http://ajax-crawler.appspot.com/ works absolutely fine with the same URL.

Original comment by matt2224 on 19 Aug 2010 at 6:25

GoogleCodeExporter commented 9 years ago
So does my app, as long as the page is served statically, which I'm ready to
bet is the case for the ajax-crawler app. I've never been able to get Amit to
confirm that his app works well with dynamically generated content, which is
needed for any app that fetches most of its content via an XMLHttpRequest
(i.e. a typical GWT app).

Original comment by philippe.beaudoin on 19 Aug 2010 at 6:35

GoogleCodeExporter commented 9 years ago
Scratch that, it looks like ajax-crawler is using GWT and XMLHttpRequest. I 
really don't know where my problem comes from then. I will give it another shot 
soon and try to diagnose better.

Original comment by philippe.beaudoin on 19 Aug 2010 at 6:36

GoogleCodeExporter commented 9 years ago
Ah, well, I tried a few dynamic websites with ajax-crawler (try entering
http://www.google.com/codesearch/p?hl=en#RNY0MQIrFHY/tcl/compat/zlib/examples/zpipe.c&q=exa&sa=N&cd=8&ct=rc),
and it worked as long as they didn't exceed the 30-second limit. Whether it
would work from the same URL, I don't know.

Original comment by matt2224 on 19 Aug 2010 at 6:47

GoogleCodeExporter commented 9 years ago
This worked from my app too. My problem is that I got it down to a very
minimal app with only a couple of lines, fetching itself, and this is what
failed. Maybe it's only if the initial request is dynamic. (I was serving a
.jsp)

Original comment by philippe.beaudoin on 19 Aug 2010 at 7:00

GoogleCodeExporter commented 9 years ago
Hmm, can you host the source for your crawler test app somewhere, please? I'm
having trouble even getting to that level of functionality. I keep getting
"com.gargoylesoftware.htmlunit.ScriptException: Exception invoking
jsxGet_cookie"

Thanks!

Original comment by matt2224 on 19 Aug 2010 at 7:03

GoogleCodeExporter commented 9 years ago
I'm very excited to see you guys are trying to nut this one out! I've
subscribed and will help wherever possible... If you make your tests open
source, that would help ;-)

Original comment by i.chris....@gmail.com on 20 Aug 2010 at 7:48

GoogleCodeExporter commented 9 years ago
Philippe? Any chance of getting the code you have?

Original comment by matt2224 on 22 Aug 2010 at 12:57

GoogleCodeExporter commented 9 years ago
Hi Matt,

Sorry, I don't have much besides PuzzleBazar, which you can get at
http://code.google.com/p/puzzlebazar/. The rest were short-lived experiments I
never committed anywhere.

As soon as I have some time to work on this I'll make sure to let you know.
I'll make investigating this a priority for release 0.5. If I still run into
the previous problem, I'll fall back on an AppEngine-based service similar to
what you propose.

Original comment by philippe.beaudoin on 22 Aug 2010 at 1:05

GoogleCodeExporter commented 9 years ago
Ok. Thank you, I'll check out the PuzzleBazar code.

Original comment by matt2224 on 22 Aug 2010 at 1:11

GoogleCodeExporter commented 9 years ago
Ideally I would love to see your HtmlUnit work extracted out into a separate
project: the bare basics to get HtmlUnit functional on GAE to produce HTML
snapshots... a basic form to submit a URL (to test the API) and return results
as text/plain (?).

I don't have any contacts in the Java community... but it would be great to
get others on board if you guys know anyone.

This project could be hugely popular... particularly with Google boasting Ajax
crawlability with HtmlUnit (but not offering any useful instructions to get
people started!).

Also, a separate app (service) would get around your current issues with
same-domain queries failing.

FYI, I'm working on an alternative to Google's #! approach. Early days, but
the project has started here: http://github.com/chrisjacob/headlessajax . My
idea should make JavaScript content crawlable by all search engines and
accessible to screen readers and people with JS disabled. Essentially, the
server delivers the page containing anchor elements <a></a> whose hrefs carry
GET "?" params; if JS is enabled, these URLs are replaced with hash params "#"
on DOMready, for non-page-refreshing dynamic content (Ajax). E.g. an anchor
tag with href www.example.com?page=home&tab=tabA is converted to
www.example.com#page=home&tab=tabA. "?" URLs are indexable and followable by
non-JS users. "#" URLs cause no refresh and make Ajax content deep-linkable
for bookmarking, sharing and history (back/forward). Basically, we offload JS
to the server side for visitors that don't have JS on the client side, using
HtmlUnit to generate static HTML snapshots for any GET requests. (There will
be an option to prevent the "?" to "#" conversion on an element, and also an
option to prevent HTML snapshots being delivered for genuine GET requests.)
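
Since this is a GWT thread, here is what the "?" to "#" rewrite could look
like in GWT client code. Purely an illustrative sketch of the idea described
above, not code from headlessajax; all names are hypothetical.

    import com.google.gwt.dom.client.AnchorElement;
    import com.google.gwt.dom.client.Document;
    import com.google.gwt.dom.client.Element;
    import com.google.gwt.dom.client.NodeList;

    // Hypothetical helper: call once on module load (i.e. when JS runs).
    public final class HashifyLinks {
      private HashifyLinks() {}

      public static void rewriteAnchors() {
        NodeList<Element> anchors = Document.get().getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
          AnchorElement a = anchors.getItem(i).cast();
          String href = a.getHref();
          int q = href.indexOf('?');
          // Simplified: rewrite links carrying query params and no hash yet.
          if (q >= 0 && !href.contains("#")) {
            a.setHref(href.substring(0, q) + "#" + href.substring(q + 1));
          }
        }
      }
    }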

I'm pretty excited about this idea, as I feel it will be an excellent
alternative to Google's approach.

Setting up HtmlUnit for free on GAE as a service means either approach could
be easily adopted by the "average" web developer...

Wish I knew more Java right now so I could be more helpful on that front ;-)

Original comment by i.chris....@gmail.com on 22 Aug 2010 at 2:55

GoogleCodeExporter commented 9 years ago
Problem is, you want to avoid using HtmlUnit to generate snapshots as much as
possible -- it's much too expensive CPU-wise.

Original comment by matt2224 on 22 Aug 2010 at 3:10

GoogleCodeExporter commented 9 years ago
Yes. A caching mechanism is planned for this feature. After all, you don't
expect your site to be crawled by the spider every 10 minutes.

Original comment by philippe.beaudoin on 22 Aug 2010 at 5:14

GoogleCodeExporter commented 9 years ago
When I was speaking about it being expensive, I was referring to Chris Jacob's
idea of using HtmlUnit for people who don't have JS enabled.

Original comment by matt2224 on 22 Aug 2010 at 6:29

GoogleCodeExporter commented 9 years ago
Thanks Matt for the warning re: CPU usage. Caching is certainly important to
help mitigate some of the load. Spiders and non-JS users should only make up a
VERY small percentage of visitors... so it would be interesting to see how
things perform with this in mind.

Also, the HTML snapshot generation is being offloaded to another server (for
now GAE).

I'm thinking of distinguishing between crawlers and non-JS users... so I guess
you could selectively enable the snapshots only for search engines and not for
users. Or you could have a second GAE app for handling non-JS user snapshots
to split the load. Some aggressive caching could be added if CPU becomes a big
issue.

Either way, I still think it's exciting that the JS could be offloaded to the
server like this :-)

Original comment by i.chris....@gmail.com on 22 Aug 2010 at 11:21

GoogleCodeExporter commented 9 years ago
No progress on this for now. Let's try to make it a priority for 0.5.

Original comment by philippe.beaudoin on 2 Sep 2010 at 10:19

GoogleCodeExporter commented 9 years ago

Original comment by philippe.beaudoin on 22 Sep 2010 at 1:24

GoogleCodeExporter commented 9 years ago
After sorting through the various forums, I decided to write a blog post to
put everything I did in one place. The biggest problem now appears to be
startup times for your instance on App Engine, resulting in timeouts. The
other thing to watch is to make sure that you don't use the redirects in an
irresponsible way, leading to infinite loops.

The blog post is here:

http://www.ozdroid.com/#!BLOG/2010/10/12/How_to_Make_Google_AppEngine_Applications_Ajax_Crawlable

Geoff

Original comment by geoff.br...@gmail.com on 14 Oct 2010 at 3:27

GoogleCodeExporter commented 9 years ago
Thanks! A lot seems to be happening on that front...

Original comment by philippe.beaudoin on 14 Oct 2010 at 3:48

GoogleCodeExporter commented 9 years ago
Hi guys 

I have tested http://www.ozdroid.com/ with a simple Ajax Google bot check, and
the server encounters an error. It seems your filter transforms the URL in the
wrong way. The error is:

 NOTE: This page will be indexed with the following URL: http://www.ozdroid.com/?#!BLOG

I guess the cause is the "?". It should be http://www.ozdroid.com/#!BLOG

Original comment by abart...@gmail.com on 15 Oct 2010 at 7:13

GoogleCodeExporter commented 9 years ago
Firstly, abartkiv, please don't blame the good people on this project for my
code. I am not on this project and only posted here because I thought they
might be interested.

They have certainly made good posts that interest me.

I'm not sure where you think my mistake is, but rewriteQueryString() is passed
the query string without the "?"; that is, it is passed

"_escaped_fragment_=BLOG"

which is transformed to

"#!BLOG"

Here's part of the server log:

#
10-15 02:02AM 07.525

com.ozdroid.website.server.ajax_crawling.CrawlServlet rewriteQueryString:
Start: _escaped_fragment_=BLOG

#
I 10-15 02:02AM 07.525

com.ozdroid.website.server.ajax_crawling.CrawlServlet rewriteQueryString: End:
#!BLOG

I hope that's what you are talking about. I have actually had GoogleBot
successfully crawl the dynamic content, but the timeouts, mostly caused by
URLFetch, have been causing a headache. Attached is what GoogleBot returned.
The post is in the datastore, and if this lays out OK you can find it at the
bottom.

Attachments:

GoogleCodeExporter commented 9 years ago
Hi

I didn't want to blame you or anybody else :). I just figured out that issue
and shared it, because crawling on App Engine is important to me too. So far I
have not succeeded, and I read that group quite often.

Original comment by abart...@gmail.com on 15 Oct 2010 at 12:32

GoogleCodeExporter commented 9 years ago
Bumping to 0.6, preparing release 0.5.

Original comment by philippe.beaudoin on 25 Jan 2011 at 6:33

GoogleCodeExporter commented 9 years ago
Finally managed to tackle this one!

GWTP now includes:

1) gwtp-crawler-service.jar, a simple service that can run on AppEngine and
which uses the following simple API:
http://mycrawlerservice.appspot.com/?key=mykey&url=http://urlencoded.url.to.render

2) gwtp-crawler.jar, a simple filter that will intercept any URL containing an
_escaped_fragment_ parameter and render it using the service described above.

Retry and caching strategies are used to get around AppEngine limitations (30s
connection and 10s max URLFetch). Even then, it's highly recommended that you
make both your app and your crawler service paid apps with Always On.

An example of a crawler service has been added to GWTP's examples. It is
deployed on AppEngine; try:
  http://crawlservice.appspot.com?key=123456&url=http://google.com

The gwtp-sample-hplace sample uses gwtp-crawler. It is also deployed on
AppEngine; try:
  http://hplacedemo.appspot.com/?_escaped_fragment_=homePage

The above examples may fail as they are not paid apps.
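
The service API above can be exercised from any HTTP client. A minimal sketch
in plain Java, an assumption for illustration rather than part of GWTP itself:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class CrawlServiceSmokeTest {
      public static void main(String[] args) throws Exception {
        // The url parameter must be URL-encoded, as in the API description.
        String target = URLEncoder.encode("http://google.com", "UTF-8");
        URL service = new URL(
            "http://crawlservice.appspot.com/?key=123456&url=" + target);
        BufferedReader in = new BufferedReader(
            new InputStreamReader(service.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // the rendered HTML snapshot
        }
        in.close();
      }
    }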

I'll leave the issue open for now as a reminder that documentation is needed
in the wiki.

Original comment by philippe.beaudoin on 6 May 2011 at 3:27

GoogleCodeExporter commented 9 years ago
Yeah! Done! The doc is here:
http://code.google.com/p/gwt-platform/wiki/CrawlerSupport

Original comment by philippe.beaudoin on 7 May 2011 at 1:29

GoogleCodeExporter commented 9 years ago
Thanks a lot for fixing this long-pending issue :)

I have one quick question: gwtp-crawler-service uses Google App Engine's
datastore to store the CachedPages. So I guess I need to make changes to store
the CachedPages in some other database if I want to deploy
gwtp-crawler-service on some other servlet container (Tomcat etc.). Can you
please confirm?

The reason I ask is that the gwt-platform wiki says the following about
gwtp-crawler-service: "It is designed to run on AppEngine but can be called by
your application even if this one runs on another servlet container."
Original comment by jchaga...@gmail.com on 10 May 2011 at 4:44

GoogleCodeExporter commented 9 years ago
Yeah, I probably need to make that clearer on the wiki. Basically, the service
itself is tightly coupled with AppEngine, but the filter will work with any
service that conforms to the ?key=xyz&url=http://something API. So you can
rewrite your own service to run on another backend and use the filter with it.
It shouldn't be too hard to do given how simple gwtp-crawler-service is.
(However, if you do write it to run on another backend, I and many others
would no doubt appreciate it if you considered open sourcing the result! :))

That being said, nothing prohibits you from using AppEngine for your crawler
service but Tomcat for your application. You can even use
  http://crawlservice.appspot.com/?key=123456
to test things out, so you only have to get the crawler filter to work to
start with.
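
For such a port, the only AppEngine-specific piece mentioned in this thread is
the datastore-backed CachedPages. A hypothetical sketch of the abstraction a
port might introduce; the interface and names are illustrative, not GWTP's
actual API:

    // Hypothetical storage abstraction for rendered snapshots.
    public interface CachedPageStore {
      /** Returns the cached snapshot for this URL, or null if absent or stale. */
      String get(String url);

      /** Stores a freshly rendered snapshot with its fetch timestamp. */
      void put(String url, String renderedHtml, long fetchedAtMillis);
    }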

Original comment by philippe.beaudoin on 10 May 2011 at 6:58

GoogleCodeExporter commented 9 years ago
Sure.. I would be more than glad to open source the result so it can be used
with other servlet containers. I will update you as soon as I am done.

Original comment by jchaga...@gmail.com on 10 May 2011 at 9:58