So what advantage will an RSS feed crawl have over a regular website crawl?
1. Will all feeds be saved as one file, and then updated consecutively as new feeds arrive?
2. What kind of customization would it be?
Original comment by szybal...@gmail.com
on 3 Jul 2008 at 2:43
Original comment by abpil...@gmail.com
on 7 Jul 2008 at 8:16
I'm not an expert on RSS, but I think:
1. RSS crawling should be able to crawl the links in a feed and treat them the same way we treat href=".." in HTML files for further processing (a rough sketch is below, after this list).
2. It should be able to read RSS 1.0, RSS 2.0, and Atom.
3. Do we save the RSS feeds as .rss files or as .xml?
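Here's a rough sketch of what I mean by 1., assuming Python and the third-party feedparser library (which normalizes RSS 1.0, RSS 2.0 and Atom into one structure, so 2. comes for free); queue_for_crawl is just a made-up placeholder for however the crawler already queues href=".." links:

import feedparser

def queue_for_crawl(url):
    # Placeholder for the crawler's existing link-queueing step,
    # i.e. the same path an href=".." extracted from HTML would take.
    print("queued:", url)

def crawl_feed(feed_url):
    # feedparser handles RSS 1.0, RSS 2.0 and Atom uniformly.
    d = feedparser.parse(feed_url)
    for entry in d.entries:
        link = entry.get("link")
        if link:
            queue_for_crawl(link)

crawl_feed("http://example.com/feed.xml")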
Original comment by szybal...@gmail.com
on 9 Jul 2008 at 2:50
Some thoughts:
1. RSS can enable incremental crawls. I think that it requires you to keep state, though? If you're crawling a blog, for instance, you can find all new posts since the last crawl - however, RSS typically won't tell you what has changed, only what the last "n" updates on the blog were. (A sketch of this state-keeping follows the list.)
2. Maybe (1) could imply auto-generation of link-following rules, to allow incremental crawls of only those links which are "related" to the changed links? For instance, only crawl new blog messages, and replies to those messages?
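A minimal sketch of the state-keeping in 1., assuming Python's feedparser (its etag/modified parameters and per-entry ids are real feed mechanics; the state file and function names are made up for illustration):

import json, os
import feedparser

STATE_FILE = "feed_state.json"  # hypothetical location for persisted state

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"etag": None, "modified": None, "seen_ids": []}

def crawl_new_entries(feed_url):
    state = load_state()
    # Conditional fetch: the server can answer 304 if nothing changed.
    d = feedparser.parse(feed_url, etag=state["etag"], modified=state["modified"])
    if getattr(d, "status", None) == 304:
        return []  # nothing new since the last crawl
    seen = set(state["seen_ids"])
    # The feed only exposes the last "n" entries, so dedup against stored
    # ids to recover "what is new since the last crawl".
    new = [e for e in d.entries if e.get("id", e.get("link")) not in seen]
    state["etag"] = getattr(d, "etag", None)
    state["modified"] = getattr(d, "modified", None)
    state["seen_ids"] = list(seen | {e.get("id", e.get("link")) for e in d.entries})
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return [e.get("link") for e in new]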
Also, what is meant by "first step to Web 2.0 integration"? Is there some grand plan?
thanks!
Original comment by vijay...@gmail.com
on 9 Jul 2008 at 10:51
Original issue reported on code.google.com by abpil...@gmail.com
on 25 Jun 2008 at 12:31