tokyotech / monarch-flex

Automatically exported from code.google.com/p/monarch-flex

Add Puneet's required websites #15

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Better do it early so we have good stats.

Original issue reported on code.google.com by tokyotech on 12 Apr 2009 at 6:55

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
finished adding:

http://groups.google.com/group/alt.politics.usa/topics?start=0&sa=N

Original comment by tokyotech on 12 Apr 2009 at 10:40

GoogleCodeExporter commented 9 years ago
Puneet asked for Twitter, so I'm in the process of adding
http://twitter.com/SarahPalin . But I'm not sure how your crawler will handle
Twitter - it has no threads, no replies, and "next page of threads" is AJAXed.
Which regexes should go where so your crawler doesn't crash?

Original comment by tokyotech on 13 Apr 2009 at 5:25

GoogleCodeExporter commented 9 years ago
The crawler won't crash. You just need to make sure that "threadURL" will match
the URL of the start page because the site is pretty much just a page of posts.
You can add null regexes for anything that doesn't exist. You need to test to
make sure but I think that's all you need to have in mind.
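
A minimal sketch of that setup, in Python for illustration only. Only
"threadURL" comes from this thread; the other field names (replyBlock,
nextThreadPage), the dict layout, and the matching_fields helper are
hypothetical and are not the crawler's actual config format. A "null regex" is
assumed here to mean a pattern that can never match.

```python
import re

# Assumption: a "null regex" is a pattern that never matches anything.
# An empty negative lookahead always fails, so it serves that purpose.
NULL_REGEX = r"(?!)"

# Hypothetical config layout. Only the "threadURL" name appears in this thread.
twitter_config = {
    # The start page is the only "thread", so threadURL matches it directly.
    "threadURL": r"^http://twitter\.com/SarahPalin/?$",
    # Twitter has no replies and no static "next page" link to follow.
    "replyBlock": NULL_REGEX,
    "nextThreadPage": NULL_REGEX,
}


def matching_fields(config, text):
    """Return the sorted names of config fields whose regex matches text."""
    return sorted(
        name for name, pattern in config.items() if re.search(pattern, text)
    )


if __name__ == "__main__":
    print(matching_fields(twitter_config, "http://twitter.com/SarahPalin"))
```

Run against the start page URL, only "threadURL" should match, and the null
fields should stay silent for any input. That is the thing to verify when
testing.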

Original comment by andrewps...@gmail.com on 13 Apr 2009 at 9:23

GoogleCodeExporter commented 9 years ago
Puneet wanted Yahoo Groups, but to read a group, you need to be a member of that
group. So I guess we can't crawl Yahoo Groups, right?

Original comment by tokyotech on 20 Apr 2009 at 5:54

GoogleCodeExporter commented 9 years ago
Added Twitter.  All that's left is Yahoo Groups... (read last comment).

Original comment by tokyotech on 21 Apr 2009 at 11:26

GoogleCodeExporter commented 9 years ago
Decided not to do Yahoo Groups. All the lively groups are member-read-only.
Added Gizmodo instead.  It's weird how the replies don't show up in the HTML
source, so I'm only scraping for the first post right now.

Original comment by tokyotech on 24 Apr 2009 at 3:07

GoogleCodeExporter commented 9 years ago
This is the site I mentioned that Puneet had mentioned:

http://www.mail-archive.com/flexcoders@yahoogroups.com/

Original comment by andrewps...@gmail.com on 24 Apr 2009 at 5:01

GoogleCodeExporter commented 9 years ago
seems like this weird reply structure won't work since the replies have to be
on the same page as the first post, right?

Original comment by tokyotech on 24 Apr 2009 at 5:18

GoogleCodeExporter commented 9 years ago
yes, you're right. Our regular expressions don't support this.

Original comment by andrewps...@gmail.com on 24 Apr 2009 at 4:03

GoogleCodeExporter commented 9 years ago
Fuck Yahoo. They can't get anything right.

Original comment by tokyotech on 24 Apr 2009 at 7:52