Open GoogleCodeExporter opened 9 years ago
Original comment by berlin.b...@gmail.com
on 2 Apr 2008 at 9:00
As of 4/2/2008:
1. Clean up UI for main front-end, more like news.google.com or reddit.com
* Still on!
2. Clean up the threaded comment system (see reddit or news.ycombinator.com)
* Deprecated, save for 0.7 release
3. Work on crawling system - text mining system
* Still on!
4. Work on laughing man bot (see #botlist channel)
* Deprecated, save for 0.7 release
5. Work on archiving system
* Still on! (main priority)
6. New hunchentoot front-end
* Deprecated, save for 0.7 release
* Possibly Django
----------
3. Work on crawling system - text mining system
----------
5. Work on archiving system
----------
The archive system will be stored on Amazon's S3 with help from Amazon's EC2.
The goal: store URL content data for research and analysis by third parties.
Groups: URLs are stored as groups.
Group Type Primary Key - unique ID
Group Type (1): Date - URL content is stored in packages defined by date
("I collected 5000 URLs in an hour")
Group Type (2): Rating - details pending. URL content is given a rating.
Group Type (3): Hostname - contains useful information about the URL
(e.g. http://www.ibm.com is presumably about IBM).
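As a sketch of group type (3), the hostname key could be derived from a stored URL like this (the helper name is illustrative, not part of the original notes):

```python
from urllib.parse import urlparse

def hostname_group(url):
    # Illustrative helper: the hostname acts as the group key,
    # so every URL from www.ibm.com lands in the same group.
    return urlparse(url).hostname

# e.g. hostname_group("http://www.ibm.com/press") -> "www.ibm.com"
```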
SQLite remote table structure:
X. Master Table
[ID], [uniqueid], [hostname], [Date=dd/mm/yyyy:hh], [Rating Combined],
[group unique id table]
Y. Sub Table has more content.
E.g. this sub table could have
[group unique id table], [full URL], and all of that information.
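The master/sub layout above could be sketched in SQLite as follows; the column names follow the notes, while the types and the join key are assumptions:

```python
import sqlite3

# Hypothetical schema sketch for the master table and one sub table.
# Types and constraints are assumptions; only the column names come
# from the notes above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master (
    id              INTEGER PRIMARY KEY,
    uniqueid        TEXT NOT NULL,   -- group-type primary key
    hostname        TEXT,
    date            TEXT,            -- dd/mm/yyyy:hh
    rating_combined REAL,
    group_table     TEXT             -- key of the matching sub table
);
CREATE TABLE sub (
    group_table TEXT NOT NULL,       -- joins back to master
    full_url    TEXT NOT NULL
);
""")
```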
----------
Archive issues (S3)
----------
W. There are only two types of tables: the main master table with a list, and
each sub table with the URL content.
U. We will have duplicate URLs. Duplicates can only be removed by combining all
the sets.
V. Can't look up a single URL; URLs can only be found by combining all the sets.
O. Write-once archive system.
Database = sqlite
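Issues U and V can be sketched as set operations; the package names and URLs here are illustrative:

```python
# Issue U: each crawl package may repeat URLs; duplicates only disappear
# once all the packages are unioned together.
package_a = {"http://a.example/1", "http://b.example/2"}
package_b = {"http://b.example/2", "http://c.example/3"}

all_urls = package_a | package_b  # cross-package duplicates collapse here

def contains(url, packages):
    # Issue V: a single-URL lookup still has to scan every package,
    # since no package alone is authoritative.
    return any(url in p for p in packages)
```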
----------
Crawler Agent System (EC2)
----------
A. Launch an instance. Crawl till completion. Crawl RSS feeds. Crawl the web.
B. The RSS feed crawler will work as always.
C. Push the database to S3 (as a gzipped file).
Database = MySQL
Original comment by berlin.b...@gmail.com
on 5 Apr 2008 at 9:31
Original comment by berlin.b...@gmail.com
on 5 Apr 2008 at 9:32
Modification to the "ghost-arc"/"ghostarc" project (short for "ghost-archive"):
sqlite data payloads will be sent to S3 along with human-readable semantic web
documents. The sqlite data payloads will ALSO be sent to a Ruby on Rails system.
Original comment by berlin.b...@gmail.com
on 10 Apr 2008 at 1:54
More additions:
RabbitMQ (or another queueing system) for bot messaging.
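The bot-messaging shape could be sketched like this; an in-process queue stands in for RabbitMQ, and the channel/message fields are illustrative:

```python
from queue import Queue

# Stand-in for the broker: RabbitMQ (or another queueing system) would
# replace this in-process Queue, but the publish/consume shape is the same.
bot_queue = Queue()

def publish(channel, text):
    # A bot (e.g. the laughing man bot) publishes a message for a channel.
    bot_queue.put({"channel": channel, "text": text})

def consume():
    # A worker pops the next pending bot message.
    return bot_queue.get()

publish("#botlist", "new URLs archived")
```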
Original comment by berlin.b...@gmail.com
on 10 Apr 2008 at 9:02
Move the ad listings to the forums. We are deprecating aspects of this
application and focusing on the web data.
Original comment by berlin.b...@gmail.com
on 11 Apr 2008 at 12:26
Build graphs for documentation with graphviz.
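A documentation graph for graphviz could be generated with a small helper like this; the node names and edges are illustrative, and the output would be rendered with e.g. `dot -Tpng`:

```python
def to_dot(edges):
    # Emit a minimal Graphviz DOT digraph from (source, target) pairs.
    lines = ["digraph ghostarc {"]
    lines += ['    "%s" -> "%s";' % (a, b) for a, b in edges]
    lines.append("}")
    return "\n".join(lines)

# Hypothetical data-flow edges for the project docs.
dot = to_dot([("crawler", "sqlite"), ("sqlite", "S3")])
```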
Original comment by berlin.b...@gmail.com
on 11 Apr 2008 at 2:38
Original issue reported on code.google.com by
berlin.b...@gmail.com
on 1 Mar 2008 at 12:06