liseryang / openbotlist

Automatically exported from code.google.com/p/openbotlist

Current task list for botlist - b 0.6 release #56

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Task list:
1. Clean up UI for main front-end, more like news.google.com or reddit.com
2. Clean up threading comment system (see reddit or news.ycombinator.com)
3. Work on crawling system - text mining system
4. Work on laughing man bot (see #botlist channel)
5. Work on archiving system
6. New hunchentoot front-end

Original issue reported on code.google.com by berlin.b...@gmail.com on 1 Mar 2008 at 12:06


GoogleCodeExporter commented 9 years ago
As of 4/2/2008:

1. Clean up UI for main front-end, more like news.google.com or reddit.com
* Still on!

2. Clean up threading comment system (see reddit or news.ycombinator.com)
* Deferred to the 0.7 release

3. Work on crawling system - text mining system
* Still on!

4. Work on laughing man bot (see #botlist channel)
* Deferred to the 0.7 release

5. Work on archiving system
* Still on! (main priority)

6. New hunchentoot front-end
* Deferred to the 0.7 release
* Possibly Django
----------
3. Work on crawling system - text mining system
----------

----------
5. Work on archiving system
----------

The archive system will be stored on Amazon S3, with help from Amazon EC2.

The goal: Store URL content data for research and analysis by third parties.
Groups: URLs are stored as groups.

Group Type Primary Key - Unique ID

Group Type(1): Date - URL content is stored in packages defined by date.
"I collected 5000 URLs in an hour."

Group Type(2): Rating - details pending. URL content is given a rating.

Group Type(3): Hostname - contains useful information about the URL
(e.g. is http://www.ibm.com about IBM?).
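
A rough sketch of how those three group keys might be derived for a crawled URL. The field names, key formats, and hashing scheme below are assumptions for illustration, not part of the spec:

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import urlparse

def group_keys(url, rating, fetched_at=None):
    """Derive the three group keys described above for one crawled URL.

    Only the grouping dimensions (date, rating, hostname) come from the
    spec; the exact formats here are guesses.
    """
    fetched_at = fetched_at or datetime.now(timezone.utc)
    return {
        # Group Type(1): package URLs by the hour they were collected.
        "date": fetched_at.strftime("%d/%m/%Y:%H"),
        # Group Type(2): rating bucket (details pending in the spec).
        "rating": int(rating),
        # Group Type(3): the hostname, e.g. www.ibm.com for an IBM page.
        "hostname": urlparse(url).hostname,
        # A unique id for the group row, e.g. a hash of URL + timestamp.
        "uniqueid": hashlib.sha1(
            (url + fetched_at.isoformat()).encode("utf-8")
        ).hexdigest(),
    }

print(group_keys("http://www.ibm.com/somepage", rating=3))
```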

SQLite Remote Table structure:
X. Master Table
[ID], [uniqueid], [hostname], [Date=dd/mm/yyyy:hh], [Rating Combined], [group unique id table]

Y. Sub Table has more content.
E.g. this sub table could have [group unique id table], [full URL]... and all of that information.
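
A minimal sketch of that two-table layout as a SQLite schema. The column names follow the list above loosely; the exact types and constraints are assumptions:

```python
import sqlite3

# Sketch of the two-table archive layout described above.
conn = sqlite3.connect("ghostarc_archive.db")
conn.executescript("""
    -- X. Master table: one row per URL group.
    CREATE TABLE IF NOT EXISTS master (
        id              INTEGER PRIMARY KEY,
        uniqueid        TEXT NOT NULL,
        hostname        TEXT,
        date            TEXT,      -- dd/mm/yyyy:hh
        rating_combined INTEGER,
        group_table_id  TEXT       -- key into the sub table below
    );

    -- Y. Sub table: the full URL content for each group.
    CREATE TABLE IF NOT EXISTS sub_content (
        group_table_id  TEXT,
        full_url        TEXT,
        content         TEXT
    );
""")
conn.commit()
conn.close()
```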

----------
Archive  issues (S3)
----------
W. There are only two types of tables: the main master table with a list, and each sub table with the URL content.
U. We will have duplicate URLs. Duplicates can be removed if you combine all the sets.
V. Can't look up a single URL. URLs can only be found by combining all the sets.
O. Write-once archive system.

Database = sqlite
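
Since duplicates only disappear when the sets are combined, a consumer of the archive might merge the per-archive URL lists along these lines. The archive file names and the sub-table/column names come from the sketch above and are assumptions:

```python
import sqlite3

def unique_urls(archive_paths):
    """Combine the URL sets from several write-once archives, dropping duplicates.

    Assumes each archive file contains the sub_content table sketched above.
    """
    seen = set()
    for path in archive_paths:
        conn = sqlite3.connect(path)
        for (url,) in conn.execute("SELECT full_url FROM sub_content"):
            seen.add(url)
        conn.close()
    return seen

# Example: merge a couple of hourly archives pulled down from S3
# (hypothetical file names).
urls = unique_urls(["archive_2008040510.db", "archive_2008040511.db"])
print(len(urls), "unique URLs")
```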

----------
Crawler Agent System (EC2)
----------

A. Launch an instance. Crawl until completion. Crawl RSS feeds. Crawl the web.
B. The RSS feed crawler will work as it always has.
C. Push the database to S3 (gzipped file).

Database = MySQL
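
Step C could look roughly like the following with boto3. The bucket name, key, and file path are hypothetical, and boto3 postdates this issue, so treat it purely as a sketch:

```python
import gzip
import shutil

import boto3  # assumes AWS credentials are configured in the environment

def push_database_to_s3(db_path, bucket, key):
    """Gzip the crawler's database dump and upload it to S3 (step C above)."""
    gz_path = db_path + ".gz"
    with open(db_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    boto3.client("s3").upload_file(gz_path, bucket, key)

# Hypothetical bucket/key names for illustration.
push_database_to_s3("crawl_dump.sql", "ghostarc-archive", "crawls/crawl_dump.sql.gz")
```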

Original comment by berlin.b...@gmail.com on 5 Apr 2008 at 9:31


GoogleCodeExporter commented 9 years ago
Modification to the "ghost-arc"/"ghostarc" project (short for "ghost-archive"):

SQLite data payloads will be sent to S3 along with human-readable semantic web documents.

SQLite data payloads will ALSO be sent to a Ruby on Rails system.
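
A sketch of that double delivery, uploading a SQLite payload to S3 and also POSTing it to the Rails system. The bucket, key, endpoint URL, and form field are hypothetical:

```python
import boto3
import requests

def deliver_payload(payload_path, bucket, key, rails_url):
    """Send one SQLite payload to S3 and also to the Ruby on Rails system."""
    # 1. Archive copy on S3.
    boto3.client("s3").upload_file(payload_path, bucket, key)
    # 2. Copy for the Rails side; the endpoint and field name are assumptions.
    with open(payload_path, "rb") as fh:
        resp = requests.post(rails_url, files={"payload": fh})
    resp.raise_for_status()

deliver_payload(
    "ghostarc_payload.db",
    "ghostarc-archive",
    "payloads/ghostarc_payload.db",
    "http://example.com/payloads",  # hypothetical Rails endpoint
)
```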

Original comment by berlin.b...@gmail.com on 10 Apr 2008 at 1:54

GoogleCodeExporter commented 9 years ago
More additions

RabbitMQ or another queueing system for bot messaging.
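
A minimal sketch of publishing a bot message over RabbitMQ with pika; the queue name and message format are assumptions:

```python
import json

import pika  # assumes a RabbitMQ broker running on localhost

# Publish a single bot message onto a work queue.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="botlist.messages", durable=True)  # hypothetical queue name
channel.basic_publish(
    exchange="",
    routing_key="botlist.messages",
    body=json.dumps({"bot": "laughing_man", "action": "crawl", "url": "http://example.com"}),
)
connection.close()
```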

Original comment by berlin.b...@gmail.com on 10 Apr 2008 at 9:02

GoogleCodeExporter commented 9 years ago
Move the ad listings to the forums. We are deprecating aspects of this application and focusing on the web data.

Original comment by berlin.b...@gmail.com on 11 Apr 2008 at 12:26

GoogleCodeExporter commented 9 years ago
Build graphs for documentation with Graphviz.
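
One possible sketch using the Python graphviz bindings, with nodes taken from the pieces mentioned in this issue; the diagram layout itself is just an illustration:

```python
from graphviz import Digraph  # pip install graphviz; also needs the Graphviz binaries

# Documentation diagram for the archive/crawler pipeline discussed in this issue.
dot = Digraph("ghostarc", comment="botlist 0.6 archive pipeline")
dot.node("crawler", "Crawler agent (EC2, MySQL)")
dot.node("payload", "SQLite payload (gzipped)")
dot.node("s3", "Amazon S3 archive")
dot.node("rails", "Ruby on Rails system")
dot.edge("crawler", "payload")
dot.edge("payload", "s3")
dot.edge("payload", "rails")
dot.render("ghostarc_pipeline", format="png")  # writes the .gv source and a PNG
```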

Original comment by berlin.b...@gmail.com on 11 Apr 2008 at 2:38