gt-big-data/QDoc
Quick & Dirty Operating Crawler
4 stars, 1 fork
issues
Log time for steps
#41
kurtcarpenter
closed
8 years ago
1
albeu.com is broken
#40
kurtcarpenter
closed
8 years ago
1
Fix computeFrequency.py.
#39
supersam654
closed
8 years ago
1
Deal with unicode warning.
#38
supersam654
closed
8 years ago
1
Save articles in bulk.
#37
supersam654
opened
8 years ago
0
Add a script to recrawl a single article.
#36
supersam654
opened
8 years ago
0
Add a script to recrawl an entire feed.
#35
supersam654
opened
8 years ago
0
Add a list of past crawl times to feed.
#34
supersam654
opened
8 years ago
0
Don't overwrite feed URLs.
#33
supersam654
closed
8 years ago
0
Add config file for DB, threads, etc
#32
kurtcarpenter
closed
8 years ago
0
There are multiple feeds for thehour.com
#31
simon0929
opened
8 years ago
0
Articles on thehour.com parse very slowly.
#30
supersam654
closed
8 years ago
2
Print out how long certain steps take to finish.
#29
supersam654
closed
8 years ago
0
Only allow one instance of the crawler to run at a time.
#28
supersam654
opened
8 years ago
0
Don't make DB query when parsing an article.
#27
supersam654
opened
8 years ago
1
Trim article URLs before crawling.
#26
supersam654
closed
8 years ago
3
README includes first-time mongo commands
#25
kurtcarpenter
closed
8 years ago
0
Parses feeds with dc:date
#24
simon0929
closed
8 years ago
1
Readme doesn't show initial config
#23
kurtcarpenter
closed
8 years ago
2
Support published date in RSS feeds in dc:date tag.
#22
supersam654
closed
8 years ago
2
No timestamp. This includes a file called convertToDateTime.py that changes all entries in the database to ISO dates.
#21
Stuev
closed
8 years ago
1
Gettext method testing
#20
tingofurro
closed
8 years ago
1
Crawler adds captions to all images in carousel to article content.
#19
supersam654
closed
8 years ago
0
Get a better image from articles.
#18
supersam654
closed
8 years ago
0
Crawl from extra sources.
#17
supersam654
closed
8 years ago
0
Remove dependency on pytz.
#16
supersam654
closed
8 years ago
1
Crawler grabs bad content for this Reuters article.
#15
supersam654
closed
8 years ago
2
Better readme
#14
supersam654
closed
9 years ago
0
Make sure the crawler doesn't break if the stamps directory is deleted.
#13
supersam654
closed
9 years ago
1
Create a list of countries by scraping Wikipedia.
#12
supersam654
closed
9 years ago
0
Remove "N hours ago by PERSON NAME" for TechCrunch articles.
#11
supersam654
closed
9 years ago
3
Make the crawler support Ctrl + C
#10
supersam654
closed
8 years ago
0
Grab articles from an API
#9
supersam654
opened
9 years ago
0
Convert all timestamp numbers to Mongo DateTime objects.
#8
supersam654
closed
8 years ago
1
Add a real logging library.
#7
supersam654
opened
9 years ago
0
Add categories for articles.
#6
supersam654
closed
8 years ago
0
Crawl links from Twitter.
#5
supersam654
closed
8 years ago
0
Add a config file.
#4
supersam654
closed
8 years ago
0
Add some tests.
#3
supersam654
opened
9 years ago
1
Create an installation guide to help people get started.
#2
supersam654
closed
9 years ago
0
Upgrade from Python 2 to Python 3.
#1
supersam654
closed
9 years ago
1