gt-big-data QDoc issues

Quick & Dirty Operating Crawler

issues

Log time for steps

#41 kurtcarpenter closed 8 years ago
1
albeu.com is broken

#40 kurtcarpenter closed 8 years ago
1
Fix computeFrequency.py.

#39 supersam654 closed 8 years ago
1
Deal with unicode warning.

#38 supersam654 closed 8 years ago
1
Save articles in bulk.

#37 supersam654 opened 8 years ago
0
Add a script to recrawl a single article.

#36 supersam654 opened 8 years ago
0
Add a script to recrawl an entire feed.

#35 supersam654 opened 8 years ago
0
Add a list of past crawl times to feed.

#34 supersam654 opened 8 years ago
0
Don't overwrite feed URLs.

#33 supersam654 closed 8 years ago
0
Add config file for DB, threads, etc.

#32 kurtcarpenter closed 8 years ago
0
There are multiple feeds for thehour.com

#31 simon0929 opened 8 years ago
0
Articles on thehour.com parse very slowly.

#30 supersam654 closed 8 years ago
2
Print out how long certain steps take to finish.

#29 supersam654 closed 8 years ago
0
Only allow one instance of the crawler to run at a time.

#28 supersam654 opened 8 years ago
0
Don't make DB query when parsing an article.

#27 supersam654 opened 8 years ago
1
Trim article URLs before crawling.

#26 supersam654 closed 8 years ago
3
README includes first-time mongo commands

#25 kurtcarpenter closed 8 years ago
0
Parses feeds with dc:date

#24 simon0929 closed 8 years ago
1
README doesn't show initial config

#23 kurtcarpenter closed 8 years ago
2
Support published date in RSS feeds in dc:date tag.

#22 supersam654 closed 8 years ago
2
No timestamp. This includes a file called convertToDateTime.py that changes all entries in the database to ISO dates.

#21 Stuev closed 8 years ago
1
Gettext method testing

#20 tingofurro closed 8 years ago
1
Crawler adds captions to all images in carousel to article content.

#19 supersam654 closed 8 years ago
0
Get a better image from articles.

#18 supersam654 closed 8 years ago
0
Crawl from extra sources.

#17 supersam654 closed 8 years ago
0
Remove dependency on pytz.

#16 supersam654 closed 8 years ago
1
Crawler grabs bad content for this Reuters article.

#15 supersam654 closed 8 years ago
2
Better readme

#14 supersam654 closed 9 years ago
0
Make sure the crawler doesn't break if the stamps directory is deleted.

#13 supersam654 closed 9 years ago
1
Create a list of countries by scraping Wikipedia.

#12 supersam654 closed 9 years ago
0
Remove "N hours ago by PERSON NAME" for TechCrunch articles.

#11 supersam654 closed 9 years ago
3
Make the crawler support Ctrl + C

#10 supersam654 closed 8 years ago
0
Grab articles from an API

#9 supersam654 opened 9 years ago
0
Convert all timestamp numbers to Mongo DateTime objects.

#8 supersam654 closed 8 years ago
1
Add a real logging library.

#7 supersam654 opened 9 years ago
0
Add categories for articles.

#6 supersam654 closed 8 years ago
0
Crawl links from Twitter.

#5 supersam654 closed 8 years ago
0
Add a config file.

#4 supersam654 closed 8 years ago
0
Add some tests.

#3 supersam654 opened 9 years ago
1
Create an installation guide to help people get started.

#2 supersam654 closed 9 years ago
0
Upgrade from Python 2 to Python 3.

#1 supersam654 closed 9 years ago
1