mediacloud/rss-fetcher
Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0
5
stars
5
forks
issues
Investigate crawling sitemaps
#41
philbudne
opened
1 month ago
7
Create private config repo?
#40
philbudne
opened
1 month ago
0
Implement Scrapy style concurrency controls?
#39
philbudne
opened
4 months ago
0
Implement undead/zombie feeds?
#38
philbudne
opened
4 months ago
1
integrate new user-agent string stored in mcmetadata
#37
rahulbot
closed
4 months ago
0
how much would a different User-Agent string help fetch feeds we aren't getting right now?
#36
rahulbot
closed
4 months ago
16
generated daily RSS file URL counts don't match stories table daily counts
#35
rahulbot
closed
4 months ago
8
investigate why npr.org stories aren't making it into database
#34
rahulbot
closed
4 months ago
12
would generating files 2-3 times a day shorten time-to-search-availability for users?
#33
rahulbot
closed
2 months ago
5
rss-fetcher generated RSS feed contains URLs without scheme
#32
philbudne
opened
6 months ago
1
rss-fetcher RSS files contain URLs for non-text files
#31
philbudne
opened
6 months ago
0
Add API endpoint for highest-story-count domains for a source
#30
philbudne
opened
8 months ago
0
Add source information to RSS output
#29
philbudne
closed
4 months ago
3
fetcher makes a lot of queries
#28
philbudne
opened
8 months ago
0
possible race in scripts.update_feeds process
#27
philbudne
opened
8 months ago
0
Index stories table by sources_id?
#26
philbudne
opened
8 months ago
2
Prune old stories by count instead of age
#25
philbudne
opened
8 months ago
1
Initial deployment of web process fails due to missing rss-output-files directory
#24
philbudne
closed
8 months ago
1
initial commit for perf improvement in tasks save_stories_from_feed
#23
opme
opened
1 year ago
3
save_stories_from_feed performance improvement
#22
opme
opened
1 year ago
0
sources without feeds
#21
opme
closed
1 year ago
1
Merge for v0.12.0
#20
philbudne
closed
1 year ago
0
Conditional HTTP fetching
#19
philbudne
closed
1 year ago
0
Import merged feeds
#18
rahulbot
closed
1 year ago
0
Use feed <sy:updatePeriod> and <sy:updateFrequency> to set feeds.next_fetch_attempt
#17
philbudne
opened
1 year ago
0
Use HTTP etag and last-modified headers to avoid fetching unchanged feed files
#16
philbudne
closed
4 months ago
3
add feature to save RSS file, and headers, to disk
#15
rahulbot
closed
1 year ago
5
rss file logging mode
#14
philbudne
closed
1 year ago
1
file occasionally not showing up on s3
#13
rahulbot
closed
1 year ago
1
incorrectly marking duplicate file hash as a fetch failure
#12
rahulbot
closed
1 year ago
0
add media_name so we can do better deduplication
#11
rahulbot
closed
1 year ago
1
why is story volume so low compared to prod system?
#10
rahulbot
closed
4 months ago
16
Prevent re-queueing of a feed already queued
#9
rahulbot
closed
5 months ago
6
be smarter about how frequently to update each feed
#8
rahulbot
closed
5 months ago
6
plan out API for front-end
#7
rahulbot
closed
7 months ago
3
deduplication: add/check normalized title hash
#6
rahulbot
closed
2 years ago
1
deduplication: add/check URL normalization before storing
#5
rahulbot
closed
2 years ago
2
setup automated rss file backups to S3
#4
rahulbot
closed
2 years ago
2
remove non-resolving URLs
#3
rahulbot
closed
1 year ago
2
compress RSS files after generating them
#2
rahulbot
closed
2 years ago
0
setup automated database backups to AWS
#1
rahulbot
closed
2 years ago
1