mediacloud rss-fetcher issues

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.

https://search.mediacloud.org/directory

Apache License 2.0

5 stars, 5 forks

issues

Investigate crawling sitemaps

#41 philbudne opened 1 month ago, 7 comments

Create private config repo?

#40 philbudne opened 1 month ago, 0 comments

Implement Scrapy style concurrency controls?

#39 philbudne opened 4 months ago, 0 comments

Implement undead/zombie feeds?

#38 philbudne opened 4 months ago, 1 comment

integrate new user-agent string stored in mcmetadata

#37 rahulbot closed 4 months ago, 0 comments

how much would a different User-Agent string help fetch feeds we aren't getting right now?

#36 rahulbot closed 4 months ago, 16 comments

generated daily RSS file URL counts don't match stories table daily counts

#35 rahulbot closed 4 months ago, 8 comments

investigate why npr.org stories aren't making it into database

#34 rahulbot closed 4 months ago, 12 comments

would generating files 2-3 times a day shorten time-to-search-availability for users?

#33 rahulbot closed 2 months ago, 5 comments

rss-fetcher generated RSS feed contains URLs without schema

#32 philbudne opened 6 months ago, 1 comment

rss-fetcher RSS files contain URLs for non-text files

#31 philbudne opened 6 months ago, 0 comments

Add API endpoint for highest-story-count domains for a source

#30 philbudne opened 8 months ago, 0 comments

Add source information to RSS output

#29 philbudne closed 4 months ago, 3 comments

fetcher makes a lot of queries

#28 philbudne opened 8 months ago, 0 comments

possible race in scripts.update_feeds process

#27 philbudne opened 8 months ago, 0 comments

Index stories table by sources_id?

#26 philbudne opened 8 months ago, 2 comments

Prune old stories by count instead of age

#25 philbudne opened 8 months ago, 1 comment

Initial deployment of web process fails due to missing rss-output-files directory

#24 philbudne closed 8 months ago, 1 comment

initial commit for perf improvement in tasks save_stories_from_feed

#23 opme opened 1 year ago, 3 comments

save_stories_from_feed performance improvement

#22 opme opened 1 year ago, 0 comments

sources without feeds

#21 opme closed 1 year ago, 1 comment

Merge for v0.12.0

#20 philbudne closed 1 year ago, 0 comments

Conditional HTTP fetching

#19 philbudne closed 1 year ago, 0 comments

Import merged feeds

#18 rahulbot closed 1 year ago, 0 comments

Use feed <sy:updatePeriod> and <sy:updateFrequency> to set feeds.next_fetch_attempt

#17 philbudne opened 1 year ago, 0 comments

Use HTTP etag and last-modified headers to avoid fetching unchanged feed files

#16 philbudne closed 4 months ago, 3 comments

add feature to save RSS file, and headers, to disk

#15 rahulbot closed 1 year ago, 5 comments

rss file logging mode

#14 philbudne closed 1 year ago, 1 comment

file occasionally not showing up on s3

#13 rahulbot closed 1 year ago, 1 comment

incorrectly marking duplicate file hash as a fetch failure

#12 rahulbot closed 1 year ago, 0 comments

add media_name so we can do better deduplication

#11 rahulbot closed 1 year ago, 1 comment

why is story volume so low compared to prod system?

#10 rahulbot closed 4 months ago, 16 comments

Prevent re-queueing of a feed already queued

#9 rahulbot closed 5 months ago, 6 comments

be smarter about how frequently to update each feed

#8 rahulbot closed 5 months ago, 6 comments

plan out API for front-end

#7 rahulbot closed 7 months ago, 3 comments

deduplication: add/check normalized title hash

#6 rahulbot closed 2 years ago, 1 comment

deduplication: add/check URL normalization before storing

#5 rahulbot closed 2 years ago, 2 comments

setup automated rss file backups to S3

#4 rahulbot closed 2 years ago, 2 comments

remove non-resolving URLs

#3 rahulbot closed 1 year ago, 2 comments

compress RSS files after generating them

#2 rahulbot closed 2 years ago, 0 comments

setup automated database backups to AWS

#1 rahulbot closed 2 years ago, 1 comment