janastu / IIHS-TNUSSP-feed

This repository contains information about the use cases and tests performed on newsrack.in for IIHS.

iihs_sources_v3: Not all Economic Times sources may be configured, hence it shows up under missed articles #16

Open salus-sage opened 7 years ago

salus-sage commented 7 years ago

Valid RSS URLs can be found by viewing the page source of each category. Test case: I'm configuring Politics and Nation to test it out: http://economictimes.indiatimes.com/news/politics-and-nation/rssfeeds/1052732854.cms The key pattern missing from the config is /category/sub-cat/rssfeeds/id.cms
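A minimal sketch of how that pattern could be checked, assuming the feed id is always numeric and that every feed URL has exactly a category and a sub-category segment (both are assumptions drawn from the single test URL above). The names `FEED_URL` and `parse_feed_url` are hypothetical, not part of the newsrack config.

```python
import re

# Assumed shape: /<category>/<sub-cat>/rssfeeds/<numeric id>.cms
FEED_URL = re.compile(
    r"^https?://economictimes\.indiatimes\.com"
    r"/(?P<category>[^/]+)/(?P<subcat>[^/]+)/rssfeeds/(?P<feed_id>\d+)\.cms$"
)


def parse_feed_url(url):
    """Return the category, sub-category and feed id, or None if the URL does not fit."""
    match = FEED_URL.match(url)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_feed_url(
        "http://economictimes.indiatimes.com/news/politics-and-nation/rssfeeds/1052732854.cms"
    ))
```

URLs that do not fit the shape, such as the rss.cms index page itself, return `None`, so they can be flagged for a manual look rather than silently configured.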

khushpreet-kaur commented 7 years ago

Structure added to the sheet.

salus-sage commented 7 years ago

These are sources with RSS, so why should they go on the sheet?

khushpreet-kaur commented 7 years ago

@salus-sage True. Economic Times provides ~156 links on this page: http://economictimes.indiatimes.com/rss.cms

With such a large URL list, I would prefer a script over manual crawling. (To verify, I did it manually: it took ~2.15hr just to crawl, copy-paste, and format the links in sources_v3 for ET's 156 URLs, which I feel is no longer required!)

On top of that, my understanding is that if we have a script doing all the crawling and generating, why not update the script instead of editing the sources directly? This is a more reliable method anyway: individually pasted URLs may change in future, and the sources file may go stale.

What do you think?
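A rough sketch of what such a script might look like, using only the Python standard library. It assumes the rss.cms index page exposes the feeds as plain `<a href="...">` links ending in `/rssfeeds/<id>.cms`; the real page markup has not been checked here, and `FeedLinkParser` and `extract_feed_links` are hypothetical names. Fetching the page is left out so the extraction step can be tested on saved HTML.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

INDEX_URL = "http://economictimes.indiatimes.com/rss.cms"
# Assumption: feed links end in /rssfeeds/<numeric id>.cms
FEED_HREF = re.compile(r"/rssfeeds/\d+\.cms$")


class FeedLinkParser(HTMLParser):
    """Collects anchor hrefs that look like feed URLs, in page order, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and FEED_HREF.search(href):
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.links:
                self.links.append(url)


def extract_feed_links(html, base_url=INDEX_URL):
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The returned list could then be written into the sources file by the script, so a changed URL only needs a re-run rather than a manual edit.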

salus-sage commented 7 years ago

But the crawlers are meant to generate RSS, not just to scrape links. For that, do you remember V used an online service? I don't want more links in the crawler list until we crack the regex.

(Sent by email on 10 Feb 2017, 10:18 a.m., in reply to Khushpreet's comment above.)

salus-sage commented 7 years ago

In total about 149 categories are available, but the configuration on newsrack is partial. Expect this source to show up in the missed articles category during testing.
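One way to see which categories are still unconfigured is a simple difference between the available feed URLs and the ones already in the sources file. This is only a sketch: `missing_feeds` is a hypothetical helper, and it assumes both lists hold full feed URLs that differ at most by surrounding whitespace or a trailing slash.

```python
def normalize(url):
    """Trim whitespace and a trailing slash so equivalent URLs compare equal."""
    return url.strip().rstrip("/")


def missing_feeds(available, configured):
    """Return the available feed URLs that are not yet configured, in their original order."""
    configured_set = {normalize(u) for u in configured}
    return [u for u in available if normalize(u) not in configured_set]
```

Any URL this returns is a candidate explanation for an article landing in the missed articles category.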