uvacw / inca


Nick's scrapers #521

Closed annekroon closed 4 years ago

annekroon commented 4 years ago

The following scrapers are adjusted, working, and ready for production:

myinca.rssscrapers.diewelt(save=False), myinca.rssscrapers.aachenerzeitung(save=False), myinca.rssscrapers.rheinischepost(save=False), myinca.rssscrapers.stuttgarterzeitung(save=False), myinca.rssscrapers.dertagesspiegel(save=False), myinca.rssscrapers.diesueddeutsche(save=False)

These scrapers can be moved to 'master'.

We did not include 'paywall_na = TRUE', as there are sometimes videos or short news updates without body text. This also differs per newspaper: Die Welt, for example, has quite a lot of video content compared to some of the other newspapers. The Stuttgarter Zeitung RSS feed also features a city blog that the text scraping doesn't work for either (it only carries local news for the city, so it's not relevant for my purposes anyway. But still...). So the way I see it, using 'paywall_na = TRUE' would be misleading in too many cases to be of value here.

FeLoe commented 4 years ago

I'm not even a reviewer here, so feel free to ignore my comment completely :) Also, I do not think this needs to be added now, maybe later. The paywall_na variable is not about missing text but about something in the HTML that only appears if there is a paywall (for welt.de, for example, the paywall has data-testid="Offer-Card"). This way we can still see whether missing text is due to video content or to a paywall.
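
A minimal sketch of the kind of check described above, using only the Python standard library. The names `PaywallMarkerParser` and `has_paywall_marker` are hypothetical and not part of inca; the only detail taken from this thread is the `data-testid="Offer-Card"` attribute mentioned for welt.de, and the sample HTML strings are made up for illustration.

```python
from html.parser import HTMLParser


class PaywallMarkerParser(HTMLParser):
    """Records whether any start tag carries a given attribute/value pair."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr = attr
        self.value = value
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; attribute names are lowercased
        if (self.attr, self.value) in attrs:
            self.found = True


def has_paywall_marker(html, attr="data-testid", value="Offer-Card"):
    """Return True if the page contains the (assumed) paywall-only element."""
    parser = PaywallMarkerParser(attr, value)
    parser.feed(html)
    return parser.found


# Hypothetical pages: one paywalled article, one video page without body text
paywalled_page = '<article><div data-testid="Offer-Card">Subscribe</div></article>'
video_page = '<article><div class="video-player"></div></article>'

print(has_paywall_marker(paywalled_page))  # True
print(has_paywall_marker(video_page))      # False
```

Because the flag depends on the marker rather than on whether text was extracted, a video page with an empty body would not be labelled as paywalled.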

damian0604 commented 4 years ago

@FeLoe "I'm not even a reviewer here" --> now you are ;-) [in case you have time]

annekroon commented 4 years ago

@FeLoe @damian0604 do you have time to accept these? I have checked them and they look good. Cheers

FeLoe commented 4 years ago

I have accepted it now, but I am not an "authorized user", so you need Damian's approval to merge ;)

damian0604 commented 4 years ago

Done! Thank you, @FeLoe!

annekroon commented 4 years ago

Thanks, both!

annekroon commented 4 years ago

@damian0604 these changes now also need to go 'into production'. How do I merge with master now?

damian0604 commented 4 years ago

Create a PR master <- development and let me review it.


On 16 Apr 2020 at 19:57, annekroon <notifications@github.com> wrote:

@damian0604 these changes now also need to go 'into production'. How do I merge with master now?
