spartakos87 / greek_sites_crawler

Program that can crawl plenty of Greek sites
GNU General Public License v3.0
14 stars, 4 forks

Any suggestions? #6

Closed spartakos87 closed 7 years ago

spartakos87 commented 7 years ago

Any suggestions for sites to add?

fkolokathi commented 7 years ago

altsantiri.gr http://www.press-gr.com/ madata.gr karfitsa.gr paraskhnio.gr gossip-tv.gr fimotro.gr neolaia.gr thestival.gr athensvoice.gr voicenews.gr tanea.gr fimes.gr http://www.inewsgr.com/troktiko.htm http://newpost.gr/ http://flashnews.gr/ start.gr http://www.freepen.gr/ http://www.sdna.gr/ http://www.crashonline.gr/ http://www.trelokouneli.gr/ real.gr http://www.espressonews.gr/

And some blogposts:

http://tro-ma-ktiko.blogspot.gr/ http://kafeneio-gr.blogspot.com/ http://www.press-gr.com/ http://myblogs.gr/

For events:

https://www.rockap.gr/ http://www.rocking.gr/agenda http://www.culturenow.gr/ http://www.clickatlife.gr/ http://www.athinorama.gr/ https://www.musicity.gr/ https://sinavlia.gr/ http://www.avopolis.gr/

You can find some more suggestions here:

http://koutakia.gr/nea.html https://www.topgr.gr/index.php?category=3


*** For me skai.gr still does not work. Moreover, instead of catching an exception I check whether the topic or the text is None. With the following code I do not get errors on pages that have no text. For example:

    def alfavita(html):
        # Return a marker string instead of raising when the topic or the body is missing.
        if html.find("div", {"class": "field-item even"}) is None:
            return "empty page"
        elif html.find("div", {"class": "field field-name-body field-type-text-with-summary field-label-hidden"}) is None:
            return "empty page"
        else:
            topic = html.find("div", {"class": "field-item even"}).text
            title = html.title.text
            article = html.find("div", {"class": "field field-name-body field-type-text-with-summary field-label-hidden"}).text
            # The publish time is the part before the '|' separator.
            publish_time = html.find("span", {"class": "uk-text-muted uk-text-small"}).text.split('|')[0]
            return {'topic': topic,
                    'title': title,
                    'article': article,
                    'publish_time': publish_time}

Finally, I believe you should check the encoding of the HTML, because the text (and also the date, title and topic) has some issues, and for some sites it is not readable because it contains a lot of strings like \xa0 (apart from \t and \n, which are easy to strip out).
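One possible way to deal with the \xa0, \t and \n debris mentioned above is to normalize the extracted text before returning it. This is only a minimal sketch: `clean_text` is a hypothetical helper, not part of greek_sites_crawler, and it assumes the text has already been decoded correctly into a Python string.

```python
import unicodedata


def clean_text(text):
    """Hypothetical helper: turn \\xa0, tabs and newlines into single spaces."""
    # NFKC maps compatibility characters such as the no-break space (\xa0)
    # to their plain equivalents (here, an ordinary space).
    normalized = unicodedata.normalize("NFKC", text)
    # split() with no arguments splits on any run of whitespace,
    # so joining with one space collapses tabs, newlines and repeats.
    return " ".join(normalized.split())


print(clean_text("first\xa0line\t\n second   line"))
```

Note that NFKC also rewrites a few other compatibility characters (ligatures, for example), so it may not be appropriate if the exact original characters matter.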

spartakos87 commented 7 years ago

OK, that is a lot of work :+1: Do you want to join me and develop it together?

fkolokathi commented 7 years ago

I'd like to, but I have another project for my diploma thesis. I need the news for one part of my thesis, to use it for machine learning tasks, so I do not have a lot of time to collaborate with you. But if I find something useful, such as something about the encoding, I will tell you. I am not as specialized in HTML and CSS as you are. I use the newspaper3k library to download text, but I cannot get the publish date from it, which I need for my task. I will mention your library in my thesis!

spartakos87 commented 7 years ago

Thanks for the mention. I will try to solve the issues and add crawlers for the sites you just mentioned. If you need anything, just send me an email.

fkolokathi commented 7 years ago

Would you extract the publish date in the same format for all websites?
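To illustrate what a common date format could look like, here is a rough sketch that converts a few date layouts to ISO 8601 (YYYY-MM-DD). The function name `normalize_date` and the list of candidate formats are assumptions made for this example; the real sites may use other layouts (including Greek month names) that would need their own handling.

```python
from datetime import datetime

# Assumed formats for illustration only; extend per site as needed.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d", "%d/%m/%Y %H:%M")


def normalize_date(raw):
    """Hypothetical helper: return the date as 'YYYY-MM-DD', or None if unknown."""
    cleaned = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None


print(normalize_date("14/07/2017"))
```

Returning None for unrecognized input keeps the failure visible to the caller instead of silently guessing a date.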

fkolokathi commented 7 years ago

There is a problem with skai: something with the indexes raises "list index out of range". I tried to fix it but did not manage to. Also, fimotro runs only through the fimotro function and not through get_crawler.

fkolokathi commented 7 years ago

Moreover, I believe you need to put a pause between requests if you want to make a lot of them.
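A pause between requests could be sketched roughly as follows. `fetch_politely` is a hypothetical helper, and `fetch` stands for whatever function downloads one URL (it is passed in, so nothing here depends on a particular HTTP library); the one second default delay is an arbitrary assumption.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Hypothetical helper: call fetch(url) for each URL, pausing between calls."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results


# Usage with a stand-in fetch function and a very short delay.
pages = fetch_politely(["page-1", "page-2"], fetch=lambda url: url.upper(), delay_seconds=0.01)
print(pages)
```

Passing `sleep` as a parameter is only a convenience so the delay can be replaced in tests.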

fkolokathi commented 7 years ago

CNN does not work, and neither does skai. I made some corrections to skai and it runs, but it does not return the publish date:

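    import re

    def skai(html):
        # Return a marker string instead of raising when a required element is missing.
        if html.find("h3", {"class": "section-title"}) is None:
            return "empty page"
        elif html.find('article') is None:
            return "empty page"
        elif html.find("meta", {"name": 'publish-date'}) is None:
            return "empty page"
        else:
            topic = str(html.find("h3", {"class": "section-title"})).split('>')[1].split('<')[0]
            # Strip the surrounding <title> tags from the stringified element.
            title = str(html.find('title')).replace('<title>', '').replace('</title>', '')
            article = html.find('article').text
            # Note: a <meta> tag has no text, so .text is empty and findall returns [].
            # The date is probably in the tag's "content" attribute (not verified against skai.gr).
            publish_time = str(re.findall(r'\d{2}/\d{2}/\d{4}', html.find("meta", {"name": 'publish-date'}).text))
            return {'topic': topic,
                    'title': title,
                    'article': article,
                    'publish_time': publish_time}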

spartakos87 commented 7 years ago

Fixed it. The bug was in neolaia.py.

spartakos87 commented 7 years ago

And tanea.gr already exists.