uvacw / inca


volkskrant scraper not working #393

Closed FeLoe closed 6 years ago

FeLoe commented 6 years ago

Running the volkskrant scraper throws an error; this needs to be fixed.

damian0604 commented 6 years ago

We cannot reproduce the error; the scraper seems to work fine. @FeLoe, is this still an issue on your system? If so, can you post more info on the bug? Otherwise, please close the issue. Thanks!

damian0604 commented 6 years ago

It also works on the server and in @mariekevh's virtual box.

FeLoe commented 6 years ago

Hm, whenever I run the scraper (myinca.rssscrapers.volkskrant(save = True)), I get this:

ValueError                                Traceback (most recent call last)
<ipython-input-5-08f10522bb1c> in <module>()
----> 1 myinca.rssscrapers.volkskrant(save = True)

~/inca_test/inca/inca/__main__.py in endpoint(*args, **kwargs)
    255                     else:
    256                         def endpoint(*args, **kwargs):
--> 257                             return method(*args, **kwargs)
    258                     return endpoint
    259 

~/inca_test/inca/inca/core/document_class.py in runwrap(self, action, *args, **kwargs)
     37         '''
     38         if action == 'run':
---> 39             return self.run(*args, **kwargs)
     40 
     41         if action == 'delay':

~/inca_test/inca/inca/core/scraper_class.py in run(self, save, *args, **kwargs)
     74         logger.info("Started scraping")
     75         if save == True:
---> 76             for doc in self.get(save, *args, **kwargs):
     77                 if type(doc)==dict:
     78                     doc = self._add_metadata(doc)

~/inca_test/inca/inca/scrapers/rss_scraper.py in get(self, save, **kwargs)
     86                 # do not want to look something up in the database. We therefore also retrieve it in
     87                 # that case.
---> 88                 if save==False or check_exists(_id)[0]==False:
     89                     try:
     90                         req=urllib2.Request(link, headers={'User-Agent' : "Wget/1.9"})

~/inca_test/inca/inca/core/database.py in check_exists(document_id)
     61     index = elastic_index
     62     try:
---> 63         retrieved = client.get(elastic_index,doc_type='_all', id=document_id)
     64         logger.debug('elastic_index {index} - document [{document_id}] found, return document'.format(**locals()))
     65         return True, retrieved

/usr/local/lib/python3.5/dist-packages/elasticsearch/client/utils.py in _wrapped(*args, **kwargs)
     74                 if p in kwargs:
     75                     params[p] = kwargs.pop(p)
---> 76             return func(*args, params=params, **kwargs)
     77         return _wrapped
     78     return _wrapper

/usr/local/lib/python3.5/dist-packages/elasticsearch/client/__init__.py in get(self, index, doc_type, id, params)
    407         for param in (index, doc_type, id):
    408             if param in SKIP_IN_PATH:
--> 409                 raise ValueError("Empty value passed for a required argument.")
    410         return self.transport.perform_request('GET', _make_path(index,
    411             doc_type, id), params=params)

ValueError: Empty value passed for a required argument.

This never happens with the other scrapers. Or am I doing something wrong?

mariekevh commented 6 years ago

We tried it with save=False. Could that be why it works? (I don't have my laptop with me right now, so I can't test it.)

FeLoe commented 6 years ago

Well, with save = False it always worked for me ;) The traceback also points to an Elasticsearch problem (which I don't understand, because that should not be specific to the Volkskrant scraper?)
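
For what it's worth, the save=False vs. save=True difference fits the condition at line 88 of rss_scraper.py in the traceback: with save=False the `or` short-circuits, so check_exists() is never called and the Elasticsearch client never sees the arguments. The sketch below only illustrates that behaviour. It is not inca's or elasticsearch-py's actual code: check_exists here is a simplified mock with no cluster behind it, the SKIP_IN_PATH values and the "inca" index name are assumptions, and the traceback does not tell us which argument (index or id) is actually empty.

```python
# Hypothetical stand-ins for illustration; not the real inca / elasticsearch-py code.
SKIP_IN_PATH = (None, "", b"", [], ())  # assumed set of values treated as "empty"

calls = []  # records every id that reaches the mocked lookup


def check_exists(document_id, index="inca"):
    """Mock lookup that repeats the argument check seen in the traceback."""
    calls.append(document_id)
    for param in (index, "_all", document_id):
        if param in SKIP_IN_PATH:
            raise ValueError("Empty value passed for a required argument.")
    return False, None  # pretend the document is not in the index yet


def should_retrieve(save, _id):
    """Same condition as rss_scraper.py line 88 in the traceback."""
    # With save=False the left operand is True, so check_exists() is skipped.
    return save == False or check_exists(_id)[0] == False


if __name__ == "__main__":
    print(should_retrieve(False, None))  # no lookup happens at all
    try:
        should_retrieve(True, None)      # empty id reaches the lookup
    except ValueError as err:
        print("save=True with an empty id:", err)
```

If this is what happens, printing the _id (and elastic_index) right before the check_exists() call in a save=True run should show which value is empty for the Volkskrant feed.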