scieloorg / articlemetaapi

Library that implements access to the articlemeta API.
BSD 2-Clause "Simplified" License

exception while iterating "documents_by_identifiers()" #10

Open bnewbold opened 4 years ago

bnewbold commented 4 years ago
[...]
  File "./dump_scielo.py", line 73, in run_article_ids
    for ident in cl.documents_by_identifiers(only_identifiers=True):
  File "/home/bnewbold/scratch/ingests/scielo/.venv/lib/python3.7/site-packages/articlemeta/client.py", line 496, in documents_by_identifiers
    identifiers = self._do_request(url, params=params).get('objects', [])
AttributeError: 'NoneType' object has no attribute 'get'

Python version: 3.7
articlemetaapi version: 1.26.6

This error happens after many timeouts, maybe due to HTTP 429 back-off responses? Perhaps the result of self._do_request(url, params=params) should be assigned to a variable first and checked before .get() is called on it.
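
For illustration, the kind of check I mean around line 496 of client.py (a minimal sketch, not the actual library code, assuming _do_request() returns None once its retries are exhausted, as the traceback suggests):

    # Sketch only: check the response before calling .get() on it, so that
    # exhausted retries raise a clear error instead of an AttributeError.
    response = self._do_request(url, params=params)
    if response is None:
        raise RuntimeError('no data returned from %s after all attempts' % url)
    identifiers = response.get('objects', [])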

jamilatta commented 4 years ago

@bnewbold is this a constant error, or a sporadic one?

I would like to know whether this occurs in every run. I need to know whether you are unable to retrieve SciELO metadata at all, so that we can classify and prioritize this issue.

bnewbold commented 4 years ago

@jamilatta Thank you for your rapid reply!

This error occurred on my first attempt, after iterating through about 19,700 identifiers. Here is the script I am writing:

https://gist.github.com/bnewbold/9918634282f6013e13174badbce64a93

I am running it a second time now and have gotten past 50,000 identifiers, so this is probably sporadic. I'll note that I almost immediately get requests.exceptions.ReadTimeout errors (in both cases, trying from two separate machines). The complete failure happens when all ten attempts fail:

fail retrieving data from (http://articlemeta.scielo.org/api/v1/article/identifiers) attempt(1/10)

I assume this is due to rate limiting, as mentioned in the source. Perhaps there should be an extra delay by default to prevent these timeouts?
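
In the meantime, I may wrap the iteration with something like this on my side (a rough workaround, not part of articlemetaapi; the exception list and delay values are guesses, and it restarts the listing from the beginning after a failure):

    import time

    import requests

    def iter_identifiers_with_retry(client, pause=60, max_restarts=5):
        # Hypothetical caller-side helper: restart the iteration after a
        # failure, pausing so any rate limiting has time to clear. Note
        # that this re-reads identifiers from the start, so the consumer
        # has to deduplicate.
        restarts = 0
        while True:
            try:
                for ident in client.documents_by_identifiers(only_identifiers=True):
                    yield ident
                return
            except (requests.exceptions.ReadTimeout, AttributeError):
                restarts += 1
                if restarts > max_restarts:
                    raise
                time.sleep(pause)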

As some context, I am hoping to extract the full metadata for all 900k to 1 million articles as a JSON snapshot, to archive and include in https://fatcat.wiki, particularly for articles which do not have a DOI. If there is a more efficient way to achieve this, please let me know!

Thank you for maintaining articlemetaapi.

jamilatta commented 4 years ago

@bnewbold I will think about a way to avoid having all the attempts fail.
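
One idea is to add an increasing delay between the existing retry attempts, roughly like this (just a sketch of the approach, not the current client code; names and values are illustrative):

    import time

    import requests

    def fetch_with_backoff(url, params=None, attempts=10, base_delay=2):
        # Illustrative only: sleep with exponential back-off between retry
        # attempts, so a rate-limited server can recover before the next
        # request instead of all ten attempts failing in quick succession.
        for attempt in range(1, attempts + 1):
            try:
                response = requests.get(url, params=params, timeout=30)
                response.raise_for_status()
                return response.json()
            except requests.exceptions.RequestException:
                if attempt == attempts:
                    raise
                time.sleep(base_delay * 2 ** (attempt - 1))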

Let me talk it over with my coworkers, and I will get back to you soon.

Thanks.