nimh-dsst / sharestats-leo-notfork

importing leo's code to get around LFS restriction on forks

Current get_pmids_articles.py generating many 400 bad requests #5

Closed joshlawrimore closed 1 month ago

joshlawrimore commented 1 month ago

The recently scraped PMIDs are causing several hundred errors when calling

pma = PubMedFetcher().article_by_pmid(pmid)

see https://github.com/nimh-dsst/sharestats-leo-notfork/blob/dev_2024/scripts/get_pmids_articles.py#L31

The vast majority of errors in get_pmids_articles_error.txt look like this:

37249733: Error parsing response object from NCBI: Bad Request (400): Unknown Error

Some look like this:

36795454: Invalid ID "36795454" (rejected by Eutils); please check the number and try again.

or this

36434483: Pubmed ID "36434483" not found
: Error parsing response object from NCBI: Bad Request (400): Unknown Error

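To see how the failures split across the message types quoted above, the error file can be tallied by message. This is a minimal sketch, assuming each entry starts with `<PMID>: ` as in the samples and that a line without a leading PMID continues the previous entry; `tally_errors` is a hypothetical helper, not part of get_pmids_articles.py.

```python
import re
from collections import Counter

# Assumed entry format, based on the samples above: "<PMID>: <message>".
ENTRY = re.compile(r"^(\d+): (.*)$")


def tally_errors(lines):
    """Count error messages by kind, replacing each PMID with a placeholder."""
    entries = []  # each item is [pmid, message]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match:
            entries.append([match.group(1), match.group(2)])
        elif entries:
            # No leading PMID: treat the line as a continuation of the previous message.
            entries[-1][1] += " " + line
    counts = Counter()
    for pmid, message in entries:
        counts[message.replace(pmid, "<pmid>")] += 1
    return counts


sample = [
    "37249733: Error parsing response object from NCBI: Bad Request (400): Unknown Error",
    '36795454: Invalid ID "36795454" (rejected by Eutils); please check the number and try again.',
    '36434483: Pubmed ID "36434483" not found',
]
for message, count in tally_errors(sample).most_common():
    print(count, message)
```

For the real file, the same function could be fed `open("get_pmids_articles_error.txt")` in place of `sample`.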
joshlawrimore commented 1 month ago

Tracing the issue through source code of metapub see https://github.com/metapub/metapub/blob/master/metapub/pubmedfetcher.py#L91

result = self.qs.efetch({'db': 'pubmed', 'id': pmid})

qs is defined by

self.qs = get_eutils_client(self._cache_path)

which is a function defined here https://github.com/metapub/metapub/blob/master/metapub/eutils_common.py#L15

which is simply constructing an eutils QueryService as defined here: https://github.com/biocommons/eutils/blob/main/src/biocommons/eutils/_internal/queryservice.py#L47

I am going to call the eutils QueryService directly with PMIDs that are confirmed to exist on PubMed, to try to recreate the errors.
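A minimal sketch of such a reproduction run, assuming the only call needed is the `efetch({'db': 'pubmed', 'id': pmid})` shape shown above. `probe_pmids` and `fake_efetch` are hypothetical names; the stand-in keeps the example runnable offline, and with the real library the `efetch` method of a QueryService instance would be passed in its place.

```python
def probe_pmids(pmids, efetch):
    """Call efetch once per PMID and split the PMIDs into successes and failures."""
    succeeded, failed = [], {}
    for pmid in pmids:
        try:
            efetch({"db": "pubmed", "id": pmid})
        except Exception as exc:  # broad on purpose: the goal is to record every failure
            failed[pmid] = f"{type(exc).__name__}: {exc}"
        else:
            succeeded.append(pmid)
    return succeeded, failed


# Offline stand-in for QueryService.efetch (hypothetical behavior, for illustration only).
def fake_efetch(args):
    if args["id"] == "36795454":
        raise ValueError("Bad Request (400): Unknown Error")
    return b"<PubmedArticleSet/>"


ok, bad = probe_pmids(["37249733", "36795454"], fake_efetch)
print("succeeded:", ok)
print("failed:", bad)
```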

agt24 commented 1 month ago

I have a vague recollection that PubMed can be very sensitive to many rapid requests from the same IP. You might consider adding or increasing the pause between queries.
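One way to add such a pause is a small throttle that enforces a minimum gap between consecutive requests. This is a sketch only: `Throttle` is a hypothetical helper, and the 0.4 second interval is an assumed value, not something taken from get_pmids_articles.py.

```python
import time


class Throttle:
    """Block so that successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.4)  # assumed interval: about 2.5 requests per second
for pmid in ["37249733", "36795454", "36434483"]:
    throttle.wait()
    # the request for this PMID would go here
```

The clock and sleep functions are injectable so the pacing logic can be checked without real waiting.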


joshlawrimore commented 1 month ago

According to https://www.ncbi.nlm.nih.gov/books/NBK25497/:

API Keys
Since December 1, 2018, NCBI has provided API keys that offer enhanced levels of supported access to the E-utilities. Without an API key, any site (IP address) posting more than 3 requests per second to the E-utilities will receive an error message. By including an API key, a site can post up to 10 requests per second by default. Higher rates are available by request (eutilities@ncbi.nlm.nih.gov). Users can obtain an API key now from the Settings page of their NCBI account (to create an account, visit [http://www.ncbi.nlm.nih.gov/account/](https://www.ncbi.nlm.nih.gov/account/)). After creating the key, users should include it in each E-utility request by assigning it to the api_key parameter.

Example request including an API key:
esummary.fcgi?db=pubmed&id=123456&api_key=ABCDE12345

Example error message if rates are exceeded:
{"error":"API rate limit exceeded","count":"11"}
Only one API key is allowed per NCBI account; however, a user may request a new key at any time. Such a request will invalidate any existing API key associated with that NCBI account.

I wasn't getting the "API rate limit exceeded" error, but when I retried some of the PMIDs that threw the error the first time, this time using eutils directly, I got a result. I'll try putting in a sleep(1) and see what happens.
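For reference, the api_key parameter described in the NCBI excerpt above is just another query-string field. A minimal sketch that builds the documented example request with the standard library; `build_eutils_url` is a hypothetical helper, and `ABCDE12345` is the placeholder key from the NCBI example, not a real key.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(tool, params, api_key=None):
    """Return an E-utilities request URL, appending api_key when one is given."""
    query = dict(params)
    if api_key:
        query["api_key"] = api_key
    return f"{EUTILS_BASE}/{tool}.fcgi?{urlencode(query)}"


print(build_eutils_url("esummary", {"db": "pubmed", "id": "123456"}, api_key="ABCDE12345"))
```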

joshlawrimore commented 1 month ago

After experimenting with a delay before each API call in the PMID for loop, I found very little difference in the number of eutils.EutilsNCBIError errors. Instead, a limited retry strategy worked well, with most PMIDs needing only two or three retries to succeed. The updated code was pushed in 1b51da97e7496b98dc1688001e7d34f664d8bdda.
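A limited retry strategy of the kind described can be sketched as follows. This is not the code from the commit above; `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the attempt count and delay are assumed values. In the real script the `fetch` argument would presumably be `PubMedFetcher().article_by_pmid`, with `retry_on` narrowed to the eutils error type mentioned in the thread.

```python
import time


def fetch_with_retries(fetch, pmid, max_attempts=3, delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(pmid), retrying on the given exceptions up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(pmid)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)


# Offline stand-in that fails twice and then succeeds (for illustration only).
calls = {"n": 0}


def flaky_fetch(pmid):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Bad Request (400): Unknown Error")
    return f"article {pmid}"


print(fetch_with_retries(flaky_fetch, "37249733", sleep=lambda s: None))
```

Re-raising after the last attempt keeps permanently bad PMIDs visible in the error log instead of silently dropping them.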