question of collect_counts extracted titles

skytguuu commented 2 years ago

Hi,

Sorry to bother you. When I used 'collect_counts' to fetch the articles, it successfully showed the number of papers that include each term. However, I want to know the titles of the articles that correspond to those numbers. I tried print(meta_dat), but it does not include the titles. Here is my code:

##################
coocs, term_counts, meta_dat = collect_counts(terms_a=terms_a, terms_b=terms_b, db='pubmed', save_and_clear=True, usehistory=True, verbose=True)
##################

Is there a way to extract the article titles that correspond to each number? Thanks for your help! Best

TomDonoghue commented 2 years ago

The counts collection returns counts of co-occurrences, not article data like titles. The words collection is what collects article data such as titles.

If you want the titles of articles with term co-occurrences, you can organize the search terms accordingly, for example by using the "inclusions" feature to restrict a words search to articles that also contain the other terms of interest.
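A minimal sketch of that approach, assuming `collect_words` accepts `terms` and `inclusions` as equal-length lists of synonym lists and that each returned result exposes a `titles` attribute. The helper names `build_pairs` and `collect_pair_titles` are hypothetical and not part of LISC. Each term from the first list is paired with one inclusion from the second, so each search only returns articles that mention both:

```python
def build_pairs(terms_a, terms_b):
    """Pair every term in terms_a with every term in terms_b.

    Both inputs are lists of synonym lists, e.g. [['oogenesis', 'oogenetic']].
    Returns two equal-length lists: search terms and their matching inclusions.
    """
    terms, inclusions = [], []
    for term_a in terms_a:
        for term_b in terms_b:
            terms.append(term_a)
            inclusions.append(term_b)
    return terms, inclusions


def collect_pair_titles(terms_a, terms_b, **collect_kwargs):
    """Collect article titles for each A-B pair (needs LISC and network access)."""
    # Imported lazily so build_pairs can be used without LISC installed.
    from lisc.collect import collect_words

    terms, inclusions = build_pairs(terms_a, terms_b)
    results, meta_data = collect_words(terms=terms, inclusions=inclusions,
                                       db='pubmed', **collect_kwargs)

    # Assumption: each entry in results exposes the collected titles as `.titles`.
    return {(term[0], incl[0]): articles.titles
            for term, incl, articles in zip(terms, inclusions, results)}
```

Keep in mind that a words collection returns at most `retmax` articles per search, so the number of titles can be smaller than the corresponding count.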

skytguuu commented 2 years ago

Hi,

Thanks for your help. Following your advice, I used collect_words with 'inclusions=term' to collect the titles, and it works. But I have another question about the search. Because I have a lot of terms to search, the collection gets interrupted by a connection error after searching several terms. The error is as follows:

###################
ConnectionError: HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /entrez/eutils/esearch.fcgi?db=pubmed&usehistory=y&retmax=100&field=TIAB&retmode=xml&term=(%22oogenesis%22OR%22+oogenetic%22)AND(%22CIDs00000019%22OR%222,3-dihydroxybenzoic+acid%22) (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001732850F910>: Failed to establish a new connection: [WinError 10060]
#######################

The problem seems to be that my IP was blocked by NCBI. Is there a way to solve this? I tried using time.sleep(5), but it failed again. Could you give me a hand?

Thank you so much! Best

TomDonoghue commented 2 years ago

Ah, yeah, there are definitely some robustness improvements that could be made here, so that the collection copes better with missed requests.

Could you let me know approximately how many terms you are searching for, and after how many it tends to fail?

As of right now:

Are you using an EUtils API key? This increases the rate at which you are allowed to send requests, and might make it less likely for you to get blocked.
One option would be to search more slowly (fewer requests per minute). If you initialize your own requester object with a longer wait time (maybe 0.5 seconds or more, since the default without an API key is 0.33) and pass it in as logging=my_requester, the collection will use this wait time.
Another option is to try breaking up the collection into multiple, smaller jobs.
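The three options above can be combined. The following is a rough sketch rather than a tested recipe: the helper names `chunked` and `collect_in_batches`, the batch size, and the `NCBI_API_KEY` environment variable are all assumptions made for illustration, while the `Requester(wait_time=...)` and `logging=` usage follows the commands shown in this thread.

```python
import os


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_in_batches(terms, inclusions, batch_size=50, wait_time=1.0):
    """Run collect_words on small batches with a slow requester (needs LISC)."""
    # Imported lazily so chunked() can be used without LISC installed.
    from lisc.collect import collect_words
    from lisc.requester import Requester

    # Hypothetical environment variable; None means unauthenticated requests.
    api_key = os.environ.get('NCBI_API_KEY')

    all_results = []
    batches = zip(chunked(terms, batch_size), chunked(inclusions, batch_size))
    for term_batch, incl_batch in batches:
        # A fresh requester per batch, waiting `wait_time` seconds between requests.
        requester = Requester(wait_time=wait_time)
        results, _ = collect_words(terms=term_batch, inclusions=incl_batch,
                                   retmax=100, usehistory=True, verbose=True,
                                   logging=requester, api_key=api_key)
        all_results.extend(results)
    return all_results
```

Splitting the work this way also means a failure only costs the current batch, and each batch's results can be saved before the next one starts.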

skytguuu commented 2 years ago

I really appreciate your help. Actually, I have forty thousand terms that need to be searched. It failed after 20 terms, and sometimes after only 3 or 8 terms. I have tried using an API key and setting the requester wait time (1 s), but it failed again after 3 terms. Here is my command:

##############
my_requester = Requester(wait_time=1)
results, meta_data = collect_words(terms=keys, inclusions=term, retmax=100, usehistory=True, save_and_clear=False, verbose=True, logging=my_requester, api_key="72f729883aed8ad3942d8c4fec698a7bff09")
################

It reported the same error: "ConnectionError: ('Connection aborted.', TimeoutError(10060))". Thanks for your wonderful advice! Best

TomDonoghue commented 2 years ago

Hey @skytguuu - oops, sorry, I accidentally dropped off here. Did you have any luck getting this working? Other than what I've suggested, I don't really have any other suggestions for running this, since the issue appears to be the API connection rather than anything specific to LISC.

skytguuu commented 2 years ago

Hi @TomDonoghue, Thanks for your reply. As you suggested, I have solved this problem by setting the crawler time and adding the sleep time.
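The final code was not posted in this thread. As an illustration only, one way the "sleep time" part could look is a small retry wrapper that pauses after a connection error and then tries the same call again. The function name `call_with_retries` and its defaults are hypothetical and not part of LISC.

```python
import time


def call_with_retries(func, *args, max_attempts=5, base_delay=5.0,
                      sleep=time.sleep, **kwargs):
    """Call func, sleeping and retrying when a connection-type error is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except OSError:
            # Built-in ConnectionError and TimeoutError subclass OSError, as does
            # the ConnectionError raised by the requests library.
            if attempt == max_attempts:
                raise
            # Wait longer after each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A collection call for a single term or a small batch could then be wrapped as `call_with_retries(collect_words, terms=batch, inclusions=incl_batch)`, so that one dropped connection does not end the whole run.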

lisc-tools / lisc

question of collect_counts extracted titles #77