maximilianh / pubMunch

various tools to download, convert and process the full text of scientific articles
https://github.com/maximilianh/pubMunch#overview

the error with pubCrawl2 #12

Closed. InfiniteSynthesis closed this issue 3 years ago.

InfiniteSynthesis commented 6 years ago

Hello, when I use the tool pubCrawl2 to crawl files for thousands of PMIDs from PubMed, it often breaks down with the error: ssl.SSLError: read operation timed out

How can I solve it? Thanks.

maximilianh commented 6 years ago

Sorry, but I need to see the complete error message. There are many places where this error could appear.


InfiniteSynthesis commented 6 years ago

The error message was like this:

Traceback (most recent call last):
  File "./pubCrawl2", line 193, in <module>
    main(args, options)
  File "./pubCrawl2", line 98, in main
    scrapeLib.crawlDocuments(docIds, skipIssns, options.forceContinue)
  File "/home/tcv/pubMunch/lib/pubCrawlLib.py", line 3457, in crawlDocuments
    paperData = crawlOneDoc(artMeta, srcDir)
  File "/home/tcv/pubMunch/lib/pubCrawlLib.py", line 3337, in crawlOneDoc
    crawlers, landingUrl = selectCrawlers(artMeta, srcDir)
  File "/home/tcv/pubMunch/lib/pubCrawlLib.py", line 3316, in selectCrawlers
    landingUrl = getLandingUrlSearchEngine(artMeta)
  File "/home/tcv/pubMunch/lib/pubCrawlLib.py", line 259, in getLandingUrlSearchEngine
    xrDoi = pubCrossRef.lookupDoi(articleData)
  File "/home/tcv/pubMunch/lib/pubCrossRef.py", line 42, in lookupDoi
    jsonStr = httpResp.read()
  File "/usr/lib/python2.7/socket.py", line 355, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 597, in read
    s = self.fp.read(amt)
  File "/usr/lib/python2.7/socket.py", line 384, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 772, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 659, in read
    v = self._sslobj.read(len)
ssl.SSLError: ('The read operation timed out',)

What's more, I have to download many PMIDs, so I cut the list of PMIDs into pieces and opened several terminals to run them at the same time. Does this cause any problems, or is there another way to accelerate the download?
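A minimal sketch of that kind of chunking, assuming one PMID per line in a plain text file (the file names and chunk size here are just examples, not part of pubMunch):

# split a big PMID list into smaller files, one per pubCrawl2 instance
chunkSize = 5000
pmids = [line.strip() for line in open("allPmids.txt") if line.strip()]
for chunkNo, start in enumerate(range(0, len(pmids), chunkSize)):
    with open("pmids.chunk%03d.txt" % chunkNo, "w") as ofh:
        ofh.write("\n".join(pmids[start:start + chunkSize]) + "\n")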

maximilianh commented 6 years ago

This happens when it's trying to contact crossref to find the DOI of the article. It looks like crossref doesn't always reply at the moment. You could simply put a try: / except: around line 42 in pubCrossRef (this line: "jsonStr = httpResp.read()") and repeat the request if you get an "ssl.SSLError". Can you do that?
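A minimal sketch of what such a retry could look like; the helper name, the retry count, and the use of urllib2 to re-issue the request are assumptions for illustration, not the actual pubCrossRef code:

import ssl
import time
import urllib2

MAX_TRIES = 3

def retryingReadJson(url):
    " return the response body for url, retrying on SSL read timeouts "
    for tryNo in range(MAX_TRIES):
        try:
            # re-open the connection on every attempt; a stale socket
            # cannot simply be read again after a timeout
            httpResp = urllib2.urlopen(url, timeout=30)
            return httpResp.read()
        except ssl.SSLError:
            # CrossRef sometimes stalls; wait a little and try again
            time.sleep(5)
    return None

In lookupDoi, the bare jsonStr = httpResp.read() could then go through a helper like this, and a None result can be treated as "no DOI found" so the crawl keeps going.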

maximilianh commented 6 years ago

As for downloading many PMIDs, you can reduce the waiting time in pubCrawl2 (there is an option for it, I think it's -t). But be careful, as the publishers may block you at some point. May I ask what you're ultimately trying to do?

InfiniteSynthesis commented 6 years ago

I added the try/except at the place you mentioned and now it seems to be working well. Thank you very much!

And as for what I am trying to do... I am an undergraduate student, and I read your article "AMELIE accelerates Mendelian patient diagnosis directly from the primary literature" by chance. I am quite interested in text mining and the whole system around medical articles (though I am still not very clear about it). So I tried to reproduce the methods behind the article. I have now downloaded the titles and abstracts from PubMed, and the classifier using OMIM and non-OMIM articles has been constructed. I found about one million articles and I am downloading their full text.

I feel quite cheerful since you helped me solve a problem that had troubled me for a long time. =v=

maximilianh commented 6 years ago

Hey, that's great to hear, awesome that you got it to run!

Could you tell me exactly which change you made? If you feel adventurous, you could even send me a pull request: https://help.github.com/articles/creating-a-pull-request/


InfiniteSynthesis commented 6 years ago

I have sent you the pull request. I now have 12 terminals executing pubCrawl2 at the same time. Before I made this change, all the terminals would break down after one night, but now they are still running well.

However, I am not familiar with the json module (in fact, I am not very familiar with Python either), so there may be some mistakes.

Emmm, it took me a long time to test it. Sorry for replying so late.

maximilianh commented 6 years ago

This looks great, I've merged it. Let me know how your crawl goes.
