greenelab / pubtator

Retrieve and process PubTator annotations
43 stars 9 forks

Tutorial for its usage #23

Open kcmtest opened 4 years ago

kcmtest commented 4 years ago

Can you put up a tutorial for its usage? I do see the repository, but I'm confused about what I'm supposed to run. The web version of PubTator is straightforward: I just put in PMIDs and it returns the results. I would be glad if you could put up a tutorial.

I ran this

bash execute.sh
wget: download/bioconcepts2pubtatorcentral_offset.gz.log: No such file or directory

but this exists here: "https://github.com/greenelab/pubtator/blob/master/download/bioconcepts2pubtator_offsets.gz.log"

I'm not sure what I'm doing wrong.

danich1 commented 4 years ago

execute.sh cannot find the download folder, which is why you are getting the "No such file or directory" error. Please make sure you are running bash execute.sh at the same level as the download folder.

kcmtest commented 4 years ago

"bash execute.sh at the same level as the download folder" — so should I be inside my download folder? You mean I can simply go into my download folder and run it?

danich1 commented 4 years ago

"bash execute.sh at the same level as the download folder" — so should I be inside my download folder? You mean I can simply go into my download folder and run it?

No. You should be outside of the download folder. The current directory should look like this:

data/
download/
mapper/
scripts/
execute.sh
... (other files)

Then run bash execute.sh. The download process should work from there. Just tested the download feature and it works for me.

kcmtest commented 4 years ago

To clear up my confusion: I have to copy each of your folders, arranged the way you have them, and only then can I run it? I was thinking that I would simply run execute.sh and it would work.

danich1 commented 4 years ago

"I was thinking that I would simply run execute.sh and it would work."

Right, the intended goal here is for execute.sh to handle everything. Did you clone the repository, or just download the individual file? Forgive my confusion, but I'm not sure how things are set up on your end.

kcmtest commented 4 years ago

"Did you clone the repository" — I will clone it and will update you. Thank you for clarifying.

danich1 commented 4 years ago

"I will clone it and will update you. Thank you for clarifying."

No problem. Please let me know if you run into any other issues.

kcmtest commented 4 years ago

I cloned it and it's running. How large will the download be? And once it's done, do I have to re-run it every now and then?

danich1 commented 4 years ago

"How large will the download be? And once it's done, do I have to re-run it every now and then?"

The file should be about 18+ GB. I say 18+ because PubTator Central updates their server monthly; therefore, your downloaded file should be at least 18 GB to be correct.

kcmtest commented 4 years ago

Thank you for the information. Once it's done I will be back with questions to bug you again.

With regards

kcmtest commented 4 years ago

I will have to read the README properly before I come back to you. I will run the included test example first. I think the download is finished, but for the last couple of hours this has been running; I don't think it's downloading anything, so what is it doing?

[Screenshot from 2020-08-04 12-37-23]

kcmtest commented 4 years ago

The bash script has been running for about 33 hours as of now. Is it expanding the file, or what exactly is going on? Since I'm not sure, I haven't terminated the process. I would be glad if you could tell me.

danich1 commented 4 years ago

"The bash script has been running for about 33 hours as of now. Is it expanding the file, or what exactly is going on?"

The 33-hour process is my pipeline converting PubTator Central's annotations into XML format to be processed later. It is a large file that can take up to a few days to fully process. There is no solution here but to wait until all the pieces have completed.
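For intuition on why this takes so long: the annotations arrive as one multi-gigabyte stream of documents separated by blank lines, which has to be read end to end before conversion finishes. A rough sketch of that streaming pattern (illustrative only; the repository's actual parser in scripts/ may differ):

```python
def iter_documents(stream):
    """Yield one document at a time from a PubTator-style offsets stream,
    where each document is a block of lines separated by blank lines.
    Illustrative sketch only; not the repository's actual parser.
    """
    block = []
    for line in stream:
        line = line.rstrip("\n")
        if line:
            block.append(line)
        elif block:
            yield block
            block = []
    if block:  # the last document may not be followed by a blank line
        yield block
```

In practice the stream would come from something like `gzip.open(path, "rt")`, so the whole archive never has to fit in memory at once.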

kcmtest commented 4 years ago

Unfortunately the machine was restarted. It seems I have to do it all again, or can it resume from where it left off?

danich1 commented 4 years ago

"It seems I have to do it all again, or can it resume from where it left off?"

The older version of the code required you to start from scratch. The newly updated version allows you to restart from anywhere in the pipeline. I highly recommend using the newly updated version and reading its docs; it could make your life easier when restarting the parsers.

kcmtest commented 4 years ago
30988903it [58:10:58, 147.95it/s] 
30988894it [11:25:20, 753.61it/s]  
1097it [2:10:37,  7.14s/it]
sys:1: DtypeWarning: Columns (4,10) have mixed types. Specify dtype option on import or set low_memory=False.
1097it [1:44:13,  5.70s/it]
274it [10:05:12, 132.53s/it]
Traceback (most recent call last):
  File "scripts/download_full_text.py", line 124, in <module>
    download_full_text(args.input, args.document_batch, args.temp_dir)
  File "scripts/download_full_text.py", line 58, in download_full_text
    response = call_api(query)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 113, in wrapper
    return func(*args, **kargs)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 80, in wrapper
    return func(*args, **kargs)
  File "scripts/download_full_text.py", line 21, in call_api
    raise Exception(response.text)
Exception

Is this an error or something else? Do let me know; I'm not sure.

danich1 commented 4 years ago

This error was generated because PubTator Central's server sent back an error code. I don't know what caused it, so my suggestion is to try rerunning that part of the pipeline, and if the error comes up again I'll take a look.
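Transient server errors like this can often be absorbed by retrying the call with a growing delay. A hedged sketch of that pattern (`call_with_retries` and its parameters are hypothetical helpers, not part of the repository's code):

```python
import time

def call_with_retries(fn, retries=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception.
    Hypothetical helper: the repository's call_api rate-limits requests
    but does not retry on server errors by itself.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))
```

A wrapper like this turns an intermittent server hiccup into a short pause instead of a crashed multi-day run.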

kcmtest commented 4 years ago

"I don't know what caused it, so my suggestion is to try rerunning that part of the pipeline, and if the error comes up again I'll take a look." I simply ran this:

bash execute.sh

Shall I run this again?

danich1 commented 4 years ago

No, don't do that. Run this command:

 python scripts/download_full_text.py \
    --input data/pubtator-pmids-to-pmcids.tsv \
    --document_batch 100000 \
    --output data/pubtator-central-full-text.xml

If you run bash execute.sh you will restart everything, which is not ideal.

kcmtest commented 4 years ago

Thank you for the immediate help.

This is what I got after running the above command. Sorry for asking these fundamental questions; I mostly use R, so I'm not sure about the errors:

download_full_text.py: error: the following arguments are required: --temp_dir

I made a new folder for it and it's running:

python scripts/download_full_text.py --input data/pubtator-pmids-to-pmcids.tsv --document_batch 100000 --output data/pubtator-central-full-text.xml --temp_dir /run/media/punit/data4/tupa/
0it [00:00, ?it/s]
kcmtest commented 4 years ago

The error i received after running the above

0it [02:10, ?it/s]
Traceback (most recent call last):
  File "scripts/download_full_text.py", line 124, in <module>
    download_full_text(args.input, args.document_batch, args.temp_dir)
  File "scripts/download_full_text.py", line 58, in download_full_text
    response = call_api(query)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 113, in wrapper
    return func(*args, **kargs)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 80, in wrapper
    return func(*args, **kargs)
  File "scripts/download_full_text.py", line 21, in call_api
    raise Exception(response.text)
Exception: <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Submitted URI too large!</title>
<link rev="made" href="mailto:info@ncbi.nlm.nih.gov" />
<style type="text/css"><!--/*--><![CDATA[/*><!--*/ 
    body { color: #000000; background-color: #FFFFFF; }
    a:link { color: #0000CC; }
    p, address {margin-left: 3em;}
    span {font-size: smaller;}
/*]]>*/--></style>
</head>

<body>
<h1>Submitted URI too large!</h1>
<p>

    The length of the requested URL exceeds the capacity limit for
    this server. The request cannot be processed.

</p>
<p>
If you think this is a server error, please contact
the <a href="mailto:info@ncbi.nlm.nih.gov">webmaster</a>.

</p>

<h2>Error 414</h2>
<address>
  <a href="/">www.ncbi.nlm.nih.gov</a><br />
  <span>Apache</span>
</address>
</body>
</html>
danich1 commented 4 years ago

Basically the program is sending too many IDs to be processed at once. Change document_batch to 100 or 1000 and run again. The default parameter is too high for PubTator Central's API.
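The "Error 414" above means the request URL, which carries one ID per document in the batch, exceeded the server's length limit, so smaller batches keep the URL short enough. The splitting itself is just list chunking, roughly (names and sizes here are illustrative, not the repository's exact code):

```python
def batch_ids(ids, batch_size=100):
    """Split a list of document IDs into fixed-size batches so that each
    request URL stays under the server's length limit.
    Illustrative sketch; batch_size mirrors the --document_batch flag.
    """
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]
```

With batch_size=100, a list of 250 IDs becomes three requests instead of one oversized URL.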

kcmtest commented 4 years ago

"Basically the program is sending too many IDs to be processed at once. Change document_batch to 100 or 1000 and run again. The default parameter is too high for PubTator Central's API."

Okay, I will try smaller numbers.

kcmtest commented 4 years ago
python scripts/download_full_text.py --input data/pubtator-pmids-to-pmcids.tsv --document_batch 100 --output data/pubtator-central-full-text.xml --temp_dir /run/media/punit/data4/tupa/
38it [1:26:01, 135.83s/it]
Traceback (most recent call last):
  File "scripts/download_full_text.py", line 124, in <module>
    download_full_text(args.input, args.document_batch, args.temp_dir)
  File "scripts/download_full_text.py", line 58, in download_full_text
    response = call_api(query)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 113, in wrapper
    return func(*args, **kargs)
  File "/home/punit/anaconda3/envs/pubtator/lib/python3.8/site-packages/ratelimit/decorators.py", line 80, in wrapper
    return func(*args, **kargs)
  File "scripts/download_full_text.py", line 21, in call_api
    raise Exception(response.text)
Exception

Please do have a look.

I checked the folder and I do see XML files: 38 files totaling around 553 MB.

danich1 commented 4 years ago

For ease of debugging, please upload this file: data/pubtator-pmids-to-pmcids.tsv. I'll need it so I can see what's causing the issue.

kcmtest commented 4 years ago

"For ease of debugging, please upload this file: data/pubtator-pmids-to-pmcids.tsv."

Sorry for the late reply; I'm doing it now. I will share a link since the file is more than 10 MB: https://drive.google.com/file/d/1G-6ehkeR_V8IhqiBryCMVe1jGc9GPB8Y/view?usp=sharing

kcmtest commented 4 years ago

Hello sir, I would be glad to know what was going wrong on my side.

cgreene commented 4 years ago

Hi @krushnach80 - you have encountered a research project that is in progress but on someone's back burner at the moment. If you need faster responses, it sounds like you might be better served by directly interacting with the PubTator API or similar: https://www.ncbi.nlm.nih.gov/research/pubtator/

kcmtest commented 4 years ago

Thank you, sir. I found something that would be easier for me: https://cran.rstudio.com/web/packages/pubtatordb/vignettes/pubtatordb.html

But I would love to use your tool as well.