greenelab / pubtator

Retrieve and process PubTator annotations

error processing bioconcepts2pubtator_offsets.gz #18

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi

I wanted to try out pubtator, but running execute.sh gave an error:

~/pubtator$ python scripts/pubtator_to_xml.py --documents download/bioconcepts2pubtator_offsets.gz --output data/pubtator-docs.xml.xz
5079543it [2:57:28, 370.11it/s]
Traceback (most recent call last):
  File "scripts/pubtator_to_xml.py", line 205, in <module>
    convert_pubtator(args.documents, args.output)
  File "scripts/pubtator_to_xml.py", line 164, in convert_pubtator
    for article in tqdm.tqdm(article_generator):
  File "/home/ksoh/anaconda3/envs/pubtator/lib/python3.8/site-packages/tqdm/std.py", line 1093, in __iter__
    for obj in iterable:
  File "scripts/pubtator_to_xml.py", line 131, in read_bioconcepts2pubtator_offsets
    yield pubtator_stanza_to_article(g)
  File "scripts/pubtator_to_xml.py", line 101, in pubtator_stanza_to_article
    annts = list(annts)
  File "/home/ksoh/anaconda3/envs/pubtator/lib/python3.8/csv.py", line 111, in __next__
    row = next(self.reader)
_csv.Error: line contains NUL
5079543it [2:57:28, 477.03it/s]

Please advise. Thank you.

danich1 commented 4 years ago

Greetings,

It looks like there are one or more stray NUL characters within the bioconcepts2pubtator_offsets.gz file, which causes the csv module to throw an error. I'm not sure whether this is a version issue or a file reader issue, but a quick fix is to replace the following line of code in the pubtator_to_xml.py script:

annts = csv.DictReader(lines[2:], fieldnames=['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id'], delimiter="\t", quoting=csv.QUOTE_NONE)

with

fixed_lines = [str_with_null.replace('\x00', '') for str_with_null in lines[2:]]
annts = csv.DictReader(fixed_lines, fieldnames=['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id'], delimiter="\t", quoting=csv.QUOTE_NONE)

This fix removes every NUL byte from the annotation lines, wherever it appears, and assumes those bytes carry no meaning. If the error occurs again, I will look into other possible solutions.
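A minimal, self-contained sketch of the same workaround, in case it helps to try it outside the script. The field names are the ones from the snippet above; the helper names `strip_nul` and `parse_annotations` and the sample rows are made up for illustration and are not part of pubtator_to_xml.py.

```python
import csv

# Field names taken from the DictReader call in pubtator_to_xml.py.
FIELDNAMES = ['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id']


def strip_nul(lines):
    """Yield each line with every NUL byte removed (hypothetical helper)."""
    for line in lines:
        yield line.replace('\x00', '')


def parse_annotations(lines):
    """Parse tab-separated annotation lines after stripping NUL bytes."""
    reader = csv.DictReader(
        strip_nul(lines),
        fieldnames=FIELDNAMES,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    return list(reader)


# Illustrative rows with NUL bytes placed mid-line, not real file content.
sample = [
    "12345\t0\t5\tBRCA1\x00\tGene\t672\n",
    "12345\t10\t\x0016\tcancer\tDisease\tD009369\n",
]
rows = parse_annotations(sample)
```

Because the generator cleans one line at a time, it avoids building a second copy of the annotation list, which may matter little per article but keeps the change small.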

ghost commented 4 years ago

great! it works!

Is there a way to use pubtator to parse a text into its tags and their frequencies?

Thank you.

danich1 commented 4 years ago

Take a look at the extract_tags.py script. It is designed to extract tags from the pubtator XML file. The command to use is:

python scripts/extract_tags.py \
  --input data/pubtator-docs.xml.xz \
  --output data/pubtator-tags.tsv.xz

Once the process has finished you can easily count the frequency of tags.
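As a rough sketch of that counting step, assuming the output is an xz-compressed TSV with a header row: the function name `count_tag_frequencies` is made up here, and the column names passed to it are placeholders, so check the header of your own pubtator-tags.tsv.xz before using them.

```python
import csv
import lzma
from collections import Counter


def count_tag_frequencies(path, columns):
    """Count how often each combination of `columns` occurs in an
    xz-compressed TSV file that has a header row (hypothetical helper)."""
    counts = Counter()
    with lzma.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            counts[tuple(row[column] for column in columns)] += 1
    return counts


# Example call; "type" and "identifier" are assumed column names:
# counts = count_tag_frequencies("data/pubtator-tags.tsv.xz", ["type", "identifier"])
# for key, n in counts.most_common(10):
#     print(key, n)
```

Reading row by row keeps memory use proportional to the number of distinct tags rather than the size of the file.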

ghost commented 4 years ago

Hi David

Thank you, I saw that script.

I was wondering if there's a way to process a non-xml text or string?

danich1 commented 4 years ago

You mean the tag extraction part, correct? Currently, we don't have a pure text parser implemented. We were only concerned with extracting tags from pubtator; however, this doesn't rule out a future extension.

ghost commented 4 years ago

Would you be able to point me to the appropriate functions to look at? Basically, I want to perform the annotation function in pubtator: i) identify bio-entities and ii) identify relationships between entities.

Thank you.

danich1 commented 4 years ago

It should be pretty straightforward if you look at the function within the extract_tags.py script. All it does is open the compressed file and then have the etree package do the parsing. In your case you won't need the etree library, as it is specifically designed to parse XML-like tags. Instead, you will just handle the raw text and parse it based on your needs.
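To make that concrete, here is a hedged sketch that counts tags straight from a PubTator-format text stanza without going through XML. It assumes the layout used earlier in this thread (a title line, an abstract line, then tab-separated annotation lines) and only reads annotations that are already present; it does not recognize entities in unannotated text. The function name and the sample stanza are invented for illustration.

```python
import csv
from collections import Counter

# Same field names as the DictReader call in pubtator_to_xml.py.
FIELDNAMES = ['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id']


def count_tags_in_stanza(stanza_text):
    """Count (type, tag_id) pairs in one PubTator-format stanza.

    Assumes the first two lines are the title and abstract and every
    remaining line is a tab-separated annotation (hypothetical helper).
    """
    lines = stanza_text.strip().splitlines()
    annotation_lines = [line.replace('\x00', '') for line in lines[2:]]
    reader = csv.DictReader(
        annotation_lines,
        fieldnames=FIELDNAMES,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    return Counter((row['type'], row['tag_id']) for row in reader)


# Invented stanza for illustration only.
stanza = (
    "1234|t|BRCA1 and breast cancer\n"
    "1234|a|BRCA1 mutations raise breast cancer risk.\n"
    "1234\t0\t5\tBRCA1\tGene\t672\n"
    "1234\t10\t23\tbreast cancer\tDisease\tD001943\n"
    "1234\t24\t29\tBRCA1\tGene\t672\n"
)
tag_counts = count_tags_in_stanza(stanza)
```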

ii) identify relationships between entities.

I want to clarify that this project is only designed to identify tags. It cannot detect relationships between entities. You'd have to look elsewhere for that kind of detection or manually inspect the results.