Open ghost opened 4 years ago
Greetings,
It looks like there are one or more stray NULL characters within the bioconcepts2pubtator_offsets.gz file. This causes the csv module to throw an error. I'm not sure whether this is a version issue or a file-reader issue, but a quick fix is to replace the following line of code in the pubtator_to_xml.py script:
annts = csv.DictReader(lines[2:], fieldnames=['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id'], delimiter="\t", quoting=csv.QUOTE_NONE)
with
fixed_lines = [str_with_null.replace('\x00', '') for str_with_null in lines[2:]]
annts = csv.DictReader(fixed_lines, fieldnames=['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id'], delimiter="\t", quoting=csv.QUOTE_NONE)
This fix assumes that the null byte comes at the end of the line. If the error occurs again, I will look into other possible solutions.
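For anyone who wants to try the workaround outside the script, here is a minimal, self-contained sketch of the same idea. The helper names `strip_nulls` and `read_annotations` and the two sample lines are made up for illustration; they are not part of pubtator_to_xml.py. Only the field names and the `csv.DictReader` arguments come from the line quoted above.

```python
import csv

# Field names taken from the DictReader call quoted above
FIELDNAMES = ['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id']


def strip_nulls(lines):
    """Yield each line with every NUL character removed (hypothetical helper)."""
    for line in lines:
        yield line.replace('\x00', '')


def read_annotations(lines):
    """Parse tab-separated annotation lines after removing NUL characters."""
    return csv.DictReader(
        strip_nulls(lines),
        fieldnames=FIELDNAMES,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )


# Made-up sample lines, with a NUL inside one field and one at the end of a line
sample = [
    "123\t0\t5\tBR\x00CA1\tGene\t672\n",
    "123\t10\t16\tcancer\tDisease\tD009369\x00\n",
]
rows = list(read_annotations(sample))
print(rows[0]['term'], rows[1]['tag_id'])
```

Because `str.replace` removes the character wherever it occurs, this version does not depend on where in the line the NUL sits.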
Great! It works!
Is there a way to use pubtator to parse a text into its tags and their frequencies?
Thank you.
(In reply to David Nicholson's comment of Tue, Nov 26, 2019 at 10:18 AM.)
Take a look at the extract_tags.py script. This script is designed to extract tags from the pubtator XML file. The command to use is:
python scripts/extract_tags.py \
--input data/pubtator-docs.xml.xz \
--output data/pubtator-tags.tsv.xz
Once the process has finished you can easily count the frequency of tags.
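As a rough sketch of the counting step: the snippet below reads an xz-compressed TSV and tallies one column with `collections.Counter`. The column name `type`, the helper `count_tags`, and the small demo file are assumptions for illustration. Check the header of your own pubtator-tags.tsv.xz for the actual column names.

```python
import csv
import lzma
import os
import tempfile
from collections import Counter


def count_tags(path, column="type"):
    """Count how often each value of `column` appears in an xz-compressed TSV.

    `column` is an assumed header name; adjust it to match the real file.
    """
    counts = Counter()
    with lzma.open(path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[row[column]] += 1
    return counts


# Tiny made-up file so the sketch runs on its own
demo_path = os.path.join(tempfile.mkdtemp(), "demo-tags.tsv.xz")
with lzma.open(demo_path, mode="wt", newline="") as handle:
    handle.write("pubmed_id\ttype\tidentifier\n")
    handle.write("1\tGene\t672\n")
    handle.write("1\tDisease\tD001943\n")
    handle.write("2\tGene\t7157\n")

tag_counts = count_tags(demo_path)
print(tag_counts.most_common())
```

For the real output you would call `count_tags("data/pubtator-tags.tsv.xz", column=...)` with whichever column you want to tally.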
Hi David
Thank you and I saw that script.
I was wondering if there's a way to process a non-xml text or string?
(In reply to David Nicholson's comment of Tue, Dec 3, 2019 at 10:34 AM.)
You mean the tag extraction part, correct? Currently, we don't have a pure text parser implemented. We were only concerned with extracting tags from pubtator; however, this doesn't rule out a future extension.
Would you be able to point me to the appropriate functions to look at, basically to perform the annotation functions in pubtator: i) identify bio-entities and ii) identify relationships between entities?
Thank you.
(In reply to David Nicholson's comment of Tue, Dec 3, 2019 at 10:56 AM.)
It should be pretty straightforward if you look at the function within the extract_tags.py script. All it does is open the compressed file and then have the etree package do the parsing. In your case you won't need the etree library, as it is specifically designed to parse XML-like tags. Instead, you will just handle the raw text and parse it based on your needs.
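To give a sense of what handling raw text could look like, here is a minimal sketch that counts occurrences of known terms in a plain string. This is a simple dictionary lookup, not PubTator's tagger and not code from this repository. The function `count_known_terms`, the lexicon, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter


def count_known_terms(text, term_types):
    """Count case-insensitive whole-word matches of each known term.

    `term_types` maps a term to its tag type (a hypothetical lexicon).
    """
    counts = Counter()
    for term, tag_type in term_types.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[(term, tag_type)] += len(hits)
    return counts


# Made-up text and lexicon
text = "BRCA1 mutations raise breast cancer risk; BRCA1 testing is common."
lexicon = {"BRCA1": "Gene", "breast cancer": "Disease"}
result = count_known_terms(text, lexicon)
print(result)
```

A lookup like this only finds terms you already list, so it is a starting point rather than a replacement for a real entity tagger.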
ii) identify relationships between entities.
I want to clarify that this project is only designed to identify tags. It doesn't have the capability to detect relationships between entities. You'd have to look elsewhere for that kind of detection or manually inspect the results.
Hi
I would like to try out pubtator, but running execute.sh gave an error:
Please advise. Thank you.