greenelab / snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Convert bioconcepts2pubtator_offsets to bioC format #11

Closed danich1 closed 7 years ago

danich1 commented 7 years ago

This pull request converts PubTator's abstract annotations into PubTator's XML format and extracts the annotations in TSV format for our custom Tagger class in #10.

There is a lot going on in pubtator_to_xml.py but the central idea is to take input like this:

27117697|t|Quality assessment of studies comparing percutaneous ablative treatments in hepatocellular carcinoma.
27117697|a|
27117697    76  100 hepatocellular carcinoma    Disease MESH:D006528

And convert it into this format:

<collection>
  <source>PubTator</source>
  <key>PubTator.key</key>
  <document>
    <id>27117697</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>Quality assessment of studies comparing percutaneous ablative treatments in hepatocellular carcinoma.</text>
      <annotation id="0">
        <infon key="type"> Disease </infon>
        <location length="5" offset="6"></location>
        <text>hepatocellular carcinoma</text>
        <infon key="MESH">D006528</infon>
      </annotation>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>101</offset>
      <text></text>
    </passage>
  </document>
</collection>
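The conversion described above can be sketched in a few lines of Python. This is a minimal illustration and not the contents of pubtator_to_xml.py: the function names `parse_record` and `to_bioc` are hypothetical, annotation lines are assumed to be tab-separated, the abstract offset is assumed to be `len(title) + 1`, and each `location` is assumed to carry the start offset and `end - start` as its length.

```python
import xml.etree.ElementTree as ET


def parse_record(record):
    """Parse one PubTator record into a dict (hypothetical helper)."""
    parsed = {"pmid": None, "title": "", "abstract": "", "annotations": []}
    for line in record.strip().splitlines():
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a"):
            # Title ("t") or abstract ("a") line: PMID|t|text
            parsed["pmid"] = head[0]
            parsed["title" if head[1] == "t" else "abstract"] = head[2]
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or incomplete annotation lines
        _pmid, start, end, mention, concept_type, identifier = fields[:6]
        parsed["annotations"].append({
            "start": int(start), "end": int(end), "mention": mention,
            "type": concept_type, "identifier": identifier,
        })
    return parsed


def to_bioc(parsed):
    """Build a BioC-style <collection> element from a parsed record."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "PubTator"
    ET.SubElement(collection, "key").text = "PubTator.key"
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = parsed["pmid"]

    # Assumption: title and abstract are separated by one character.
    abstract_offset = len(parsed["title"]) + 1
    passages = {}
    for name, offset in (("title", 0), ("abstract", abstract_offset)):
        passage = ET.SubElement(document, "passage")
        ET.SubElement(passage, "infon", key="type").text = name
        ET.SubElement(passage, "offset").text = str(offset)
        ET.SubElement(passage, "text").text = parsed[name]
        passages[name] = passage

    for i, ann in enumerate(parsed["annotations"]):
        target = "title" if ann["start"] < abstract_offset else "abstract"
        node = ET.SubElement(passages[target], "annotation", id=str(i))
        ET.SubElement(node, "infon", key="type").text = ann["type"]
        ET.SubElement(node, "location",
                      length=str(ann["end"] - ann["start"]),
                      offset=str(ann["start"]))
        ET.SubElement(node, "text").text = ann["mention"]
        # "MESH:D006528" -> key "MESH", value "D006528"; identifiers
        # without a prefix fall back to the key "identifier".
        database, _, accession = ann["identifier"].rpartition(":")
        ET.SubElement(node, "infon", key=database or "identifier").text = accession
    return collection


record = (
    "27117697|t|Quality assessment of studies comparing percutaneous "
    "ablative treatments in hepatocellular carcinoma.\n"
    "27117697|a|\n"
    "27117697\t76\t100\thepatocellular carcinoma\tDisease\tMESH:D006528\n"
)
parsed = parse_record(record)
collection = to_bioc(parsed)
print(ET.tostring(collection, encoding="unicode"))
```

Using the standard-library `xml.etree.ElementTree` keeps the sketch dependency-free; a dedicated BioC library could replace `to_bioc` if one proves reliable.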

dhimmel commented 7 years ago

@danich1 for the two file formats you mention above, can you comment on where they come from and where they are consumed? Are these formats that we're choosing, or that we're obligated to use since they're related to PubTator/Snorkel?

danich1 commented 7 years ago

The first file format comes directly from PubTator's FTP site. I downloaded the bioconcepts2pubtator_offsets.gz file (corrected from bioconcepts2pubtator.gz), which contains all of the bioconcept annotations for each PubMed abstract that PubTator has stored. We are forced to deal with the raw data in this format because I don't believe there is a way around it.

As for the second file format, it is referenced in PubTator's tutorial and appears to be BioC, an XML format designed for sharing text data together with its annotations. We are free to use other formats for keeping track of these annotations; however, I believe BioC's XML format is the way to go because it is the easiest to work with. (Snorkel's CDR tutorial uses this specific format.)

dhimmel commented 7 years ago

bioconcepts2pubtator looks like a TSV to me according to the sample (ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/bioconcepts2pubtator.sample). For example, the head of the sample is:

PMID    Type    Database    Identifier  Mentions    Resource
10  Chemical    MESH|CHEBI  MESH:D004074    Digitoxin   MeSH|tmChem
1000006 Chemical    MESH|CHEBI  MESH:D004958    estradiol   MeSH|tmChem
1000006 Chemical    MESH|CHEBI  MESH:D011374    progesterone    MeSH|tmChem
1000007 Chemical    MESH|CHEBI  MESH:D004967    estrogen    MeSH
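A file with this header can be read with Python's `csv` module. The sketch below assumes the fields are tab-separated, as the sample suggests; `mentions_by_pmid` is a hypothetical helper name, and the inline `sample` string simply repeats the rows shown above.

```python
import csv
import io
from collections import defaultdict

sample = (
    "PMID\tType\tDatabase\tIdentifier\tMentions\tResource\n"
    "10\tChemical\tMESH|CHEBI\tMESH:D004074\tDigitoxin\tMeSH|tmChem\n"
    "1000006\tChemical\tMESH|CHEBI\tMESH:D004958\testradiol\tMeSH|tmChem\n"
    "1000006\tChemical\tMESH|CHEBI\tMESH:D011374\tprogesterone\tMeSH|tmChem\n"
    "1000007\tChemical\tMESH|CHEBI\tMESH:D004967\testrogen\tMeSH\n"
)


def mentions_by_pmid(handle):
    """Group (type, identifier, mention) tuples by PMID."""
    # QUOTE_NONE treats quote characters in mentions as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["PMID"]].append(
            (row["Type"], row["Identifier"], row["Mentions"])
        )
    return dict(grouped)


grouped = mentions_by_pmid(io.StringIO(sample))
for pmid, mentions in grouped.items():
    print(pmid, mentions)
```

For the real download, the same function should work on a handle from `gzip.open(path, "rt")`, since it only needs an iterable of text lines.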

Nice to see that BioC is supposedly a standardized format. It looks like they have a Python package for reading and writing BioC files. It's not clear from the GitHub repository that it's a reliable package, but it may be worth a try.

dhimmel commented 7 years ago

@danich1 I think you may have been referring to bioconcepts2pubtator_offsets (ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTator/bioconcepts2pubtator_offsets.sample). The head of this file's sample for the first article is:

27037803|t|A case of cervical esophageal duplication cyst in a newborn infant.
27037803|a|Esophageal duplication cyst is a rare congenital anomaly resulting from a foregut budding error during the fourth to sixth week of embryonic development. Cervical esophageal duplication cysts are very rare and may cause respiratory distress in infancy. A full-term newborn girl who was born by normal delivery was transferred to our hospital because of swelling of the right anterior neck since birth. Cervical ultrasonography showed a 40   *   24   *   33  mm simple cyst on the right neck. Tracheal intubation was required at 2  weeks of age because of worsening external compression of the trachea. Fine-needle aspiration cytology revealed the existence of ciliated epithelium. At 1  month of age, exploration was performed through a transverse neck incision. The cyst had a layer of muscle connected to the lateral wall of the esophagus. Histopathological diagnosis was a cervical esophageal duplication cyst. We describe the clinical features of infantile cervical esophageal duplication cysts based on our experience of this rare disease in a neonate, along with a review of 19 cases previously reported in literature.
27037803    106 124 congenital anomaly  Disease MESH:D000013
27037803    288 308 respiratory distress    Disease MESH:D012128
27037803    452 456 neck    Disease MESH:D006258
27037803    554 558 neck    Disease MESH:D006258
27037803    661 668 trachea Disease MESH:D055090
27037803    816 820 neck    Disease MESH:D006258
27037803    1099    1111    rare disease    Disease MESH:D035583
27037803    60  66  infant  Species 9606
27037803    341 345 girl    Species 9606

This file includes the title and abstract text along with the annotations.
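One way to sanity-check the offsets in this format is to slice the text and compare the result with each mention. The sketch below is an illustration under one assumption: offsets index into the title and abstract joined by a single separator character, which matches the annotations checked here. `check_offsets` is a hypothetical helper, and only the first sentence of the abstract is used.

```python
def check_offsets(title, abstract, annotations):
    """Return annotations whose offsets do not reproduce the mention."""
    # Assumption: offsets index into title + one separator + abstract.
    text = title + " " + abstract
    mismatches = []
    for start, end, mention in annotations:
        found = text[start:end]
        if found != mention:
            mismatches.append((start, end, mention, found))
    return mismatches


title = "A case of cervical esophageal duplication cyst in a newborn infant."
abstract = (
    "Esophageal duplication cyst is a rare congenital anomaly resulting "
    "from a foregut budding error during the fourth to sixth week of "
    "embryonic development."
)
annotations = [
    (60, 66, "infant"),               # falls in the title
    (106, 124, "congenital anomaly"), # falls in the abstract
]

mismatches = check_offsets(title, abstract, annotations)
print("mismatches:", mismatches)
```

An empty result means every offset pair recovers its mention, which is a cheap check to run before converting records to another format.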

dhimmel commented 7 years ago

@ajratner, what is the best format of tagged literature for Snorkel to consume?

Should we be converting the full PubTator export to BioC, or do you recommend something else?

dhimmel commented 7 years ago

PubTator parsing is now in a separate repo. https://github.com/greenelab/pubtator/pull/2 continues this pull request.