greenelab / snorkeling

Extracting biomedical relationships from literature with Snorkel 🏊

Retrieving the pre-processed Snorkel PubMed data #2

Closed dhimmel closed 7 years ago

dhimmel commented 7 years ago

In a private communication, @ajratner wrote:

we have all the PubMed articles pre-processed, tagged with some basic entities (genes, diseases, chemicals, species and mutations), and pre-loaded in Snorkel format, on an internal server; and we'd love to share with you. Do you have a preferred method of transferring large files? Otherwise I'll figure something out!

@ajratner, awesome. Are chemicals what I'm calling a compound... i.e. a small molecule that could be in DrugBank? What vocabularies are your diseases and chemicals identified in?

How big are the files? How many files are there? The ideal solution would be to use Git LFS. You could fork this repository and create a pull request which adds these files. This would require you to make the files public... and we should consider whether we need to exclude them from the repo's licensing.

ajratner commented 7 years ago

Hi @dhimmel I was just going to ask you the same question on ThinkLab... the PubTator annotations that we were using map to MeSH and/or ChEBI identifiers (e.g. see https://doi.org/10.1186/1758-2946-7-S1-S3).

I don't know about licensing, but Git LFS sounds like a good idea; I was thinking about this anyway because I need to transfer the files to some others too. I have a little cleanup to do first but I'll try to get to this soon.

dhimmel commented 7 years ago

From the tmChem paper:

Our lexicon of chemical entities and their names was collected from MeSH [32] and ChEBI [33]. The system converts both mentions from the literature and entity names in the lexicon to lowercase and removes all whitespace and punctuation. For example, "flavone-C-glycoside" becomes "flavonecglycoside." The system then assigns a MeSH identifier to those mentions which can be found in the lexicon, or a ChEBI identifier if a matching MeSH identifier cannot be found. Mentions that correspond to a short form recognized by Ab3P are assigned the same identifier as the long form found by Ab3P [29]. Mentions which do not map to a specific identifier are ignored and mentions which can be assigned to both a MeSH and ChEBI identifier are only assigned the MeSH identifier.

So it seems like we should convert our DrugBank identifiers to MeSH and ChEBI. We already have a mapping to ChEBI based on structure. I'll look into how to map DrugBank to MeSH.
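The normalization and MeSH-first lookup described in the tmChem quote above can be sketched roughly as follows. This is a minimal illustration, not tmChem's actual code: the function names and the dictionary-based lexicons are hypothetical, punctuation is assumed to mean ASCII punctuation only, and the Ab3P abbreviation step is left out.

```python
import string

# Assumption: "whitespace and punctuation" is taken to mean the ASCII sets
# from the standard library; the paper does not say how Unicode is handled.
_DROP = set(string.whitespace) | set(string.punctuation)


def normalize(name):
    """Lowercase a chemical name and remove all whitespace and punctuation."""
    return "".join(ch for ch in name.lower() if ch not in _DROP)


def assign_identifier(mention, mesh_lexicon, chebi_lexicon):
    """Return the MeSH identifier if the normalized mention is in the MeSH
    lexicon, otherwise the ChEBI identifier, otherwise None (unmapped
    mentions are ignored).

    Both lexicons are hypothetical dicts keyed by normalized names.
    """
    key = normalize(mention)
    if key in mesh_lexicon:
        return mesh_lexicon[key]  # MeSH wins when both vocabularies match
    return chebi_lexicon.get(key)
```

Under these assumptions, `normalize("flavone-C-glycoside")` gives `"flavonecglycoside"`, matching the example in the paper. The same normalization would have to be applied to the lexicon names before they are used as dictionary keys.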

dhimmel commented 7 years ago

See this comment for more information on PubTator mapping. In short:

@ajratner, so in the pre-processed / snorkel-formatted data, are all PubTator tags included? In other words, will chemicals, diseases, and genes all be tagged?

ajratner commented 7 years ago

Yeah; we're just re-running due to a format change and then will try to get it posted somewhere accessible!