opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Script for grounding/normalising targets and diseases extracted from literature using OTAR APIs #996

Closed tsantosh7 closed 3 years ago

tsantosh7 commented 4 years ago

Please generate a script that takes in a JSON series of targets (genes/proteins) and diseases and provides a database ID reference (grounding/normalisation) utilising the OTAR APIs.

The input for the script can be found at https://drive.google.com/file/d/1jx9hNCIE7hsnJAlmq3W-xEvolv2taggZ/view?usp=sharing

discarded format: https://drive.google.com/open?id=1NdVamYvHg32C6tvuaHTiFsSXas0Qh4vO

d0choa commented 4 years ago

Hi @tsantosh7. I hope you are well. We might start looking at this task this week.

I'm assuming the above file is for cases where GP and DS co-occur. We talked about generating this file before the target–disease co-occurrence step; in that case, we could have papers where only one of the entities occurs. It would also be good to include the "drug/chemical" entity in a sample file.

Could you generate a sample file similar to the above with these characteristics? In the meantime we will use the previous file.

tsantosh7 commented 4 years ago

Hi @d0choa Thank you and hope you are doing well too!

Yes, the above file is where GP and DS co-occur, and it is generated after the ML false-positive removal. I will try to include drug/chemical entities and generate a new JSONL file with the aforementioned characteristics. Meanwhile, please use the previous file to get things started.

mkarmona commented 4 years ago

@tsantosh7 should I expect any string normalisation before this file is generated? I picked a random example and can see the entities are not clean enough:

{
  "PMC1241207": {
    "GP": [
      "T-helper type 1",
      "T(H)1",
      "NF-Kappa B",
      "NF-Kappa B.",
      "nuclear factor-kappaB"
    ],
    "DS": [
      "infections",
      "viral upper respiratory tract infections",
      "nonallergic",
      "Asthma",
      "asthma"
    ]
  }
}
tsantosh7 commented 4 years ago

@mkarmona there is no string normalisation in place. Those strings are extracted exactly as they appear in the text written by the authors. Do you want me to group similar strings together, if that makes it easier for you?

@d0choa By the way, I am working on extracting drugs/chemicals as well, so @mkarmona will receive another sample file with drugs/chemicals added by the end of this week.

mkarmona commented 4 years ago

@tsantosh7 you can forget about doing any string transformation, but the JSON you are outputting is not the most suitable for me. Could you output it in the following way? Note that you are currently producing a dump of a dict where the key is the value itself; I need a proper common key, for example pmid. So the serialised JSON should be one object per line, with keys pmid, GP and DS, and perhaps another for the list of drug tokens. (The example below is pretty-printed only to make it more readable; keep serialising one object per line as you currently do.) Thanks

{
  "pmid": "PMC1241207",
  "GP": [
      "T-helper type 1",
      "T(H)1",
      "NF-Kappa B",
      "NF-Kappa B.",
      "nuclear factor-kappaB"
    ],
  "DS": [
      "infections",
      "viral upper respiratory tract infections",
      "nonallergic",
      "Asthma",
      "asthma"
  ]
}
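For illustration, the requested shape can be sketched as follows (a minimal Python sketch with made-up example data; the field names follow the example above, and real output should go through a proper JSON serialiser):

```python
import json

# One record per line ("JSON Lines"): a common key ("pmid") instead of
# using the article id itself as the dict key.
def to_json_line(pmid, gp, ds):
    return json.dumps({"pmid": pmid, "GP": gp, "DS": ds})

line = to_json_line("PMC1241207", ["T(H)1"], ["asthma"])
```

Each record then occupies exactly one line of the output file, which is what makes the dataset consumable line-by-line at scale.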
mkarmona commented 4 years ago

@tsantosh7 it would be great if you could generate the same file in the format I need, so I can keep working on it while you finish the one with drugs included. Thanks in advance!

tsantosh7 commented 4 years ago

@mkarmona here is the updated file in the format you asked for: https://drive.google.com/file/d/1jx9hNCIE7hsnJAlmq3W-xEvolv2taggZ/view?usp=sharing

I will now try to include drugs as well, in the same format you asked for. I am generating this on the whole submission data, so it will take a bit of time. Thank you for your patience :)

mkarmona commented 4 years ago

@tsantosh7 here you have a run from the last file you gave me: mapped_terms.jsonl.gz

tsantosh7 commented 4 years ago

:+1: Thank you @mkarmona. I will have a discussion with @aravindvenkatesan and get back to you

mkarmona commented 4 years ago

@tsantosh7 I generated another file. This iteration adds drug filtering, plus a few other things I caught. Thanks!

mapped_terms.jsonl.gz

mkarmona commented 4 years ago

@tsantosh7 I improved the mapping, and I am now including the mapped terms alongside the raw data, so you can see where each mapping comes from. I do not interfere with your detection: I keep the direct mappings of the entities you classified separate from the cross-entity matches I detect.

mapped_terms.jsonl.gz

tsantosh7 commented 4 years ago

Thank you @mkarmona for giving us 3 options. I have a meeting with @aravindvenkatesan in a few hours; I will discuss it with him and settle on a format. For now, I think the first option looks easiest for mapping the terms on our side. Either way, I will update you before the close of play today.

mkarmona commented 4 years ago

Perfect @tsantosh7. Please consider this latest file, as we think it could be the most appropriate for everyone. IMHO it is the least opinionated of all the attempts.

tsantosh7 commented 4 years ago

@mkarmona I just compared all three. I agree :+1: that the last option looks good too. However, I will confirm later today.

mkarmona commented 4 years ago

@tsantosh7 here is a more elaborate file, along with what each field means: mapped_terms.jsonl.gz

{
  "pmid": "PMC1241358",
  "terms_mapped": [
    {
      "term_raw": "allergy",
      "term_norm": "allergy",
      "id": "EFO_0003785",
      "term_type": "disease",
      "keyword_type": "disease"
    }
  ],
  "terms_not_mapped": [
    {
      "term_raw": "IgE antibody",
      "term_norm": "igeantibody",
      "term_type": "target"
    },
...
    {
      "term_raw": "allergic diseases",
      "term_norm": "allergicdiseases",
      "term_type": "disease"
    }
  ],
  "targets_mapped": [],
  "diseases_mapped": [
    {
      "term_raw": "allergy",
      "term_norm": "allergy",
      "id": "EFO_0003785",
      "term_type": "disease",
      "keyword_type": "disease"
    }
  ],
  "drugs_mapped": [],
  "cross_mapped": []
}

term_raw and term_type come from the input file (the type is already classified there).
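For downstream consumers, the record above could be modelled with simple Python structures (field names are taken from the example; this is a hypothetical sketch for post-processing, not part of the pipeline itself):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Term:
    term_raw: str
    term_norm: str
    term_type: str
    id: Optional[str] = None            # present only for mapped terms
    keyword_type: Optional[str] = None  # present only for mapped terms

@dataclass
class MappedRecord:
    pmid: str
    terms_mapped: List[Term] = field(default_factory=list)
    terms_not_mapped: List[Term] = field(default_factory=list)
    targets_mapped: List[Term] = field(default_factory=list)
    diseases_mapped: List[Term] = field(default_factory=list)
    drugs_mapped: List[Term] = field(default_factory=list)
    cross_mapped: List[Term] = field(default_factory=list)

# Reconstructing the example record shown above:
allergy = Term("allergy", "allergy", "disease",
               id="EFO_0003785", keyword_type="disease")
rec = MappedRecord("PMC1241358",
                   terms_mapped=[allergy], diseases_mapped=[allergy])
```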

tsantosh7 commented 4 years ago

Hi @mkarmona and @d0choa

Thank you very much, this is working so well. From our analysis today, here is our feedback:

Example considered: PMC1240767

Option_1: We cannot compare the original terms with the normalised terms, because they differ; e.g. APLP-2 vs aplp2, and Alzheimer disease vs alzheimerdisease.

Option_2: It's nice to see the drug filter added to the JSON stream. However, it has the same drawback as option 1: the raw terms are missing.

Option_3: This format solves the aforementioned issues. The raw (original) terms are mapped to the normalised ones, making it easy to compare and link at our end. However, it would be nice if you could also add the full list of entities originally received by the API; this will enable us to run some validations later on.

Other issues:

  1. What about those terms that have no entity linking, e.g. neuronal degeneration? Should we store all those missed entities in a database?

So, to summarise: as @mkarmona suggested, option_3 works well for both your team and ours. We would ask you to add the originally received entities as well, just to make validation easier during post-processing.

mkarmona commented 4 years ago

Great @tsantosh7, thanks. As noted in my latest iteration above, I have solved the issue you mention, so your initially provided terms are now included.

tsantosh7 commented 4 years ago

Thank you @mkarmona. Perfect!!

tsantosh7 commented 4 years ago

Hi @mkarmona

cced - @aravindvenkatesan

Thank you for the amazing work. I have a few production-level questions:

  1. How would I communicate with your service?
  2. Would you have a REST endpoint where I can POST my JSONL file? If so, could the output of your endpoint be a jsonified dict of JSON lines?

Thank you!

mkarmona commented 4 years ago

Hi @tsantosh7 and @aravindvenkatesan, thanks! Firstly:

It is currently an Apache Spark script that runs anywhere (the requirements are really few). It is currently in alpha status, so

  1. it is not yet properly wrapped in a standalone jar that anyone can run anywhere, whether locally or on a Hadoop cluster supporting Apache Spark 3.x, without any modification, and
  2. it will be placed in its own repo where you can easily download the produced jar.

The way it works: it takes two inputs, the location of the lookup tables and the location of the input files to transform (the latter being the one you generate, @tsantosh7), and you also specify an output folder. The output format is a dataset composed of multiple files in a folder, where each file contains serialised JSON lines. That way you can further consume this output for post-processing at scale.

The output you are generating to feed this script

  1. has to be serialised JSON lines, but not the way you gave it initially: it needs to use common keys, and
  2. can be one file or many; it does not matter, as long as it meets the serialised JSON-lines format.

This comment clarifies this point further
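The JSON-lines requirement can be sanity-checked with a naive sketch like the one below (illustrative only, with assumed common keys; it just verifies each line is a self-contained object carrying those keys):

```python
import json

# Naive check: every line of a JSON Lines file must parse on its own
# as an object containing the agreed common keys.
def looks_like_json_lines(lines, required_keys=("pmid", "GP", "DS")):
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            return False
        if not isinstance(obj, dict) or not all(k in obj for k in required_keys):
            return False
    return True
```

A pretty-printed object split over several lines fails this check, which is exactly the difference between the initial dump and the requested format.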

Regarding your question about where to put the dataset you produce: for now it can be anywhere that we can access, so we can download it and check everything works properly. You could potentially upload it to one of our GCS buckets.

tsantosh7 commented 4 years ago

@mkarmona Thank you; please give me a week's time to deal with some practical issues.

mkarmona commented 4 years ago

@tsantosh7 sure! Please tell me if I can help with any scalability or ETL-related issue. You have a pair of hands here if needed.

tsantosh7 commented 4 years ago

@mkarmona and @d0choa Sorry for getting back late. I am still working on extracting the drugs and chemicals from all the data; now that I have the datasets created for it, I will try to develop the model today. On the other hand, I have been thinking about how to integrate @mkarmona's normalisation program into this architecture. I previously thought it was a REST API service, but it seems different from what I had planned. Nevertheless, I feel this is good in the sense that it is faster than calling it as a service. In addition, I am also concerned about accuracy: for instance, the normalised terms disregard punctuation. I would like to know more about the normalisation process, please.

If you are available @mkarmona, could we have a brief meeting on Tuesday? I am available any time from 9 to 5:30.

mkarmona commented 4 years ago

@tsantosh7 yes, happy to meet on Tuesday. Please put a GMeet on the calendar and count me in. Regarding:

I am more concerned about the accuracy as well. For instance, the normalised terms do not care for any punctuation. I would like to know more about the normalisation process

Can you please include some examples of this for us to discuss? The transformations I do are quite minimal and try to keep the nature of the word, but things like uppercase letters, spaces, trailing punctuation or excessive parentheses stop a pure string match from working. I am not currently implementing any fuzzy/similarity matching algorithm, as I don't want to introduce another variable into the system.

The code that cleans the string is quite simple, though:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lower, translate, trim}

def normalise(c: Column): Column =
  lower(translate(trim(trim(c), "."), "[]{}()'- ", ""))

@tsantosh7 do you think it makes sense to remove plurals in order to match a disease?
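For intuition, a plain-Python equivalent of that Spark column expression (assuming identical behaviour; this is a sketch, not the production code) would be:

```python
def normalise(s: str) -> str:
    # Mirror of lower(translate(trim(trim(c), "."), "[]{}()'- ", "")):
    # trim whitespace, strip leading/trailing dots, delete brackets,
    # braces, parentheses, quotes, hyphens and spaces, then lowercase.
    return s.strip().strip(".").translate(str.maketrans("", "", "[]{}()'- ")).lower()
```

So, for example, "NF-Kappa B." and "NF-Kappa B" both normalise to the same string, which is what lets the pure string match work despite the trailing punctuation.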

mkarmona commented 4 years ago

@tsantosh7 it would be great if you could generate a full set without drugs, to check performance and accuracy on my side at least. The more iterations we do, the better the integration and the sooner we catch potential problems.

tsantosh7 commented 4 years ago

@mkarmona Thanks for this. I will generate it without drugs first; you will have it ready by Monday.

tsantosh7 commented 4 years ago

@mkarmona Please find the GP and DS tags from all the articles: https://drive.google.com/file/d/1IpOK5hj1DnWdr3ycrs_ztnOOOgAqGtWN/view?usp=sharing

mkarmona commented 4 years ago

Thanks, @tsantosh7! I have already mapped them. It took a couple of minutes on my laptop, but it generates 14GB in total, so I am compressing the files and uploading them to Drive to share the results with you. Unfortunately, it will take more time to upload than to compute.

tsantosh7 commented 4 years ago

That was pretty quick @mkarmona. I will try to send you drugs as early as possible.

mkarmona commented 4 years ago

@tsantosh7 here is the first mapping pass

tsantosh7 commented 4 years ago

Thank you @mkarmona

mkarmona commented 4 years ago

@aravindvenkatesan @tsantosh7 here are some counts from the resulting mappings

d0choa commented 4 years ago

A problem with "intrinsic" being picked up by the ML and grounded by this pipeline, due to the word "intrinsic" appearing as a synonym of the disease, has been reported to MONDO.

tsantosh7 commented 4 years ago

@d0choa There was no ML filtering for diseases, so this was purely the dictionary output. The ML tagged it as a whole: "urethral intrinsic sphincter deficiency". We could discuss this in our monthly meeting.

By the way, @mkarmona's normalisation is working well. This opens up our analysis to introducing ML-based annotations into the pipeline as well, which can then be normalised using @mkarmona's script. Very exciting stuff!!

mkarmona commented 4 years ago

@tsantosh7 thanks. The moment I have drugs incorporated, I will also commit a few improvements, such as plural and Greek-letter transformations. These two fixes will bring in a few new terms that are detected many times across the literature.

tsantosh7 commented 4 years ago

@mkarmona cheers, I am on it.

tsantosh7 commented 4 years ago

@mkarmona and @d0choa, as discussed, Here are the abstracts with GP, DS and CD tags https://drive.google.com/file/d/1OyvzMk971iX9d5Ohx_tSCEMgBqeUyWPb/view?usp=sharing

Something like this:

{
  "pmid": "PMID31911178",
  "GP": [
    "NR2B",
    "Bcl-2",
    "Bax",
    "p38"
  ],
  "DS": [
    "MMPs",
    "depression"
  ],
  "CD": [
    "ferulate",
    "glutamate",
    "Coniferyl ferulate",
    "lactate",
    "malondialdehyde"
  ]
}

Another example

{
  "pmid": "PMID31911267",
  "GP": [
    "TNF-α",
    "IL-6",
    "IL-1β"
  ],
  "DS": [
    "neurodegenerative diseases"
  ],
  "CD": [
    "princepin",
    "americanin B",
    "sesquineolignans",
    "diverniciasin B",
    "dineolignans",
    "isoprincepin",
    "isoamericanin A",
    "phenylpropanoids",
    "neolignans",
    "diverniciasin C",
    "isodiverniciasin A",
    "monolignan",
    "steroid"
  ]
}

The model was not scaling to full-text articles due to a size issue. I have just understood how to solve this bug, so I will fix it and try to send you the full-text tags close to end of play today.

Known limitations with CD tags: due to a tokenisation issue during training, a term containing a special character is sometimes split into two terms. This happens with '(' and '-'. I will fix this next week and regenerate the tags.

@mkarmona let me know how you find the CDs.

Best wishes Santosh

mkarmona commented 4 years ago

@tsantosh7 here are the results. They look quite good!

mkarmona commented 4 years ago

@tsantosh7 nice improvement!

@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"drugs_mapped") > 0).count 
res22: Long = 1233637L
@ terms.filter(size($"targets_mapped") === 0 and size($"diseases_mapped") > 0 and size($"drugs_mapped") > 0).count 
res23: Long = 235435L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") === 0 and size($"drugs_mapped") > 0).count 
res24: Long = 585695L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0).count 
res25: Long = 2996961L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"terms_not_mapped") === 0).count 
res26: Long = 1812479L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"terms_not_mapped") > 0).count 
res27: Long = 1184482L
tsantosh7 commented 4 years ago

@mkarmona Just to update you: the program is still running on the full-text articles. It has been 66 hrs and only 60% of the articles have been processed. I will try to get the output to you by tomorrow, assuming they finish processing.

@d0choa I learnt some lessons during this; I can talk about them during our meeting this month.

  1. Running the CD model (or the GP/DS models) on every sentence is time-consuming on full text, so it is best to run the CD model only on sentences that have GP or DS tags.
  2. If we want to introduce FNs from GP and DS, it is better to combine the GP, DS and CD datasets and train a new, lighter model. We can think about this for the next version.
  3. For this version, it is better to run the CD model only on those sentences that have GP or DS tags.
mkarmona commented 4 years ago

Please @tsantosh7, can you reply with a simple list of your abbreviations and what they stand for?

CD, GP, DS, FN....

tsantosh7 commented 4 years ago

@mkarmona Sure

CD: Chemical/Drug
GP: Gene/Protein
DS: Disease
FN: False Negatives (the ones that are missed by the dictionaries)

AsierGonzalez commented 4 years ago

What about DL?

d0choa commented 4 years ago

We can iterate, but we will eventually need to run the CD model on all sentences (not only the GP- or DS-tagged ones). It's the only way we can list the articles that mention a particular drug/chemical in isolation or in the context of other entities.

tsantosh7 commented 4 years ago

@AsierGonzalez DL is Deep Learning. I have now replaced it with CD to make it specific to this context.

tsantosh7 commented 4 years ago

@d0choa Thank you, I totally understand. It takes a lot of extra time, but it's nonetheless doable, so I will try to run on all sentences for this iteration too. For the next iteration, it is best that we produce 1) one standard model by fusing the CD dataset with GP and DS, and 2) a lighter version of the current DL model.

tsantosh7 commented 4 years ago

@mkarmona it took more than 5 days to process 90% of the full-text articles. Here is the file: https://drive.google.com/file/d/10k4q1t0ZxOusZhe3fYmyQW7mTjOMp8-Y/view?usp=sharing

Kindly run your analysis on this while I investigate ways to make the model run faster.

Here is the time taken for processing all the full text: https://drive.google.com/file/d/1tleKve8giDOIJWCsC_9DzAsjLCXFS9Uc/view?usp=sharing

mkarmona commented 4 years ago

@tsantosh7 here is the result of the full-papers run and some conditional counts

@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"drugs_mapped") > 0).count 
res3: Long = 949540L

@ terms.filter(size($"targets_mapped") === 0 and size($"diseases_mapped") > 0 and size($"drugs_mapped") > 0).count 
res4: Long = 318032L

@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") === 0 and size($"drugs_mapped") > 0).count 
res5: Long = 24724L

@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0).count 
res6: Long = 1092721L

@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"terms_not_mapped") === 0).count 
res7: Long = 94244L

@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"terms_not_mapped") > 0).count 
res8: Long = 998477L

@ terms.count 
res9: Long = 1929399L
tsantosh7 commented 4 years ago

@mkarmona would you have time next week (Monday) for a call with me and Aravind?