Hi @tsantosh7. I hope you are well. We might start looking at this task this week.
I'm assuming the above file is for cases where GP and DS co-occur. We talked about generating this file before the target-disease co-occurrence; in that case, we could have papers where only one of the entities occurs. It would also be good to be able to include the "drug/chemical" entity in a sample file.
Could you generate a similar sample file to the above with these characteristics? We will in the meantime use the previous file.
Hi @d0choa Thank you and hope you are doing well too!
Yes, the above file is where GP and DS co-occur, and yes, it will then come after the ML false-positive removal. I will try to include drug/chemical entities and generate a new JSONL file with the aforementioned characteristics. Meanwhile, please use the previous file to get things started.
@tsantosh7 should I expect any string normalisation before generating this file? I picked a random example and I can see the entities are not clean enough:
{
"PMC1241207": {
"GP": [
"T-helper type 1",
"T(H)1",
"NF-Kappa B",
"NF-Kappa B.",
"nuclear factor-kappaB"
],
"DS": [
"infections",
"viral upper respiratory tract infections",
"nonallergic",
"Asthma",
"asthma"
]
}
}
@mkarmona there is no string normalisation in place. Those strings are extracted exactly as they appear in the text written by the authors. Do you want me to group similar strings together if that makes it easier for you?
@d0choa By the way, I am working on extracting drugs/chemicals as well, so @mkarmona will receive another sample file with drugs/chemicals added by the end of this week.
@tsantosh7 you can forget about doing any string transformation, but the JSON you are outputting is not the most suitable for me. Can you output it in the following way? Please note that you are currently producing a dump of a dict where the key is the value itself, and I need a proper common key, for example pmid.
So the serialised JSON (pretty-printed here just to make it more readable; keep serialising it as you currently do) is one object per line, where the keys are pmid, GP and DS, and perhaps another for the list of drug tokens. Thanks
{
"pmid": "PMC1241207",
"GP": [
"T-helper type 1",
"T(H)1",
"NF-Kappa B",
"NF-Kappa B.",
"nuclear factor-kappaB"
],
"DS": [
"infections",
"viral upper respiratory tract infections",
"nonallergic",
"Asthma",
"asthma"
]
}
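As a side note, one practical reason for this layout: Spark's JSON reader consumes one object per line and exposes each key as a column, so the file can be loaded straight into a dataset. A quick illustration (path is illustrative; run in spark-shell, where spark and $ are in scope):

import org.apache.spark.sql.functions.explode

// JSON-lines with a common "pmid" key load directly as a tabular dataset
val df = spark.read.json("terms_sample.jsonl")

// e.g. one row per (pmid, GP term) pair for downstream joins
df.select($"pmid", explode($"GP").as("gp_term")).show()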
@tsantosh7 it would be great if you could generate the same file in the format I need, so I can keep working on it while you finish the one with drugs included. Thanks in advance!
@mkarmona here is the updated file in the format you asked for: https://drive.google.com/file/d/1jx9hNCIE7hsnJAlmq3W-xEvolv2taggZ/view?usp=sharing
I will now try to include drugs as well, in the same format you asked for. I am generating this on the whole submission data, so it will take a bit of time. Thank you for your patience :)
@tsantosh7 here you have a run from the last file you gave me: mapped_terms.jsonl.gz
:+1: Thank you @mkarmona. I will have a discussion with @aravindvenkatesan and get back to you
@tsantosh7 I generated another file; this is another iteration that contains drug filtering, along with a few things I found. Thanks!
@tsantosh7 I improved the mapping, and I am also including the mapped terms alongside the raw data so you can see where each one comes from. I do not interfere with your detection, so I keep direct mappings (entities you classify) separate from the cross-mapped entities I detect.
Thank you @mkarmona for giving us three options. I have a meeting with @aravindvenkatesan in a few hours; I will discuss with him and come up with a format. For now I think the first option looks easiest for mapping the terms on our side. Anyway, I will update you before the close of play today.
Perfect @tsantosh7. Please consider this latest file, as we think it could be the most appropriate for all. IMHO it is the least opinionated of all the attempts.
@mkarmona I just compared all three. I agree :+1: that the last option looks good too. However, I will confirm later today.
@tsantosh7 here is a more elaborate file, along with what each field means: mapped_terms.jsonl.gz
{
"pmid": "PMC1241358",
"terms_mapped": [
{
"term_raw": "allergy",
"term_norm": "allergy",
"id": "EFO_0003785",
"term_type": "disease",
"keyword_type": "disease"
}
],
"terms_not_mapped": [
{
"term_raw": "IgE antibody",
"term_norm": "igeantibody",
"term_type": "target"
},
...
{
"term_raw": "allergic diseases",
"term_norm": "allergicdiseases",
"term_type": "disease"
}
],
"targets_mapped": [],
"diseases_mapped": [
{
"term_raw": "allergy",
"term_norm": "allergy",
"id": "EFO_0003785",
"term_type": "disease",
"keyword_type": "disease"
}
],
"drugs_mapped": [],
"cross_mapped": []
}
pmid: PubMed ID, the unique field across the dataset
terms_mapped: the full list of terms provided and the full list of entities mapped to the dictionaries, whether or not their types are the same
terms_not_mapped: the list of terms that could not be mapped to any element of the full dictionary
targets_mapped: subset of terms_mapped where both types are the same and == target
diseases_mapped: subset of terms_mapped where both types are the same and == disease
drugs_mapped: subset of terms_mapped where both types are the same and == drug (currently not used)
cross_mapped: the terms that were mapped but whose types are not the same
term_raw and term_type (already classified in the input file) are provided by the input file
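To illustrate the subset rules above with Spark's higher-order array functions, a sketch, assuming the output has been loaded as a DataFrame df in spark-shell (the helper name is illustrative, not the actual pipeline code):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.filter

// keep mapped terms whose detected type agrees with the dictionary type
// and equals the requested kind ("target", "disease" or "drug")
def subsetMapped(kind: String): Column =
  filter($"terms_mapped", t =>
    t.getField("term_type") === t.getField("keyword_type") &&
      t.getField("term_type") === kind)

val withSubsets = df
  .withColumn("targets_mapped", subsetMapped("target"))
  .withColumn("diseases_mapped", subsetMapped("disease"))
  .withColumn("drugs_mapped", subsetMapped("drug"))
  // cross_mapped keeps the ones whose two types disagree
  .withColumn("cross_mapped",
    filter($"terms_mapped", t => t.getField("term_type") =!= t.getField("keyword_type")))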
Hi @mkarmona and @d0choa
Thank you very much, this is working so well. From our analysis today, here is our feedback:
Example considered: PM_ID: PMC1240767
Option_1: We cannot compare the original terms with the normalised terms as they are different, e.g., APLP-2 vs aplp2 and Alzheimer disease vs alzheimerdisease.
Option_2: It's nice to see the drug filter added to the JSON stream. However, it has the same drawback as Option_1, i.e., the raw terms are missing.
Option_3: This format solves the aforementioned issues. The raw terms (original terms) are mapped to the normalised ones, making it easier to compare and link on our end. However, it would be nice if you could add the full list of entities that were originally received by the API; this will enable us to do some validations later on.
So, to summarise: as @mkarmona suggested, Option_3 works well for both your team and ours. We would ask you to add the originally received entities as well, just to make validation easier during post-processing.
Great @tsantosh7. Thanks. In the latest iteration I commented on, I solved the issue you mention, so your initially provided terms are there.
Thank you @mkarmona. Perfect!!
Hi @mkarmona
cc'd: @aravindvenkatesan
Thank you for the amazing stuff. I have a few production-level questions:
Thank you!
Hi @tsantosh7 and @aravindvenkatesan, thanks! Firstly,
it is currently an Apache Spark script that runs anywhere (the requirements are really few), and it is currently in alpha status.
The way it works, it takes two inputs, the location of the lookup tables and the location of the input files to transform (the latter being the one you generate, @tsantosh7), and you specify an output folder too. The output format is a dataset composed of multiple files in a folder, where each file contains serialised JSON lines; that way you can further consume this output for post-processing at scale.
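To make that concrete, here is a minimal skeleton of the flow; the object name, argument handling and the join step are placeholders of mine, not the actual script:

import org.apache.spark.sql.SparkSession

// Hypothetical skeleton: lookup tables + input JSONL in, JSON-lines dataset out
object MapTermsSketch {
  def main(args: Array[String]): Unit = {
    val Array(lookupDir, inputDir, outputDir) = args

    val spark = SparkSession.builder().appName("map-terms").getOrCreate()

    val lookups = spark.read.json(lookupDir) // dictionaries to ground against
    val terms = spark.read.json(inputDir)    // the JSONL generated upstream

    // ... join terms against lookups, derive *_mapped / *_not_mapped fields ...

    // Spark writes a folder of part-files, each containing serialised JSON lines
    terms.write.json(outputDir)
  }
}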
Regarding the output you are generating to feed this script, this comment clarifies the point further.
Regarding your question about where to put the dataset you produce: temporarily, it could be anywhere that can be accessed for download, so we can ensure everything works properly. You could potentially upload it to one of our GCS buckets.
@mkarmona Thank you; please give me a week's time to deal with some practical issues.
@tsantosh7 sure! Please tell me if I can help with any scalability or ETL-related issue. You have a pair of hands here if needed.
@mkarmona and @d0choa Sorry for getting back late. I am still working on extracting the drugs and chemicals from all the data; now that I have the datasets created for it, I will try to develop the model today. On the other hand, I have been thinking about the integration of @mkarmona's normalisation program into this architecture. Previously I thought it was a REST API service, but it seems different from what I had planned. Nevertheless, I feel this is good in the sense that it is faster than calling it as a service. In addition, I am also concerned about accuracy. For instance, the normalised terms do not preserve any punctuation. I would like to know more about the normalisation process, please.
If you are available, @mkarmona, could we have a brief meeting on Tuesday? I am available anytime from 9 to 5:30.
@tsantosh7 yes, happy to meet on Tuesday. Please place a GMeet on the GCalendar and count me in. Regarding:
"I am also concerned about accuracy. For instance, the normalised terms do not preserve any punctuation. I would like to know more about the normalisation process."
Can you please include some examples of this for us to discuss? I mean, the transformations I do are quite minimal and try to keep the nature of the word, but things like uppercase, spaces, trailing punctuation or excessive parentheses stop the pure string match from working. I am not currently implementing any fuzzy/similarity matching algorithm, as I don't want to put another variable into the system.
The code that cleans the term is quite simple, though:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lower, translate, trim}

// trim whitespace and surrounding dots, delete brackets, braces, parentheses,
// apostrophes, hyphens and spaces, then lowercase
def normalise(c: Column): Column =
  lower(translate(trim(trim(c), "."), "[]{}()'- ", ""))
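Applied to terms from the samples above, that gives for instance "NF-Kappa B." -> "nfkappab", "IgE antibody" -> "igeantibody" and "APLP-2" -> "aplp2". A quick check (assuming a DataFrame df with a term_raw column):

val withNorm = df.withColumn("term_norm", normalise($"term_raw"))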
@tsantosh7 do you think it makes sense to remove plurals in order to match a disease?
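One naive option, if we went that way (just a sketch, not implemented), would be a trailing-"s" rule on top of normalise, so that e.g. "infections" would match "infection":

import org.apache.spark.sql.functions.regexp_replace

// Hypothetical: drop a single trailing "s" after normalising; a real
// depluraliser would need irregular forms ("-ies", "viruses", etc.)
def normaliseSingular(c: Column): Column =
  regexp_replace(normalise(c), "s$", "")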
@tsantosh7 it would be great if you could generate a full set without drugs, so I can check performance and accuracy on my side at least. The more iterations we do, the better the integration and the sooner we catch potential problems.
@mkarmona Thanks for this. I will generate the set without drugs first; you will have it by Monday.
@mkarmona Please find the GP and DS tags from all the articles: https://drive.google.com/file/d/1IpOK5hj1DnWdr3ycrs_ztnOOOgAqGtWN/view?usp=sharing
Thanks, @tsantosh7! I already mapped them. It took a couple of minutes on my laptop, but it generates 14GB in total, so I am compressing the results and uploading them to Drive to share with you. Unfortunately, it will take more time to upload than it took to compute.
That was pretty quick @mkarmona. I will try to send you drugs as early as possible.
@tsantosh7 here is the first mapping pass
Thank you @mkarmona
@aravindvenkatesan @tsantosh7 here are some counts from the resulting mappings
A problem with "intrinsic" being picked up by ML and grounded by this pipeline, due to the word "intrinsic" showing up as a synonym of the disease, has been reported to MONDO.
@d0choa There was no ML filtering for disease, so it was purely the dictionary output; ML tagged it as a whole, "urethral intrinsic sphincter deficiency". We could discuss this in our monthly meeting.
By the way, @mkarmona's normalisation is working well. This opens up our analysis to introducing ML-based annotations into the pipeline as well, which would then be normalised using @mkarmona's script. Very exciting stuff!!
@tsantosh7 thanks. The moment I have drugs incorporated, I will also commit a few improvements such as plural and Greek-letter transformations. These two fixes will bring in a few new terms that are detected many times across the literature.
@mkarmona cheers, I am on it.
@mkarmona and @d0choa, as discussed, Here are the abstracts with GP, DS and CD tags https://drive.google.com/file/d/1OyvzMk971iX9d5Ohx_tSCEMgBqeUyWPb/view?usp=sharing
Something like this:
{
"pmid": "PMID31911178",
"GP": [
"NR2B",
"Bcl-2",
"Bax",
"p38"
],
"DS": [
"MMPs",
"depression"
],
"CD": [
"ferulate",
"glutamate",
"Coniferyl ferulate",
"lactate",
"malondialdehyde"
]
}
Another example
{
"pmid": "PMID31911267",
"GP": [
"TNF-α",
"IL-6",
"IL-1β"
],
"DS": [
"neurodegenerative diseases"
],
"CD": [
"princepin",
"americanin B",
"sesquineolignans",
"diverniciasin B",
"dineolignans",
"isoprincepin",
"isoamericanin A",
"phenylpropanoids",
"neolignans",
"diverniciasin C",
"isodiverniciasin A",
"monolignan",
"steroid"
]
}
The model was not scaling to full-text articles due to a size issue. I have just understood how to solve this bug, so I will fix it and try to send you the full-text tags close to the end of play today.
Known limitations with CD tags: due to a tokenising issue during training, if there is a special character within a term, it is sometimes treated as two terms. This is happening with '(' and '-'. I will need to fix this, but I will do it next week and regenerate these tags.
@mkarmona let me know how you find the CDs.
Best wishes, Santosh
@tsantosh7 nice improvement!
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"drugs_mapped") > 0).count
res22: Long = 1233637L
@ terms.filter(size($"targets_mapped") === 0 and size($"diseases_mapped") > 0 and size($"drugs_mapped") > 0).count
res23: Long = 235435L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") === 0 and size($"drugs_mapped") > 0).count
res24: Long = 585695L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0).count
res25: Long = 2996961L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"terms_not_mapped") === 0).count
res26: Long = 1812479L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"terms_not_mapped") > 0).count
res27: Long = 1184482L
@mkarmona Just to update you: the program is still running on the full-text articles. It's been 66 hours and only 60% of the articles have been processed. I will try to get it to you by tomorrow, hoping they finish processing.
@d0choa I learnt some lessons during this; I can talk about them during our meeting this month.
Please @tsantosh7 can you reply with a simple list of your terms and what those stand for?
CD, GP, DS, FN....
@mkarmona Sure:
CD: Chemical/Drug
GP: Gene/Protein
DS: Disease
FN: False negatives (the ones that are missed by dictionaries)
What about DL?
We can iterate, but we will eventually need to run the CD model on all sentences (not only GP- or DS-tagged sentences). It's the only way we can list the articles that mention a particular drug/chemical, whether in isolation or in the context of other entities.
@AsierGonzalez DL is deep learning. I have now replaced it with CD to make it specific to the context here.
@d0choa Thank you. I totally understand. It takes a lot of extra time, but it's nonetheless doable, so I will try to run it on all the sentences for this iteration too. For our next iteration, it is best that we produce 1) one standard model by fusing the CD dataset with GP and DS, and 2) a lighter version of the current DL model.
@mkarmona it took more than 5 days to process 90% of the full text. Here is the file: https://drive.google.com/file/d/10k4q1t0ZxOusZhe3fYmyQW7mTjOMp8-Y/view?usp=sharing
Kindly run your analysis on this while I investigate ways to make the model run faster.
Here is the time taken to process all the full text: https://drive.google.com/file/d/1tleKve8giDOIJWCsC_9DzAsjLCXFS9Uc/view?usp=sharing
@tsantosh7 here is the result of the full-papers run and some conditional counts
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"drugs_mapped") > 0).count
res3: Long = 949540L
@ terms.filter(size($"targets_mapped") === 0 and size($"diseases_mapped") > 0 and size($"drugs_mapped") > 0).count
res4: Long = 318032L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") === 0 and size($"drugs_mapped") > 0).count
res5: Long = 24724L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0).count
res6: Long = 1092721L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"terms_not_mapped") === 0).count
res7: Long = 94244L
@ terms.filter(size($"targets_mapped") > 0 and size($"diseases_mapped") > 0 and size($"terms_not_mapped") > 0).count
res8: Long = 998477L
@ terms.count
res9: Long = 1929399L
@mkarmona would you have time next week (Monday) for a call with me and Aravind?
Please generate a script that takes in a JSON series of targets (genes/proteins) and diseases and provides a database ID reference (grounding/normalisation) utilising the OTAR APIs.
The input for the script can be found at https://drive.google.com/file/d/1jx9hNCIE7hsnJAlmq3W-xEvolv2taggZ/view?usp=sharing
discarded format: https://drive.google.com/open?id=1NdVamYvHg32C6tvuaHTiFsSXas0Qh4vO
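For reference, grounding a single raw term via today's public Platform GraphQL endpoint could look roughly like the sketch below. The endpoint and query shape are taken from the current API; what was available at the time of this thread may have differed:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Sketch: ask the Open Targets Platform search for grounded entity IDs
val payload =
  """{"query":"query($q:String!){search(queryString:$q){hits{id name entity}}}","variables":{"q":"asthma"}}"""

val request = HttpRequest.newBuilder()
  .uri(URI.create("https://api.platform.opentargets.org/api/v4/graphql"))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(payload))
  .build()

val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
println(response.body()) // hits carry grounded IDs, e.g. EFO_0000270 for asthma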