Hi Laura,
Thanks for your questions! If you run, for example:
from indra.sources import eidos
ep = eidos.process_json_file('...')
stmts = ep.statements
then stmts will be a list of "raw" Statements, each of which will have a single piece of evidence in its evidence list. At this point, two Statements can be duplicates of each other (i.e., the same causal relationship extracted from multiple sentences/documents) and their beliefs are not yet determined.
You have to call INDRA's assembly modules to do further processing. These are all documented; for instance, the Preassembler (which deals with de-duplication and finding hierarchical refinements) is documented here: https://indra.readthedocs.io/en/latest/modules/preassembler/preassembler.html. As a user, probably the easiest way to run INDRA assembly is through the functional interface of the indra.tools.assemble_corpus tool (documented at https://indra.readthedocs.io/en/latest/modules/tools/index.html#module-indra.tools.assemble_corpus). The assemble_corpus tool gives you access to various assembly functions and filters over a list of Statements. Continuing the example above, you could do:
from indra.tools import assemble_corpus as ac
assembled_stmts = ac.run_preassembly(stmts, **extra_args)
Here, the optional extra_args are settings you can configure to change the way assembly is done (documented here: https://indra.readthedocs.io/en/latest/modules/tools/index.html#indra.tools.assemble_corpus.run_preassembly). At the end of this process, you will have "assembled" Statements in the assembled_stmts list; each of these can have multiple pieces of evidence aggregated in its evidence list, and will have a belief score calculated according to overall support.
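For example, you could inspect the results like this (an illustrative sketch; evidence and belief are standard attributes of INDRA Statements):

for stmt in assembled_stmts:
    # Each assembled Statement aggregates all of its evidences and
    # carries a belief score between 0 and 1
    print(stmt, len(stmt.evidence), stmt.belief)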
As for how beliefs are calculated: for each source (eidos, hume, etc.), a term r^e + s is calculated, where r is the random error rate for the given source, s is the systematic error rate for the given source, and e is the number of evidences from that source. The product of these error terms across all sources becomes the error probability, and one minus this error probability becomes the prior belief.
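To illustrate the arithmetic (a plain-Python sketch, not the actual INDRA implementation; the evidence counts and error rates below are made-up example values):

# Made-up example: evidence counts and (random, systematic) error rates
# per source, following the r^e + s formula described above
evidence_counts = {'eidos': 3, 'hume': 1}
error_rates = {'eidos': (0.3, 0.05), 'hume': (0.35, 0.05)}

error_prob = 1.0
for source, e in evidence_counts.items():
    r, s = error_rates[source]
    error_prob *= r**e + s      # per-source error term
prior_belief = 1 - error_prob   # one minus the product of error terms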
(For simplicity, I did not detail how negative evidences are handled: they are taken into account as evidence for a statement not being true; you can see this in the code.)

After the prior belief calculation, the refinement graph of Statements (given an ontology) is used to propagate these beliefs from more specific to less specific Statements. For instance, S1: "rainfall causes floods" is less specific than S2: "a large increase in rainfall causes a small increase in floods", and so S2's evidences will be propagated to S1's pool of evidences when calculating S1's final belief.
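To make the propagation concrete with the formula above (a hypothetical calculation, not actual INDRA code): suppose S1 has 2 direct evidences and S2 has 3, all from a single source with example error rates r=0.3 and s=0.05. Then S1's final belief is computed over all 5 evidences:

r, s = 0.3, 0.05                        # example error rates for one source
e_s1, e_s2 = 2, 3                       # direct evidences of S1 and of S2
belief_s1 = 1 - (r**(e_s1 + e_s2) + s)  # S2's evidences count toward S1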
To your last question in 2.a: the identity of specific source documents is not taken into account in this default belief scorer. However, the belief calculation is fully configurable and you can implement your own BeliefScorer class according to your own assumptions about a probability model of statement correctness (just subclass the BeliefScorer class, https://github.com/sorgerlab/indra/blob/master/indra/belief/__init__.py#L27, and pass the instance of your scorer as the belief_scorer argument to ac.run_preassembly).
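As a rough sketch of what a custom scorer could look like (the exact methods to override are defined in the BeliefScorer base class linked above; the method name and the scoring logic below are assumptions for illustration, so check the source for the current interface):

from indra.belief import BeliefScorer

class MyScorer(BeliefScorer):
    # Assumed interface: return a probability of correctness for a single
    # Statement; the halving-per-evidence rule here is purely illustrative
    def score_statement(self, stmt, extra_evidence=None):
        n_ev = len(stmt.evidence)
        return 1 - 0.5 ** n_ev

assembled_stmts = ac.run_preassembly(stmts, belief_scorer=MyScorer())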
Regarding the PMC client warning:

from indra.literature import pmc_client
pmc_client.get_xml('4322985')
returns the NXML corresponding to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4322985/. If you provide more specifics on what IDs you are trying to access, I can try to help more.
As a final note, I couldn't fully determine based on your questions what knowledge sources you are using with INDRA (Eidos? or some biology-specific sources?) and what your use case is. We are happy to help more and give more specific advice tailored to your use case if needed.
Hi Benjamin, I really appreciate your fast answer; it is really helpful. To clarify our case: we are using REACH as the source, in the default way. We provided a list of PMIDs and generated the Statements with INDRA-REACH to build a CX network.
Question 1: Solved, thanks!
Question 2: we used only one knowledge source, REACH, so with regard to the sentence "For each source (eidos, hume, etc.) ... from that source", r and s are the same (because all evidence is based on REACH). We are mainly puzzled about how the different pieces of evidence enter the calculation. Various strings (evidence sentences) are produced that are deduplicated into one Statement with an overall belief score. Is it then the case that the more strings, the higher the belief score? Or do strings that are less specific have a lower weight when calculating the overall belief score?
Question 3: For example, we get the error mentioned above with these PMIDs, among others: 21317303 and 17478545.
Thank you so much for your help!
Laura
If you are working on a REACH-INDRA-CX pipeline, I suggest running grounding mapping (fixing the grounding of entities that are often misgrounded by REACH and other readers) and sequence mapping (fixing incorrect references to amino acid sites), plus some optional filters, in addition to preassembly. An example assembly pipeline could be:
from indra.tools import assemble_corpus as ac
stmts = <the collection of all raw statements to use>
stmts = ac.filter_no_hypothesis(stmts) # Optional: filter out hypothetical statements
stmts = ac.map_grounding(stmts) # Map grounding
stmts = ac.filter_grounded_only(stmts) # Optional: filter out ungrounded agents
stmts = ac.map_sequence(stmts) # Map sequence
stmts = ac.run_preassembly(stmts,            # Run preassembly
                           return_toplevel=False)
stmts = ac.filter_belief(stmts, 0.8) # Optional: apply belief cutoff of e.g., 0.8
Question 2: Yes, if all your Statements come from a single source then there is a single r and a single s parameter being used. The default built-in values for reach are s=0.05 and r=0.3 (https://github.com/sorgerlab/indra/blob/master/indra/resources/default_belief_probs.json), but you can customize these values in your own assembly. So assuming an assembled Statement has 5 pieces of evidence, the belief would be 1 - (0.3^5 + 0.05). The specific properties of the evidence texts are not taken into account in the default built-in belief model, but again, you can define your own belief scorer if you'd like to change that.
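As a quick check of that arithmetic (plain Python):

r, s, e = 0.3, 0.05, 5
belief = 1 - (r**e + s)   # 1 - (0.00243 + 0.05) = 0.94757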
Question 3: I confirmed that these two IDs correspond to papers for which PMC does not provide NXML content; therefore, the "cannotDisseminateFormat" error is expected in these cases. You may need to use the downloadable dumps of PMC to get access to some of the content that isn't available through the web service.
Dear Benjamin, first of all, I hope you had a nice Christmas. We are really grateful for your help. We took a few days to answer so we could process everything and try to figure things out by ourselves before asking again. Our next question is regarding the enrichment of the network. I tried to reproduce the exact command from the INDRA tutorial:
from indra.sources import bel
bel_processor = bel.process_pybel_neighborhood(['KRAS', 'BRAF'])
But it gives me the following error that I couldn’t figure out:
TypeError Traceback (most recent call last)
The error was due to a deprecated gene name being replaced by two new gene names by HGNC - I fixed the code to handle this corner case. I also added a test to make sure this example is monitored for any such future changes. Please open another issue if you run into other problems!
My team encountered a couple of problems and unclear concepts during our work, and we would like feedback to help us understand them better.
1. INDRA Statements: we could not find information on how de-duplication of Statements is done. Every Statement is supposed to have a list of evidence objects (i.e., a list of sentences from the literature?) but we are only able to retrieve one object for each Statement. So what happened to the rest? Does the algorithm pick one? Is a new sentence built from an ensemble of all of the evidence objects?
2. Belief scores: we were not able to unravel how belief scores are calculated. a. What we know: the prior probability of each Statement is calculated based on the number of evidences it has and their sources. What we want to know: how exactly does this work? What happens if two or more of the retrieved evidences come from the same source (e.g., they are citations from the same article)? b. How are the belief scores calculated, and what is their relation to the prior probability (beyond what the INDRA documentation explains, since it is not clear to us from there)?
3. We could not solve the WARNING message "indra.literature.pmc_client - PMC client returned with error cannotDisseminateFormat: The meta data format 'pmc' is not supported by the item or by the repository".
Thank you very much in advance for your help!
Laura