Open complexbrains opened 2 years ago
Might be solved by @FrancoisPgm's PR. @complexbrains, feel free to close this after testing 😉
@complexbrains here are some suggestions:
- def main(args): becomes something like def co_occ_table and only counts (paper by paper) occurrences of all words in that vocabulary + their sum (column called "sum_selected_terms") + the total number of words in that paper (column called "total_#_words")
- the label (in the column is_clinical) is then assigned from "total_#_words" and the column "sum_selected_terms": a paper is labelled clinical if the percentage of clinical words is >= 10%, i.e. "sum_selected_terms" / "total_#_words" * 100 >= 10
- rename filter_clinical2.py to papers_stats.py to reflect both usages
And we probably want to remove this to avoid misunderstandings: https://github.com/neurocausal/neurocausal_data/blob/main/clinical_filter_keywords.csv
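For illustration, the counting and labelling steps suggested above could be sketched roughly as follows. This is not the code in the repository: the function label_clinical, the paper_id key, the tokenizer and the dict-based table are assumptions, and the vocabulary is assumed to hold single lowercase words. The 10% rule is read here as "sum_selected_terms" as a share of "total_#_words".

```python
import re
from collections import Counter


def co_occ_table(papers, vocabulary):
    """Count, paper by paper, how often each vocabulary word occurs.

    `papers` maps a (hypothetical) paper id to its plain text.
    Returns one row (dict) per paper.
    """
    rows = []
    for paper_id, text in papers.items():
        # Assumed tokenizer: lowercase alphanumeric runs.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts = Counter(tokens)
        row = {"paper_id": paper_id}
        for word in vocabulary:
            row[word] = counts[word]
        row["sum_selected_terms"] = sum(counts[w] for w in vocabulary)
        row["total_#_words"] = len(tokens)
        rows.append(row)
    return rows


def label_clinical(rows, threshold_percent=10.0):
    """Add an is_clinical flag: True if selected terms make up >= threshold."""
    for row in rows:
        total = row["total_#_words"]
        percent = 100.0 * row["sum_selected_terms"] / total if total else 0.0
        row["is_clinical"] = percent >= threshold_percent
    return rows


if __name__ == "__main__":
    demo = {"p1": "patient stroke lesion in the left cortex after a fall today"}
    for row in label_clinical(co_occ_table(demo, ["patient", "stroke", "lesion"])):
        print(row["paper_id"], row["sum_selected_terms"], row["is_clinical"])
```

Keeping the counting separate from the labelling means the same table can be reused for statistics as well as for filtering, which seems to be the point of the rename suggested above.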
@complexbrains is the very latest version of the filtering code the one uploaded here? If not, could you make sure it is?
Possible improvements to restrict to human clinical usable data from Pedro:
For the next filter, consider counting the words “brain, cortex, subcortical” in the text and their proportion relative to the other terms. If the proportion is too low, e.g. 1/100, we exclude the paper because it will not contain any brain information.
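A minimal sketch of that brain-term check, under stated assumptions: the proportion is taken relative to all words in the paper (the comment leaves the denominator open), a paper at or below the 1/100 cutoff is excluded, and the function names are hypothetical.

```python
import re

BRAIN_TERMS = ("brain", "cortex", "subcortical")


def brain_term_proportion(text, terms=BRAIN_TERMS):
    """Share of tokens in `text` that are one of the brain terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in terms)
    return hits / len(tokens)


def has_brain_content(text, min_proportion=1 / 100):
    """Keep a paper only if its brain-term share exceeds the cutoff.

    Assumption: a share of exactly 1/100 counts as "too low".
    """
    return brain_term_proportion(text) > min_proportion
```

Only exact word forms are matched, so "cortical" or "brains" would not count; whether to add such variants is a decision for the keyword list.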
I have created a draft pull request for this issue which hopefully acts as an initial step to cover exclusion criteria.
Hello,
Just a thought, but I'm not sure if this would be feasible. If a particular paper is on humans, it can certainly mention animal studies in its introduction and/or discussion, and vice versa. So how about scanning specifically the methods and results sections of papers and eliminating papers that have the exclusion criteria terms (i.e. "mouse", "rat", "optogenetics" etc.) in the methods and results sections only?
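To make the idea concrete, here is a rough sketch. It assumes each paper is available as plain text with section headings ("Introduction", "Methods", "Results", "Discussion") on their own lines, which may not hold for the actual data; split_sections and is_excluded are hypothetical names.

```python
import re

EXCLUSION_TERMS = {"mouse", "rat", "optogenetics"}
TARGET_SECTIONS = ("methods", "results")

# Assumed layout: a heading is a line containing only the section name.
HEADING = re.compile(
    r"^(introduction|methods|results|discussion)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text):
    """Map each recognised section name to the text that follows its heading."""
    sections = {}
    matches = list(HEADING.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        name = match.group(1).lower()
        sections[name] = sections.get(name, "") + text[match.end():end]
    return sections


def is_excluded(text):
    """True if an exclusion term appears in the methods or results section."""
    sections = split_sections(text)
    for name in TARGET_SECTIONS:
        tokens = set(re.findall(r"[a-z]+", sections.get(name, "").lower()))
        if tokens & EXCLUSION_TERMS:
            return True
    return False
```

Matching whole tokens keeps "rat" from firing on "rate", but it also misses "rats" and "mice", so plural and irregular forms would need to be listed explicitly.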
The process requires filtering the data into a set of papers that are clinically relevant. The data to be filtered consists of two different data sets:
Inherited data: The set of papers downloaded and used in the NeuroQuery project, which contains neuroimaging papers from diverse research domains.
NeuroCausal data: This set of papers was downloaded from PubMed by using specific clinical queries.
The data is structured as:
The filtering process will use the clinical keywords based on the discussions here.