neurocausal / neurocausal_meta

the code base of neurocausal meta-analysis platform
BSD 3-Clause "New" or "Revised" License

Filtering the data to the clinical cohort #11

Open complexbrains opened 2 years ago

complexbrains commented 2 years ago

The process requires filtering the data into a set of papers that are clinically relevant. The data to be filtered consists of two different data sets:

  1. Inherited data: The set of papers downloaded and used in the NeuroQuery project, which contains neuroimaging papers from diverse research domains.

  2. NeuroCausal data: This set of papers was downloaded from PubMed by using specific clinical queries.

The data is structured as:

query-aphasia_neurodegenerative.zip
  └── query-aphasia_neurodegenerative
      ├── articlesets
      │   ├── articleset_00000.xml
      │   └── info.json
      └── articles
          ├── 000
          │   └── pmcid_4382926.xml
          ├── 00a
          │   └── pmcid_8317687.xml
          ├── 00b
          │   └── pmcid_6625472.xml
          ├── ...
          ├── e12
          │   └── pmcid_8832765.xml
          ├── ...
          ├── info.json
          └── subset_allArticles_extractedData
                 ├── authors.csv
                 ├── coordinates.csv
                 ├── info.json
                 ├── metadata.csv
                 └── text.csv
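
As a rough illustration of how the layout above could be traversed, here is a minimal Python sketch that collects every article XML file under `articles/`. The function name `list_article_files` is hypothetical, and the sketch assumes only the `articles/<bucket>/pmcid_<id>.xml` pattern shown in the tree; it is not taken from the project's code.

```python
from pathlib import Path


def list_article_files(query_dir):
    """Return {pmcid: path} for every articles/<bucket>/pmcid_<id>.xml file.

    `query_dir` is the unzipped query folder, e.g. the
    "query-aphasia_neurodegenerative" directory shown in the tree.
    """
    articles_dir = Path(query_dir) / "articles"
    found = {}
    # Only files one level below "articles" that follow the pmcid_*.xml
    # naming are matched, so info.json and the CSV folder are skipped.
    for xml_path in sorted(articles_dir.glob("*/pmcid_*.xml")):
        # File stems look like "pmcid_4382926"; keep the numeric part.
        pmcid = int(xml_path.stem.split("_", 1)[1])
        found[pmcid] = xml_path
    return found
```

Calling `list_article_files("query-aphasia_neurodegenerative")` on an unzipped query folder would then give a dictionary keyed by PMCID that later filtering steps could iterate over.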

The filtering process will use the clinical keywords based on the discussions here.

vborghe commented 2 years ago

Might be solved by @FrancoisPgm's PR. @complexbrains, feel free to close this after testing 😉

vborghe commented 2 years ago

@complexbrains here are some suggestions:

vborghe commented 2 years ago

And we probably want to remove this to avoid misunderstandings: https://github.com/neurocausal/neurocausal_data/blob/main/clinical_filter_keywords.csv

vborghe commented 1 year ago

@complexbrains is the latest version of the filtering code the one uploaded here? If not, could you make sure it is?

Possible improvements from Pedro, to restrict the data to clinically usable human studies:

For the next filter, consider counting the words “brain”, “cortex” and “subcortical” in the text and computing their proportion relative to the other terms. If the proportion is too low, say below 1/100, we exclude the paper, because it is unlikely to contain any brain information.
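
A minimal sketch of this proportion filter, under some assumptions: the proportion is taken relative to all word tokens in the text (one reading of "relative to the other terms"), matching is exact on lowercase words, and 1/100 is only the example threshold mentioned above, not a tuned value. The function names are hypothetical.

```python
import re

# Terms quoted in the suggestion; exact matches only, so variants such as
# "brains" or "cortical" are NOT counted unless added here.
BRAIN_TERMS = {"brain", "cortex", "subcortical"}
# Example threshold from the comment; an assumption, not a tuned value.
MIN_PROPORTION = 1 / 100


def brain_term_proportion(text):
    """Fraction of word tokens in `text` that are one of BRAIN_TERMS."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in BRAIN_TERMS for token in tokens)
    return hits / len(tokens)


def keep_paper(text, min_proportion=MIN_PROPORTION):
    """Keep a paper only if enough of its words are brain terms."""
    return brain_term_proportion(text) >= min_proportion
```

For example, `keep_paper("The cortex and the brain")` returns `True`, since two of the five tokens are brain terms. Whether the count should also include inflected forms is a design choice left open here.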

martinesparza commented 1 year ago

I have created a draft pull request for this issue which hopefully acts as an initial step to cover exclusion criteria.

nazdoganci commented 1 year ago

Hello,

Just a thought, though I'm not sure it would be feasible. A paper on humans can certainly mention animal studies in its introduction and/or discussion, and vice versa. So how about scanning only the methods and results sections of papers, and eliminating those that have the exclusion criteria terms (e.g. "mouse", "rat", "optogenetics") in those sections?
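
One way this idea could look in practice, as a hedged sketch rather than a tested filter: it assumes the article XML marks sections with `<sec>` elements carrying a `<title>` child (as in JATS-style PMC files), and it picks out methods and results sections by looking for "method" or "result" in the title. The term list only repeats the examples given above, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Example terms from the comment; irregular plurals such as "mice"
# would need to be listed explicitly.
EXCLUSION_TERMS = ("mouse", "rat", "optogenetics")
# Assumed substrings that identify methods/results section titles.
SECTION_HINTS = ("method", "result")


def methods_results_text(xml_string):
    """Concatenate the text of <sec> elements whose title looks like
    a methods or results heading."""
    root = ET.fromstring(xml_string)
    chunks = []
    for sec in root.iter("sec"):
        title = sec.find("title")
        title_text = "".join(title.itertext()).lower() if title is not None else ""
        if any(hint in title_text for hint in SECTION_HINTS):
            chunks.append(" ".join(sec.itertext()))
    return " ".join(chunks)


def has_exclusion_term(xml_string):
    """True if any exclusion term occurs in the methods/results text."""
    text = methods_results_text(xml_string).lower()
    # Word boundaries avoid matching "rat" inside "rate" or "separate";
    # the optional "s" also catches simple plurals such as "rats".
    pattern = r"\b(?:" + "|".join(map(re.escape, EXCLUSION_TERMS)) + r")s?\b"
    return re.search(pattern, text) is not None
```

With this approach, a paper that mentions "mouse" only in its introduction would be kept, while one whose "Materials and Methods" section describes rats would be flagged. Papers with unusual section titles would slip through, so the title heuristic would need checking against real files.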