Validation of literature-curated SCI genes

kcmtest commented 4 years ago

Fantastic paper. I was looking for WGCNA analyses for some of my work and found your paper.

I am interested in the literature curation method you used for genes associated with SCI. I see the cleaned data you have submitted: https://raw.githubusercontent.com/skinnider/spinal-cord-injury-elife-2018/master/data/literature_review/Table_S1.txt.

I would be really glad to know how you curated and built this table. Since I work almost entirely in R, it would be really helpful if you could suggest an R-based approach.

What I have done so far is use Europe PMC to scrape and aggregate the data. But how do I make the data uniform, the way you have done in your paper?

skinnider commented 4 years ago

Hi @krushnach80: this data was compiled by a manual literature review, not an automated approach. So all the rows and columns were entered manually. The relevant Methods section from our paper is:

We searched PubMed for articles investigating the molecular pathophysiology of SCI published prior to February 2016, using combinations of ‘spinal cord injury’ and one of ‘proteomics,’ ‘proteome,’ ‘proteomic,’ ‘biomarkers,’ ‘biomarker,’ ‘RNA-seq,’ and ‘microarray’ as search terms. 556 papers were identified that met these criteria. These were subsequently filtered to exclude papers that did not include a valid control group, included exclusively in vitro data, did not include primary data, or examined a tissue other than spinal cord. As previous studies have suggested that small-scale and high-throughput experiments may be largely complementary, or lead to divergent biological conclusions, we considered only small-scale experiments in the literature curation process, defined here as experiments reporting differential regulation of fewer than 100 genes or proteins. Ultimately, data from 67 manuscripts was collected. The original accessions used to identify genes or proteins associated with SCI in each publication were retained. If only the gene name and no unambiguous identifier was noted, the UniProt accession of the gene in the relevant species was manually retrieved. We applied a strict, majority voting-based method to map rat, mouse, and rabbit genes to their human orthologs with maximum accuracy (Li et al., 2017). Specifically, we mapped orthologs from rat, mouse, and rabbit genes to human using seven different ortholog databases [EggNOG (Huerta-Cepas et al., 2016), Ensembl (Kinsella et al., 2011), NCBI Gene (Brown et al., 2015), HomoloGene (Agarwala et al., 2018), InParanoid (Sonnhammer and Östlund, 2015), and OrthoDB (Zdobnov et al., 2017)], and considered human genes as ‘consensus orthologs’ only if they were detected in at least half of those databases containing an entry for the target model organism protein. All genes were mapped to Ensembl identifiers in Bioconductor (Huber et al., 2015).
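As a rough illustration of the majority-voting rule described above, here is a minimal sketch. It is not the code used for the paper. It is written in Python for brevity, and the function name `consensus_orthologs`, the input layout (one set of human gene IDs per database, or `None` when a database has no entry for the query protein), and the gene IDs are all assumptions made for this example. The same logic should port to R without much change.

```python
from collections import Counter


def consensus_orthologs(calls):
    """Return human genes supported by at least half of the covering databases.

    `calls` maps a database name to the set of human gene IDs it reports for
    one model-organism protein, or to None when the database has no entry
    for that protein (hypothetical layout, assumed for this sketch).
    """
    # Only databases that contain an entry for the query protein get a vote.
    covering = [hits for hits in calls.values() if hits is not None]
    if not covering:
        return set()
    # Count how many covering databases report each human gene.
    votes = Counter(gene for hits in covering for gene in hits)
    # "At least half": 2 * votes >= number of covering databases.
    return {gene for gene, n in votes.items() if 2 * n >= len(covering)}


# Hypothetical calls for a single rat protein; the IDs are placeholders.
example_calls = {
    "EggNOG": {"HUMAN_GENE_A"},
    "Ensembl": {"HUMAN_GENE_A"},
    "HomoloGene": {"HUMAN_GENE_B"},
    "InParanoid": None,  # no entry for this protein, so no vote
    "OrthoDB": {"HUMAN_GENE_A"},
}
consensus = consensus_orthologs(example_calls)
print(sorted(consensus))
```

In this example, four databases cover the protein. `HUMAN_GENE_A` is reported by three of them and is kept, while `HUMAN_GENE_B` is reported by only one and is dropped.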

kcmtest commented 4 years ago

Wow, that's some curation. I will go with your approach. Meanwhile, I would like to ask about the PPI part: how did you parse the interactions from the four interaction databases you have deposited? I mean this data: https://github.com/skinnider/spinal-cord-injury-elife-2018/blob/master/data/PPI/PPI-LCC.txt

skinnider commented 4 years ago

The code to generate that is at R/PPI/analyze-PPI-LCC.R.
