Akusem commented 3 years ago

Hi, I have seen that a lot of species-specific data is stored in the database (notably in pathwayDB, motif, and geneInfo, but also in files in data_go).

Currently there is no information on how a developer can add a new species to iDEP (or to a local instance). For now, as users, we can only add a GMT file manually in the load data tab to get GO enrichment working.

So it would be nice to document how to fully support a new species. While waiting for the documentation to be updated, it would also be appreciated if all the required data could be listed, so we can prepare it.

Akusem commented 3 years ago

Hi, while waiting for further instructions, I have taken a look at the data inside the data92 database to determine which data or files are needed to add a new species.

Below I have written up the file and database structures involved in adding a species (as far as I understand them), for documentation purposes.

However, I have some questions. Mainly, how and where should we find some of this data, notably for the motif and pathwayDB folders?


The data structure is the following, with all of the entries except convertIDs being folders:

├── data_go
│       └ ...
├── geneInfo
│       └ ...
├── motif
│       └ ...
├── pathwayDB
│       └ ...
└── convertIDs.db
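
To check a local copy against this layout, a minimal Python sketch could look like the following. It only uses the entry names from the tree above; the function name `missing_entries` and the data root path are hypothetical, and it does not look inside the folders.

```python
from pathlib import Path

# Entry names taken from the tree above.
EXPECTED_DIRS = ("data_go", "geneInfo", "motif", "pathwayDB")
EXPECTED_FILES = ("convertIDs.db",)


def missing_entries(data_root):
    """Return the expected folders/files that are absent under data_root."""
    root = Path(data_root)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

Calling `missing_entries("path/to/data92")` (placeholder path) returns an empty list when all five entries are present.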

geneInfo folder

Each species has a CSV file named after the species, with the following columns of basic gene information:




data_go folder

It is composed of 5 files, but I don't know the purpose of 3 of them.

These are the 2 Bcell files (BcellGSE71176_p53 & BcellGSE71176_p53_sampleInfo.csv) and the Sailfish gene counts file (GSE37704_sailfish_genecounts.csv).

The 2 others are about KEGG and STRINGdb.

KEGG species

In the file KEGG_species_ID.csv, the headers are:



oanatinus_gene_ensembl,Ornithorhynchus anatinus genes (OANA5),oaa


Phaeodactylum tricornutum is already set up.

Motif folder

It is composed of multiple SQLite databases, two per species, with names ending in: *_ensembl_TF_Info_{6,3}00.db

Each database contains two tables, TF_Information and scores (see the linked TSV files extracted from the acarolinensis database).

TF_Information table


It has the following header:

ID  TF_ID   Family_ID   TSource_ID  Motif_ID    MSource_ID  DBID    TF_Name TF_Species  TF_Status   Family_Name DBDs    DBD_Count   Cutoff  DBID.1  Motif_Type  MSource_Identifier  MSource_Type    MSource_Author  MSource_Year    PMID    MSource_Version TfSource_Name   TfSource_URL    TfSource_Year   TfSource_Month  TfSource_Day    consensus   coreMotif   capital memo    scoreMean   scoreSD nGenes


M0104_1.02___ARID3B T007410_1.02    F039_1.02   TS19_1.02   M0104_1.02  MS29_1.02   ENSG00000179361 ARID3B  Homo_sapiens    I   ARID/BRIGHT ARID    1   0.65    pTH4425 PBM Zoo_01  PBM Weirauch    2014    25215497    NULL    Ensembl 2011    Oct 26  ATATTAATTAA TATTAAT aTATTAATtaa ARID/BRIGHT family Transcription Factor ARID3B, motif:TATTAAT   555.739049773756    76.0830936373455    22100

scores table


The table uses the motif IDs (the ID column of TF_Information) as columns and genes as rows, with a score in each cell. Example:

row_names   M0082_1.02___Tcfap2a    M0083_1.02___Tcfap2b
ENSG00000004059 580.0   588.0
ENSG00000001630 575.0   648.0
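
Because the motif IDs are used as column names and contain dots, they have to be double-quoted in SQL. Here is a minimal Python sketch of reading such a table, assuming the layout shown above. It rebuilds the two example rows in an in-memory database; the column types and the helper names `motif_columns` and `top_genes` are my assumptions, not taken from the real files.

```python
import sqlite3

# In-memory stand-in for one motif database, filled with the example rows above.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE scores (row_names TEXT, '
    '"M0082_1.02___Tcfap2a" REAL, "M0083_1.02___Tcfap2b" REAL)'
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("ENSG00000004059", 580.0, 588.0), ("ENSG00000001630", 575.0, 648.0)],
)


def motif_columns(conn):
    """List the motif columns, i.e. every column except the gene ID column."""
    names = [row[1] for row in conn.execute("PRAGMA table_info(scores)")]
    return [name for name in names if name != "row_names"]


def top_genes(conn, motif, n=1):
    """Return the n highest-scoring (gene, score) pairs for one motif column."""
    if motif not in motif_columns(conn):
        raise ValueError(f"unknown motif column: {motif}")
    # Safe to interpolate: the name was checked against the table's own columns.
    query = f'SELECT row_names, "{motif}" FROM scores ORDER BY "{motif}" DESC LIMIT ?'
    return conn.execute(query, (n,)).fetchall()
```

With these two rows, `top_genes(conn, "M0083_1.02___Tcfap2b")` returns the ENSG00000001630 row with its score of 648.0.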

PathwayDB folder

It is composed of multiple SQLite databases, one per species. Each contains 3 tables: categories, pathway, and pathwayInfo. How are these tables created?

From what I can see, their creation is most likely linked to the R files found in the pathwayDB folder, but I'm not sure how to adapt them for Phaeodactylum tricornutum.
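
Until this is documented, one way to get a template is to dump the schema of an existing species database. The Python sketch below is only an illustration: `describe_tables` is a hypothetical helper, and the demo tables use a placeholder column because I don't know the real columns.

```python
import sqlite3


def describe_tables(conn):
    """Map each table name in the database to the list of its column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return {
        table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }


# Demo on an in-memory database. The table names come from the text above;
# the single "placeholder" column is NOT the real schema.
demo = sqlite3.connect(":memory:")
for table in ("categories", "pathway", "pathwayInfo"):
    demo.execute(f'CREATE TABLE "{table}" (placeholder TEXT)')

schema = describe_tables(demo)
```

For a real file, the connection would be opened with `sqlite3.connect(path_to_species_db)` instead of `":memory:"`.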


convertIDs.db

It is an SQLite database containing 5 tables.

At least 2 of them seem to be involved in adding a species: orgInfo and mapping (see the linked TSV files).



orgInfo table

Contains basic info about the species. The headers are:

ensembl_dataset name    name2   idType  idCode  id  totalGenes  group

Example of a species:

pcoquereli_gene_ensembl Coquerels sifaka genes (Pcoq_1.0)   Coquerels sifaka    ensembl_gene_id ens 1   17884.0 Ensembl




mapping table

The headers are:

id  ens species idType

The species value corresponds to the id column in orgInfo.


AERX01031199    ENSONIG00000013256  30.0    8
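
The link between the two tables can be sketched in Python as follows. This assumes the headers listed above; the column types are my guesses, and the `EXAMPLE_*` mapping row is hypothetical (only the sifaka orgInfo row and the AERX01031199 mapping row come from the data). Note that `group` is an SQL keyword, so it has to be quoted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orgInfo (ensembl_dataset TEXT, name TEXT, name2 TEXT, "
    'idType TEXT, idCode TEXT, id INTEGER, totalGenes REAL, "group" TEXT)'
)
conn.execute("CREATE TABLE mapping (id TEXT, ens TEXT, species INTEGER, idType INTEGER)")

# orgInfo row copied from the example above.
conn.execute(
    "INSERT INTO orgInfo VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("pcoquereli_gene_ensembl", "Coquerels sifaka genes (Pcoq_1.0)",
     "Coquerels sifaka", "ensembl_gene_id", "ens", 1, 17884.0, "Ensembl"),
)
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?, ?, ?)",
    [
        ("EXAMPLE_ID_1", "EXAMPLE_ENS_GENE_1", 1, 8),         # hypothetical row
        ("AERX01031199", "ENSONIG00000013256", 30.0, 8),      # row from the example
    ],
)

# Gene ID mappings that belong to one species, resolved through orgInfo.id.
sifaka_rows = conn.execute(
    "SELECT m.id, m.ens, o.name2 FROM mapping m "
    "JOIN orgInfo o ON o.id = m.species WHERE o.ensembl_dataset = ?",
    ("pcoquereli_gene_ensembl",),
).fetchall()

# Mapping rows whose species has no orgInfo entry (here: species 30).
orphans = conn.execute(
    "SELECT m.id, m.species FROM mapping m "
    "LEFT JOIN orgInfo o ON o.id = m.species WHERE o.id IS NULL"
).fetchall()
```

If this reading is right, adding a species would mean adding one orgInfo row plus mapping rows whose species value is that row's id; the orphan query is a quick consistency check for that.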

gexijin commented 2 years ago

