iDEP-SDSU / idep

Integrated Differential Expression and Pathway analysis
http://ge-lab.org/idep

Document how to add a new species to the database #131

Closed Akusem closed 2 years ago

Akusem commented 3 years ago

Hi, I have seen that a lot of species-specific data is stored in the database (notably in pathwayDB, motif, and geneInfo, but also in files under data_go).

Currently there is no information on how a developer can add a new species to iDEP (or to a local instance). For now, as a user, we can only add a GMT file manually in the load data tab to get GO enrichment working.

So it would be nice to document how to fully support a new species. While waiting for the documentation to be updated, it would also be helpful to have a list of all the data needed, so we can prepare it.

Akusem commented 3 years ago

Hi, while waiting for further instructions, I took a look at the data inside the data92 database to determine which data or files are needed to add a new species.

Below I have written down, for documentation purposes, the file and database structures involved in adding a species (as far as I understand them).

However, I have some questions. Mainly: how and where should we find some of this data, notably for the motif and pathwayDB folders?

DataStructure

The data structure is the following, with all entries except convertIDs.db being folders:

.
├── data_go
│       └ ...
├── geneInfo
│       └ ...
├── motif
│       └ ...
├── pathwayDB
│       └ ...
└── convertIDs.db
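
As a quick sanity check before preparing data for a new species, the layout above can be verified with a few lines of Python. This is only a minimal sketch: the function name `missing_entries` is hypothetical, and the list of expected entries is taken from the tree above rather than from the iDEP code itself.

```python
from pathlib import Path

# Expected entries, copied from the tree above (an assumption, not read from iDEP).
# Everything except convertIDs.db is a folder.
EXPECTED_DIRS = ("data_go", "geneInfo", "motif", "pathwayDB")
EXPECTED_FILES = ("convertIDs.db",)


def missing_entries(root):
    """Return the expected folders and files that are absent under `root`."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_entries("data92")` on a local copy of the database returns an empty list when every folder and the SQLite file are in place.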

geneInfo folder

Each species has a CSV file named after the species, with the following columns of basic gene information:

"ensembl_gene_id","band","chromosome_name","start_position","percentage_gc_content","transcript_count","gene_biotype","genomeSpan","cds_length","transcript_length","FiveUTR","ThreeUTR","nExons","symbol"

example:

"ACYPI000001",NA,"GL349621",2365251,32.18,1,"protein_coding",8930,2922,3794,442,428,12,""
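
When preparing a geneInfo file for a new species, it may help to check that it has the same header and field count as the existing files. The sketch below assumes the column list shown above is complete and in the right order; `check_gene_info` is a hypothetical helper, not part of iDEP.

```python
import csv
import io

# Column names copied from the header shown above.
GENE_INFO_COLUMNS = [
    "ensembl_gene_id", "band", "chromosome_name", "start_position",
    "percentage_gc_content", "transcript_count", "gene_biotype",
    "genomeSpan", "cds_length", "transcript_length",
    "FiveUTR", "ThreeUTR", "nExons", "symbol",
]


def check_gene_info(handle):
    """Return (row_count, problems) for a geneInfo-style CSV file object."""
    reader = csv.reader(handle)
    header = next(reader, None)
    problems = []
    if header != GENE_INFO_COLUMNS:
        problems.append("header does not match the expected columns")
    rows = 0
    for line_no, row in enumerate(reader, start=2):
        rows += 1
        if len(row) != len(GENE_INFO_COLUMNS):
            problems.append(
                f"line {line_no}: expected {len(GENE_INFO_COLUMNS)} fields, got {len(row)}"
            )
    return rows, problems


# The header plus the example row quoted above.
SAMPLE = (
    ",".join(f'"{c}"' for c in GENE_INFO_COLUMNS) + "\n"
    '"ACYPI000001",NA,"GL349621",2365251,32.18,1,"protein_coding",'
    '8930,2922,3794,442,428,12,""\n'
)

rows, problems = check_gene_info(io.StringIO(SAMPLE))
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` wrapper.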

data_go folder

It is composed of 5 files, but I don't know what 3 of them are used for.

These are the 2 Bcell files (BcellGSE71176_p53 & BcellGSE71176_p53_sampleInfo.csv) and the sailfish gene counts file (GSE37704_sailfish_genecounts.csv).

The 2 others are about KEGG and STRINGdb.

KEGG species

In file KEGG_species_ID.csv the headers are:

ensembl_dataset,name,kegg

example:

oanatinus_gene_ensembl,Ornithorhynchus anatinus genes (OANA5),oaa

STRING DB

Phaeodactylum tricornutum is already set up.

Motif folder

It is composed of multiple SQLite databases, 2 per species, with names ending in: *_ensembl_TF_Info_{6,3}00.db

Each database contains two tables, TF_Information and scores; see the linked TSV files extracted from the acarolinensis database.

TF_Information table

motif_TF_Information.csv

It has the following header:

ID  TF_ID   Family_ID   TSource_ID  Motif_ID    MSource_ID  DBID    TF_Name TF_Species  TF_Status   Family_Name DBDs    DBD_Count   Cutoff  DBID.1  Motif_Type  MSource_Identifier  MSource_Type    MSource_Author  MSource_Year    PMID    MSource_Version TfSource_Name   TfSource_URL    TfSource_Year   TfSource_Month  TfSource_Day    consensus   coreMotif   capital memo    scoreMean   scoreSD nGenes

example:

M0104_1.02___ARID3B T007410_1.02    F039_1.02   TS19_1.02   M0104_1.02  MS29_1.02   ENSG00000179361 ARID3B  Homo_sapiens    I   ARID/BRIGHT ARID    1   0.65    pTH4425 PBM Zoo_01  PBM Weirauch    2014    25215497    NULL    Ensembl http://www.ensembl.org/ 2011    Oct 26  ATATTAATTAA TATTAAT aTATTAATtaa ARID/BRIGHT family Transcription Factor ARID3B, motif:TATTAAT   555.739049773756    76.0830936373455    22100

scores table

motif_100_scores.csv

The table uses the motif ID (the ID column of TF_Information) as columns and genes as rows, with a score in each cell. Example:

row_names   M0082_1.02___Tcfap2a    M0083_1.02___Tcfap2b
ENSG00000004059 580.0   588.0
ENSG00000001630 575.0   648.0
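
To illustrate the layout of the scores table, here is a minimal sketch that rebuilds the two example rows above in an in-memory SQLite database and looks up one score. The column types (TEXT, REAL) and the helper names `quote_ident` and `motif_score` are assumptions for illustration; the real databases may be declared differently.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier; motif IDs contain dots, so quoting is required."""
    return '"' + name.replace('"', '""') + '"'


def motif_score(conn, gene_id, motif_id):
    """Return the score of `motif_id` for `gene_id`, or None if the gene is absent."""
    # Check the column exists first: SQLite may otherwise treat an unknown
    # double-quoted name as a string literal instead of raising an error.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(scores)")}
    if motif_id not in columns:
        raise KeyError(motif_id)
    sql = f"SELECT {quote_ident(motif_id)} FROM scores WHERE row_names = ?"
    row = conn.execute(sql, (gene_id,)).fetchone()
    return None if row is None else row[0]


# In-memory copy of the example shown above (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE scores (row_names TEXT, '
    '"M0082_1.02___Tcfap2a" REAL, "M0083_1.02___Tcfap2b" REAL)'
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("ENSG00000004059", 580.0, 588.0), ("ENSG00000001630", 575.0, 648.0)],
)
```

With a real motif database, replacing `":memory:"` with the path to the `.db` file and skipping the `CREATE`/`INSERT` statements should be enough to run the same lookup.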

PathwayDB folder

It is composed of multiple SQLite databases, one per species. Each contains 3 tables: categories, pathway, and pathwayInfo. How are these tables created?

From what I can see, their creation is most likely linked to the R files found in the PathwayDB folder, but I'm not sure how to adapt them for Phaeodactylum tricornutum.
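
Since the columns of these three tables are not listed here, one way to find out what a new species database has to contain is to dump the schema of an existing one. The sketch below uses only the standard `sqlite3` module; `describe_tables` is a hypothetical helper, and the demo tables use a single made-up `placeholder` column because the real columns are unknown to me.

```python
import sqlite3


def describe_tables(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
        schema[table] = [col[1] for col in conn.execute(f"PRAGMA table_info({quoted})")]
    return schema


# Stand-in database: table names from the text above, placeholder columns only.
demo = sqlite3.connect(":memory:")
for name in ("categories", "pathway", "pathwayInfo"):
    demo.execute(f'CREATE TABLE "{name}" (placeholder TEXT)')

schema = describe_tables(demo)
```

Running `describe_tables(sqlite3.connect(path))` against one of the existing pathwayDB files would list the actual columns to reproduce.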

convertIDs.db

It is an SQLite database containing 5 tables.

At least 2 seem to be involved in adding a species: orgInfo and mapping (see the linked TSV files).

orgInfo

convertIDs_orgInfo.csv

Contains basic info about the species. The headers are:

ensembl_dataset name    name2   idType  idCode  id  totalGenes  group

example of species:

pcoquereli_gene_ensembl Coquerels sifaka genes (Pcoq_1.0)   Coquerels sifaka    ensembl_gene_id ens 1   17884.0 Ensembl

mapping

convertIDs_mapping.csv

Header:

id  ens species idType

The species value corresponds to the id column in orgInfo.

Example:

AERX01031199    ENSONIG00000013256  30.0    8
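
The link between the two tables can be sketched as a join on mapping.species = orgInfo.id. The example below is an illustration under assumptions: the column types are guessed, the orgInfo row is the example given above, and the mapping row uses made-up placeholder identifiers (`EXAMPLE_ID`, `EXAMPLE_ENS_GENE`) so that it points at that species. `species_for_id` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names from the headers above; types are assumptions.
# "group" is an SQL keyword, so it has to be quoted.
conn.execute(
    'CREATE TABLE orgInfo (ensembl_dataset TEXT, name TEXT, name2 TEXT, '
    'idType TEXT, idCode TEXT, id INTEGER, totalGenes REAL, "group" TEXT)'
)
conn.execute("CREATE TABLE mapping (id TEXT, ens TEXT, species REAL, idType INTEGER)")

# orgInfo row taken from the example above.
conn.execute(
    "INSERT INTO orgInfo VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("pcoquereli_gene_ensembl", "Coquerels sifaka genes (Pcoq_1.0)",
     "Coquerels sifaka", "ensembl_gene_id", "ens", 1, 17884.0, "Ensembl"),
)
# Placeholder mapping row (not real data) pointing at species id 1.
conn.execute(
    "INSERT INTO mapping VALUES (?, ?, ?, ?)",
    ("EXAMPLE_ID", "EXAMPLE_ENS_GENE", 1.0, 8),
)


def species_for_id(conn, query_id):
    """Return (Ensembl gene ID, species name) pairs matching a user-supplied ID."""
    sql = (
        "SELECT m.ens, o.name2 FROM mapping AS m "
        "JOIN orgInfo AS o ON o.id = m.species "
        "WHERE m.id = ?"
    )
    return conn.execute(sql, (query_id,)).fetchall()
```

If this reading of the schema is right, adding a species would mean adding one orgInfo row with a new id and one mapping row per gene identifier carrying that id in the species column.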

gexijin commented 2 years ago

Sorry for my late response. Jenny is actually working on adding new species. She has some scripts and can help you add a new species if you have pathway data and annotation. She has already added a few species: http://bioinformatics.sdstate.edu/idepc/