Closed Akusem closed 2 years ago
Hi,
While waiting for further instructions, I have taken a look at the data inside the database data92
in order to determine which data or files are needed to add a new species.
Below I have written down, for documentation purposes, the different file/db structures linked to adding a species (from my understanding).
However, I have some questions. Mainly, how and where should we find some of these data, notably for the motif and pathwayDB folders?
The data structure is the following; all entries except convertIDs.db are folders:
.
├── data_go
│ └ ...
├── geneInfo
│ └ ...
├── motif
│ └ ...
├── pathwayDB
│ └ ...
└── convertIDs.db
In geneInfo, each species has a CSV file named after the species, with the following columns of basic gene information:
"ensembl_gene_id","band","chromosome_name","start_position","percentage_gc_content","transcript_count","gene_biotype","genomeSpan","cds_length","transcript_length","FiveUTR","ThreeUTR","nExons","symbol"
example:
"ACYPI000001",NA,"GL349621",2365251,32.18,1,"protein_coding",8930,2922,3794,442,428,12,""
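To prepare such a file for a new species, a quick sanity check of the header and the field count can help. This is only a minimal sketch: the column list is copied from the header above, while the function name `check_gene_info` and the checks themselves are my own assumptions, not something taken from the iDEP code.

```python
import csv
import io

# Column names copied from the geneInfo header shown above.
EXPECTED_COLUMNS = [
    "ensembl_gene_id", "band", "chromosome_name", "start_position",
    "percentage_gc_content", "transcript_count", "gene_biotype",
    "genomeSpan", "cds_length", "transcript_length", "FiveUTR",
    "ThreeUTR", "nExons", "symbol",
]


def check_gene_info(handle):
    """Return (n_rows, problems) for a geneInfo-style CSV file object.

    Hypothetical helper: it only checks the header and the number of
    fields per row, not the values themselves.
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        return 0, ["file is empty"]
    problems = []
    if header != EXPECTED_COLUMNS:
        problems.append("header differs from the expected columns")
    n_rows = 0
    for line_no, row in enumerate(reader, start=2):
        n_rows += 1
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(f"line {line_no}: {len(row)} fields")
    return n_rows, problems


# The example row from above, preceded by the quoted header.
sample = io.StringIO(
    ",".join(f'"{c}"' for c in EXPECTED_COLUMNS) + "\n"
    '"ACYPI000001",NA,"GL349621",2365251,32.18,1,"protein_coding",'
    '8930,2922,3794,442,428,12,""\n'
)
print(check_gene_info(sample))
```

For a real file, the same function can be called on `open(path, newline="")`.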
The data_go folder is composed of 5 files, but I don't know the usage of 3 of them.
These are the 2 Bcell files (BcellGSE71176_p53 and BcellGSE71176_p53_sampleInfo.csv) and the sailfish gene counts file (GSE37704_sailfish_genecounts.csv).
The 2 others are about KEGG and STRINGdb.
KEGG species
In file KEGG_species_ID.csv the headers are:
ensembl_dataset,name,kegg
example:
oanatinus_gene_ensembl,Ornithorhynchus anatinus genes (OANA5),oaa
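As an illustration of how this file could be used, here is a small lookup of the KEGG code for a given Ensembl dataset. It is a sketch under the assumption that the file is a plain comma-separated file with exactly the three headers above; `kegg_code` is a hypothetical helper name.

```python
import csv
import io


def kegg_code(handle, ensembl_dataset):
    """Return the KEGG organism code for an Ensembl dataset, or None."""
    for row in csv.DictReader(handle):
        if row["ensembl_dataset"] == ensembl_dataset:
            return row["kegg"]
    return None


# Header and row copied from the example above.
sample = (
    "ensembl_dataset,name,kegg\n"
    "oanatinus_gene_ensembl,Ornithorhynchus anatinus genes (OANA5),oaa\n"
)
print(kegg_code(io.StringIO(sample), "oanatinus_gene_ensembl"))
```

A species missing from the file returns `None`, which would be the case to handle when adding a new one.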
STRING DB
Phaeodactylum tricornutum is already set up.
The motif folder is composed of multiple SQLite databases, 2 per species, with file names ending in:
*_ensembl_TF_Info_{6,3}00.db
Each db contains two tables, TF_Information and scores (see the linked TSV files extracted from the acarolinensis db).
TF_Information table
It has the following header:
ID TF_ID Family_ID TSource_ID Motif_ID MSource_ID DBID TF_Name TF_Species TF_Status Family_Name DBDs DBD_Count Cutoff DBID.1 Motif_Type MSource_Identifier MSource_Type MSource_Author MSource_Year PMID MSource_Version TfSource_Name TfSource_URL TfSource_Year TfSource_Month TfSource_Day consensus coreMotif capital memo scoreMean scoreSD nGenes
example:
M0104_1.02___ARID3B T007410_1.02 F039_1.02 TS19_1.02 M0104_1.02 MS29_1.02 ENSG00000179361 ARID3B Homo_sapiens I ARID/BRIGHT ARID 1 0.65 pTH4425 PBM Zoo_01 PBM Weirauch 2014 25215497 NULL Ensembl http://www.ensembl.org/ 2011 Oct 26 ATATTAATTAA TATTAAT aTATTAATtaa ARID/BRIGHT family Transcription Factor ARID3B, motif:TATTAAT 555.739049773756 76.0830936373455 22100
scores table
The table uses the motif IDs (the ID column of TF_Information) as columns and genes as rows, with a score in each cell. Example:
row_names M0082_1.02___Tcfap2a M0083_1.02___Tcfap2b
ENSG00000004059 580.0 588.0
ENSG00000001630 575.0 648.0
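To make the layout concrete, the sketch below rebuilds the two example rows in an in-memory SQLite database and asks for the best-scoring gene of one motif. The column types and the helper names (`quote_ident`, `top_genes`) are assumptions on my side. Note that the motif IDs contain dots, so they have to be quoted when used as column names.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQLite identifier (motif IDs contain dots)."""
    return '"' + name.replace('"', '""') + '"'


def top_genes(conn, motif_id, limit=10):
    """Return (gene, score) pairs for one motif column, best score first."""
    col = quote_ident(motif_id)
    sql = (
        f"SELECT row_names, {col} FROM scores "
        f"ORDER BY {col} DESC LIMIT ?"
    )
    return conn.execute(sql, (limit,)).fetchall()


# In-memory stand-in for a *_ensembl_TF_Info_*.db file, filled with the
# two example rows above. Column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (row_names TEXT, "
    '"M0082_1.02___Tcfap2a" REAL, "M0083_1.02___Tcfap2b" REAL)'
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("ENSG00000004059", 580.0, 588.0), ("ENSG00000001630", 575.0, 648.0)],
)
print(top_genes(conn, "M0083_1.02___Tcfap2b", limit=1))
```

With a real file, `sqlite3.connect(path_to_db)` would replace the in-memory connection.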
The pathwayDB folder is composed of multiple SQLite databases, one per species. Each is composed of 3 tables: categories, pathway and pathwayInfo.
How are these tables created?
From what I'm seeing, their creation is most likely linked with the R files found in the PathwayDB folder, but I'm not sure how to adapt them for Phaeodactylum tricornutum.
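Until the creation scripts are documented, the existing databases can at least be inspected to see which columns each table expects. The sketch below lists every table and its columns for any SQLite connection. The in-memory database only mimics the three table names; the `placeholder` column is made up, since I have not listed the real columns here.

```python
import sqlite3


def describe(conn):
    """Return {table_name: [column names]} for an SQLite connection."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    layout = {}
    for table in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
        layout[table] = [
            col[1] for col in conn.execute(f"PRAGMA table_info({quoted})")
        ]
    return layout


# Stand-in database: table names from the text, columns are placeholders.
conn = sqlite3.connect(":memory:")
for name in ("categories", "pathway", "pathwayInfo"):
    conn.execute(f'CREATE TABLE "{name}" (placeholder TEXT)')
print(describe(conn))
```

Running `describe(sqlite3.connect(path_to_db))` on one of the existing species files would give the real layout to reproduce.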
convertIDs.db is an SQLite db containing 5 tables.
At least 2 seem to be linked with adding a species: orgInfo and mapping (see the linked TSV files).
orgInfo
Contains basic info about the species. The headers are:
ensembl_dataset name name2 idType idCode id totalGenes group
example of species:
pcoquereli_gene_ensembl Coquerels sifaka genes (Pcoq_1.0) Coquerels sifaka ensembl_gene_id ens 1 17884.0 Ensembl
mapping
Header:
id ens species idType
The species value corresponds to the id column in orgInfo.
Example:
AERX01031199 ENSONIG00000013256 30.0 8
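The link between the two tables can be sketched as a join on mapping.species = orgInfo.id. The example below uses an in-memory database with the orgInfo row and the mapping row shown above, plus one made-up mapping row (`EXAMPLE_ID`) pointing at species 1 so that the join returns something. Column types and the helper name `to_ensembl` are assumptions.

```python
import sqlite3


def to_ensembl(conn, gene_id):
    """Return (ensembl id, ensembl dataset) pairs for a foreign gene ID."""
    sql = (
        "SELECT m.ens, o.ensembl_dataset "
        "FROM mapping AS m JOIN orgInfo AS o ON o.id = m.species "
        "WHERE m.id = ?"
    )
    return conn.execute(sql, (gene_id,)).fetchall()


conn = sqlite3.connect(":memory:")
# "group" is an SQL keyword, so that column name must be quoted.
conn.executescript(
    """
    CREATE TABLE orgInfo (ensembl_dataset TEXT, name TEXT, name2 TEXT,
                          idType TEXT, idCode TEXT, id INTEGER,
                          totalGenes REAL, "group" TEXT);
    CREATE TABLE mapping (id TEXT, ens TEXT, species REAL, idType INTEGER);
    """
)
conn.execute(
    "INSERT INTO orgInfo VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("pcoquereli_gene_ensembl", "Coquerels sifaka genes (Pcoq_1.0)",
     "Coquerels sifaka", "ensembl_gene_id", "ens", 1, 17884.0, "Ensembl"),
)
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?, ?, ?)",
    [
        ("AERX01031199", "ENSONIG00000013256", 30.0, 8),  # row from above
        ("EXAMPLE_ID", "ENSEXAMPLE00000000001", 1.0, 8),  # made-up row
    ],
)
print(to_ensembl(conn, "EXAMPLE_ID"))
```

The AERX01031199 row returns nothing here, because the sample orgInfo table has no species with id 30. A new species would presumably need both a new orgInfo row and its mapping rows.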
Sorry for my late response. Jenny is actually working on adding new species. She has some scripts and can help you add new species if you have pathway data and annotation. She has added a few species: http://bioinformatics.sdstate.edu/idepc/
Hi, I have seen that a lot of species-specific data is present inside the database (notably in pathwayDB, motif and geneInfo, but also inside files in data_go). Currently, no information is available on how we can add new species to iDEP (or to a local instance) as a developer. For now, as a user, we can only add a GMT file manually in the load data tab to get the GO enrichment working. So it would be nice to document how to fully support a new species. Also, while waiting for an update of the documentation, it would be appreciated if all the data needed could be listed, so we could prepare it.