Change curation from PRo to Ensembl/NCBIgene

shawntanzk commented 3 years ago

Currently, all curation of Allen markers (the expresses column) uses PRo; instead, it should use gene IDs based on whichever marker/gtf file is provided. @shawntanzk to change curation to whichever gene ID is used for that specific taxonomy (probably Ensembl for mouse, NCBIgene for human and marmoset - need to double-check this). @hkir-dev to add NCBIgene to the curie map using identifiers.org patterns.

shawntanzk commented 3 years ago

I will do this on a separate branch based on cross-species - this way we will already have something to show (the PR curation is basically done). I will then make a PR into cross-species, where hopefully there won't be too many clashes.

shawntanzk commented 3 years ago

Will wait to obtain the gene files (Ray is working on this) - once I get them, I will curate using whichever gene IDs they are based on. Will start with mouse based on https://raw.githubusercontent.com/obophenotype/brain_data_standards_ontologies/cross_species/src/templates/ensmusg.tsv

shawntanzk commented 3 years ago

List of names for curiemap (will update this comment as I go along):
Mouse - ensembl:
Human - ensembl:
Marmoset - entrez: (NCBI gene)
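
As a rough illustration of how these prefix names could sit in a curie map, here is a minimal Python sketch. The prefix names come from the list above; the URL bases are an assumption based on identifiers.org resolver patterns, and `CURIE_MAP` and `expand_curie` are hypothetical names, not the pipeline's actual code.

```python
# Hypothetical curie map: prefix names follow the list above; the URL bases
# are an assumption based on identifiers.org resolver patterns.
CURIE_MAP = {
    "ensembl": "https://identifiers.org/ensembl:",
    "entrez": "https://identifiers.org/ncbigene:",
}


def expand_curie(curie, curie_map=CURIE_MAP):
    """Expand 'prefix:local_id' into a full IRI using the curie map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"Not a CURIE: {curie!r}")
    if prefix not in curie_map:
        raise KeyError(f"Unknown prefix: {prefix!r}")
    return curie_map[prefix] + local_id
```

For example, `expand_curie("ensembl:ENSMUSG00000037016")` would return the identifiers.org IRI for that mouse gene ID under these assumed URL bases.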

shawntanzk commented 3 years ago

I will do this on a separate branch based on cross-species

Mouse: https://github.com/obophenotype/brain_data_standards_ontologies/blob/gene-curation/src/patterns/data/default/CCN202002013_class_curation.tsv

shawntanzk commented 3 years ago

I've uploaded a simplified tsv of the gtf containing only the information I think we need. The gtf files are too big to upload, and tsv files generated directly from them are also too big, so I had to remove a whole lot of columns. The tsv files are here: https://github.com/obophenotype/brain_data_standards_ontologies/commit/76c63a801c5f22cc90bf869a62b4fbfc91321641

For transparency's sake (and in case I screwed up the code, especially when removing duplicate terms in marmoset), here is the R code I used to do it: https://github.com/obophenotype/brain_data_standards_ontologies/blob/gene-curation/src/templates/gtf%20to%20tsv.R (it makes no sense if you don't have the gtf files, and of course the directories are set to my desktop).

Ray is working on getting a stable link for the gene files. He is currently using his own Google Drive, but they are working with the institute on getting a paid one on their end because they do not have space at the moment. Once that is done, I will put links to them in this repo too. In the meantime, happy to send them to whoever needs them.

dosumis commented 3 years ago

For our purposes I think it's enough to have gene name and gene ID (and build/version details for the whole file). This can easily be used to build a ROBOT template for producing OWL for the pipeline. See https://raw.githubusercontent.com/obophenotype/brain_data_standards_ontologies/master/src/templates/ensmusg.tsv
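
To make the gene name + gene ID to ROBOT template step concrete, here is a minimal Python sketch. The column layout (`ID`, `TYPE`, `A rdfs:label`) is an assumption and may not match the linked ensmusg.tsv; `write_robot_template` and the `ensembl` prefix default are hypothetical.

```python
import csv
import io


def write_robot_template(genes, prefix="ensembl"):
    """Return ROBOT-template TSV text for (gene_id, gene_name) pairs.

    Assumed layout: a human-readable header row, then a row of ROBOT
    template strings, then one row per gene.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "TYPE", "NAME"])          # header row
    writer.writerow(["ID", "TYPE", "A rdfs:label"])  # ROBOT template strings
    for gene_id, gene_name in genes:
        writer.writerow([f"{prefix}:{gene_id}", "owl:Class", gene_name])
    return out.getvalue()
```

The resulting text could be saved as a tsv and passed to `robot template`; build/version details for the whole file would still need to be recorded separately.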

shawntanzk commented 3 years ago

I think it's enough to have gene name and gene ID

Will rerun the extraction taking only those two columns, remove any duplicates, then put them into a tsv similar to the one you linked.
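
A minimal Python sketch of that extraction, for anyone without the R script to hand. It assumes a standard 9-column gtf whose attribute column carries `gene_id "..."` and `gene_name "..."` entries; `extract_gene_pairs` is a hypothetical helper, not the R code linked above.

```python
import re

# gtf attributes look like: gene_id "ENSG..."; gene_name "NECTIN2";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def extract_gene_pairs(gtf_lines):
    """Return unique (gene_id, gene_name) pairs in first-seen order."""
    seen = set()
    pairs = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed gtf record
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        gene_name = attrs.get("gene_name")
        if not gene_id or not gene_name:
            continue  # records without both values are dropped
        pair = (gene_id, gene_name)
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs
```

Because a gtf repeats the gene attributes on every transcript and exon row, deduplicating on the (ID, name) pair is what shrinks the file to something small enough to upload.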

shawntanzk commented 3 years ago

changes made here: https://github.com/obophenotype/brain_data_standards_ontologies/commit/e7b431323266f8d95065e872716012af1f911bf1

shawntanzk commented 3 years ago

A problem that might occur when automating the NS-Forest to Ensembl/Entrez mapping: some gene names use synonyms. For example, in Inh L1 LAMP5 PVRL2, PVRL2 = NECTIN2, and NECTIN2 is the name that appears in the gtf file. A possible solution (if this ends up being a problem for the automated route) is to push the IDs into something like BioMart and get the synonyms (I think that is possible at least; I remember BioMart having synonyms). For now, since I'm manually curating the Allen markers, I'll search manually.
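
A minimal Python sketch of what an automated fallback could look like, assuming a synonym table is available. The single entry below is the example from this comment; where a full table would come from (BioMart or elsewhere) is an assumption, and `SYNONYMS` and `resolve_symbol` are hypothetical names.

```python
# Old symbol -> current symbol. Only the example from this comment is
# included; a real table would need a proper synonym source (assumption).
SYNONYMS = {
    "PVRL2": "NECTIN2",
}


def resolve_symbol(symbol, gtf_symbols, synonyms=SYNONYMS):
    """Return the symbol as it appears in the gtf, or None if unresolved."""
    if symbol in gtf_symbols:
        return symbol
    current = synonyms.get(symbol)
    if current in gtf_symbols:
        return current
    return None
```

Anything that comes back as `None` would still need a manual search, which is what is being done for the Allen markers at the moment.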

shawntanzk commented 3 years ago

Issues (will continue updating this as I move along):

Note: general trend of pseudogenes not being in the gtf file
Note: MARCHF5 (ENSG00000198060 - not 100% sure what name was used before) got converted to a date in Excel (or maybe R) - have to be careful of this -> have changed it to MARCHF5 to be safe

PVRL2 synonym for NECTIN2 -> left comment on curation file
MIR101-1 (ENSG00000199135) not in gtf file -> I've manually added in the tsv file
Frem2 (ENSMUSG00000037016) not in gtf file -> I've manually added in the tsv file
CD27-AS1 (ENSG00000215039) not in gtf file -> I've manually added in the tsv file
UG0898H09 synonym for NKAIN3 -> left comment on curation file
RPL35AP11 (ENSG00000241103) not in gtf file -> I've manually added in the tsv file
OR5AH1P (ENSG00000268067) not in gtf file -> I've manually added in the tsv file
C4orf26 synonym for ODAPH -> left comment on curation file
ZFPM2-AS1 (ENSG00000251003) not in gtf file -> I've manually added in the tsv file [ZFPM2 is in the gtf file though]
GAPDHP60 (ENSG00000248180) not in gtf file -> I've manually added in the tsv file
FAM150B synonym for ALKAL2 -> left comment on curation file
CARM1P1 (ENSG00000227835) not in gtf file -> I've manually added in the tsv file
LINC00343 (ENSG00000226620) not in gtf file -> I've manually added in the tsv file
C9orf135-AS1 synonym for C9orf135-DT -> left comment on curation file
RNF144A-AS1 (ENSG00000228203) not in gtf file -> I've manually added in the tsv file [RNF144A is in the gtf file though]
FTH1P3 (ENSG00000213453) not in gtf file -> I've manually added in the tsv file
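
A check like the one below could flag these cases before curation rather than during it. This is a minimal Python sketch: the date pattern (e.g. `5-Mar`) is an assumption about how a spreadsheet tool might render a mangled gene name, and `audit_markers` is a hypothetical helper.

```python
import re

# Spreadsheet-style dates such as "5-Mar" or "2-Sep" (assumed pattern).
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def audit_markers(markers, gtf_symbols):
    """Split marker names into found, missing, and date-mangled lists."""
    report = {"found": [], "missing": [], "date_mangled": []}
    for name in markers:
        if DATE_LIKE.match(name):
            report["date_mangled"].append(name)
        elif name in gtf_symbols:
            report["found"].append(name)
        else:
            report["missing"].append(name)
    return report
```

Names in the `missing` list are either absent from the gtf (as with the pseudogenes above) or synonyms of a current symbol, so they still need a manual look.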

dosumis commented 3 years ago

Hmmm - frustrating to see these not lining up.

Part of the spec for taxonomies should be to provide IDs and version info for all genes used in names.

shawntanzk commented 3 years ago

Completed in the gene_curation branch - still requires implementation and checks

shawntanzk commented 3 years ago

implemented

obophenotype / brain_data_standards_ontologies

Change curation from PRo to Ensembl/NCBIgene #178