Closed shawntanzk closed 3 years ago
I will do this on a separate branch based on cross-species. That way we will already have something to show (the PR curation is basically done). I will then make a PR into cross-species, where hopefully there won't be too many clashes.
Will wait to obtain the gene files (Ray is working on this). Once I get them, I will curate using whichever gene IDs they are based on. Will start with mouse, based on https://raw.githubusercontent.com/obophenotype/brain_data_standards_ontologies/cross_species/src/templates/ensmusg.tsv
List of names for the curie map (will update this comment as I go along):
- Mouse - ensembl:
- Human - ensembl:
- Marmoset - entrez: (NCBI gene)
I will do this on a separate branch based on cross-species
I've uploaded simplified TSVs of the GTF files with the information I think we need (removing what we don't). The GTF files are too big to upload, and TSVs exported directly from them are also too big, so I had to remove a lot of columns. The TSV files are here: https://github.com/obophenotype/brain_data_standards_ontologies/commit/76c63a801c5f22cc90bf869a62b4fbfc91321641
For transparency's sake (and in case I screwed up the code, especially when removing duplicate terms in marmoset), here is the R code I used: https://github.com/obophenotype/brain_data_standards_ontologies/blob/gene-curation/src/templates/gtf%20to%20tsv.R (it makes no sense if you don't have the GTF files, and of course the directories are set to my desktop).
Ray is working on getting a stable link for the gene files. He is currently using his own Google Drive, but they are working with the institute on getting a paid one on their end because they do not have space at the moment. Once that is done, I will put links to them in this repo too. In the meantime, I'm happy to send them to whoever needs them.
For our purposes I think it's enough to have gene name and gene ID (plus build/version details for the whole file). This can easily be used to build a ROBOT template for producing OWL for the pipeline. See https://raw.githubusercontent.com/obophenotype/brain_data_standards_ontologies/master/src/templates/ensmusg.tsv
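A minimal sketch of how a gene ID / gene name table could be written out as a ROBOT template. The two-row header (a human-readable row followed by the ROBOT template strings `ID` and `A rdfs:label`) follows ROBOT's template syntax, but the exact column layout here is an assumption and is not copied from ensmusg.tsv. The function name and the `ensembl` prefix are hypothetical.

```python
import csv
import io


def build_robot_template(genes, prefix="ensembl"):
    """Return ROBOT template text (TSV) for (gene_id, gene_name) pairs.

    `prefix` is an assumed CURIE prefix; use whatever the curie map defines.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "NAME"])          # human-readable header row
    writer.writerow(["ID", "A rdfs:label"])  # ROBOT template strings
    for gene_id, gene_name in genes:
        writer.writerow([f"{prefix}:{gene_id}", gene_name])
    return out.getvalue()


if __name__ == "__main__":
    # MARCHF5 / ENSG00000198060 is the gene mentioned later in this thread.
    print(build_robot_template([("ENSG00000198060", "MARCHF5")]), end="")
```

Build and version details for the whole file are not captured here; they would need to go somewhere outside the per-gene rows.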
I think it's enough to have gene name and gene ID
Will rerun the extraction taking only those two columns and removing any duplicates, then put them into a TSV similar to the one you linked.
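The extraction described above could look roughly like this. It is a sketch in Python rather than the R script linked earlier, and it assumes the usual GTF layout: nine tab-separated fields, with `gene_id "..."` and `gene_name "..."` in the attribute column. The function name is hypothetical.

```python
import re

# Matches attribute pairs such as: gene_id "ENSG00000198060";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_pairs_from_gtf(lines):
    """Return unique (gene_id, gene_name) pairs in first-seen order."""
    seen = set()
    pairs = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # Records without a gene_name keep an empty name (assumption).
        pair = (gene_id, attrs.get("gene_name", ""))
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs
```

Because the file is read line by line, passing an open file handle (`gene_pairs_from_gtf(open(path))`) avoids loading a large GTF into memory at once.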
A problem that might occur when automating the NS-Forest to Ensembl/Entrez mapping: some marker names are synonyms of the gene names used in the GTF file. E.g. in Inh L1 LAMP5 PVRL2, PVRL2 = NECTIN2, and NECTIN2 is the name that appears in the GTF file. A possible solution (if this ends up being a problem for the automated route) is to push the IDs into something like BioMart and get the synonyms (I think that is possible at least; I remember BioMart having synonyms). For now, since I'm manually curating the Allen markers, I'll search manually.
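One way the automated lookup could handle this, as a sketch: try the marker name directly, then fall back to a synonym table. The table would have to come from somewhere like a BioMart export; its shape here (old symbol mapped to current symbol) and the function names are assumptions.

```python
def resolve_marker(name, gtf_names, synonyms):
    """Return the symbol used in the GTF for `name`, or None if unresolved."""
    if name in gtf_names:
        return name
    current = synonyms.get(name)
    if current in gtf_names:
        return current
    return None


def split_resolved(markers, gtf_names, synonyms):
    """Split markers into a {marker: gtf_symbol} dict and a list for manual review."""
    resolved, unresolved = {}, []
    for marker in markers:
        symbol = resolve_marker(marker, gtf_names, synonyms)
        if symbol is None:
            unresolved.append(marker)
        else:
            resolved[marker] = symbol
    return resolved, unresolved


if __name__ == "__main__":
    gtf_names = {"LAMP5", "NECTIN2"}
    synonyms = {"PVRL2": "NECTIN2"}  # the example from this comment
    print(split_resolved(["LAMP5", "PVRL2"], gtf_names, synonyms))
```

Anything left in the unresolved list still needs a manual search.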
Issues (will continue updating this as I move along): Note: as a general trend, pseudogenes are not in the GTF file. Note: MARCHF5 (ENSG00000198060; not 100% sure which name was used before) got converted to a date in Excel (or maybe R), so we have to be careful of this. I have changed it to MARCHF5 to be safe.
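A quick check for gene names that have been turned into dates could be run over the TSV before curation. This is a sketch: the patterns cover only a few common mangled forms (such as `5-Mar`, `Mar-05`, or an ISO date) and are assumptions, not an exhaustive list. The function name and the row layout (ID in column 0, name in column 1) are hypothetical.

```python
import re

MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
# Date-like forms that a spreadsheet may substitute for a gene symbol.
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-{MONTH}|{MONTH}-\d{{1,2}}|\d{{4}}-\d{{2}}-\d{{2}})$",
    re.IGNORECASE,
)


def find_date_like(rows, name_column=1):
    """Return (row_index, value) for every name that looks like a date."""
    hits = []
    for index, row in enumerate(rows):
        value = row[name_column].strip()
        if DATE_LIKE.match(value):
            hits.append((index, value))
    return hits
```

Reading the file with every column forced to text (rather than letting the tool guess types) avoids the conversion in the first place; this check only catches values that were already mangled.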
Hmmm - frustrating to see these not lining up.
Part of spec for taxonomies should be to provide IDs and version info for all genes used in names.
Completed in gene_curation branch - still requires implementation and checks
implemented
Currently all curation of Allen markers (the expresses column) uses PRO; instead it should use gene IDs based on whichever marker/GTF file is provided. @shawntanzk to change curation to whichever gene ID is used for that specific taxonomy (probably Ensembl for mouse, NCBIgene for human and marmoset; need to double check this). @hkir-dev to add NCBIgene to the curie map using identifiers.org patterns.
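For illustration, a curie map using identifiers.org patterns might look like the sketch below. The prefix keys (`ensembl`, `ncbigene`) and the helper name are assumptions; the actual keys should match whatever the pipeline's curie map already uses.

```python
# Assumed prefix -> IRI base mapping, following identifiers.org URL patterns.
CURIE_MAP = {
    "ensembl": "http://identifiers.org/ensembl/",
    "ncbigene": "http://identifiers.org/ncbigene/",
}


def expand_curie(curie, curie_map=CURIE_MAP):
    """Expand a CURIE such as 'ensembl:ENSG00000198060' into a full IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in curie_map:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return curie_map[prefix] + local_id


if __name__ == "__main__":
    print(expand_curie("ensembl:ENSG00000198060"))
```

Failing loudly on an unknown prefix makes it obvious when a taxonomy uses an ID scheme that has not been added to the map yet.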