obophenotype / cell-ontology

An ontology of cell types
https://obophenotype.github.io/cell-ontology/
Creative Commons Attribution 4.0 International

[Design pattern] Add NS-forest markers #2397

Open aleixpuigb opened 2 weeks ago

aleixpuigb commented 2 weeks ago

Context: The NS-Forest algorithm identifies minimal combinations of marker genes that can accurately classify and define cell types from single-cell RNA sequencing datasets. We need a design pattern to add these markers to existing cell terms (logically and textually).

Tasks

Comments: Markers are genes, so they shouldn't be PR terms. (EDIT: They should not be PR terms.)

dosumis commented 2 weeks ago

We should follow the design pattern in this paper - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9873614/ - or something very close to it. This will work well with knowledge graphs and corresponds to the pattern used in HuBMAP.

For examples, see PCL and the pattern here: https://github.com/obophenotype/brain_data_standards_ontologies/blob/master/src/patterns/dosdp-patterns/taxonomy_marker_set.yaml

dosumis commented 2 weeks ago

> Markers should be PR terms.

I think this is problematic - we don't strictly know if the proteins are markers - just the transcripts. We need a general policy on how to refer to genes. In this case maybe HGNC IDs are sufficient as lung datasets are human. NCBI or Ensembl also possible. @cmungall @scheuerm - comments?

lubianat commented 2 weeks ago

My 3 cents:

There are also microRNA and splicing variants that are cell-type specific, so maybe in the long run we should have ways to consider that?

I feel almost like the "cell type marker" relation is more about gene annotations than species-neutral cell-type annotations, similar to the way GO does gene curation (https://geneontology.org/docs/go-annotations/) using Evidence Codes for details.

By the way, on Wikidata, we use the "has marker" property to link species-specific cell terms to species-specific terms, but I don't think this design pattern is usable around here.

dosumis commented 2 weeks ago

> There are also microRNA and splicing variants that are cell-type specific, so maybe in the long run we should have ways to consider that?

Yep. And we have cases of lncRNA in NS-Forest markers.

scheuerm commented 2 weeks ago

In the past, we used HGNC IDs, but is there not a true "gene ontology" that we might want to consider? Having said that, we have tried to be explicit that we are referring to gene transcripts when talking about these cell type biomarkers. In theory, we could use Ensembl IDs for these. However, the reference-based alignment methods that we use to produce the count data are not great at distinguishing between alternative splicing isoforms. So, although we do have mappings to Ensembl IDs, I would be concerned about their accuracy.

cmungall commented 2 weeks ago

I think we should try and pick a single ID space per species for genes/proteins. This should be based on bioinformatics considerations. Of course genes, transcripts, proteins and other products are ontologically distinct, but we should use mechanisms such as the predicate to determine which aspect we are referring to. How common is it to need to refer to isoforms? Would a policy of ENSG and then ENST if isoform specificity is required work?

scheuerm commented 2 weeks ago

Until we start to see the use of long-read technologies in the scRNAseq space, reliable splice isoform data will probably not be available. Yes, I think ENSG and then ENST if isoform specificity is required would work.
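
For illustration only, the "ENSG, then ENST if isoform specificity is required" policy discussed above could be sketched roughly as follows. This is a minimal sketch, not an agreed implementation: the function name `marker_id`, its parameters, and the `ensembl:` CURIE prefix are assumptions made for the example, and the ID check covers only unversioned human Ensembl stable IDs (the `ENSG`/`ENST` prefix followed by 11 digits).

```python
import re
from typing import Optional

# Unversioned human Ensembl stable IDs: prefix plus 11 digits.
_ENSG = re.compile(r"^ENSG\d{11}$")
_ENST = re.compile(r"^ENST\d{11}$")


def marker_id(
    gene_id: str,
    transcript_id: Optional[str] = None,
    isoform_specific: bool = False,
) -> str:
    """Return the identifier to record for a marker (hypothetical helper).

    Default to the gene-level ID (ENSG). Use the transcript-level ID (ENST)
    only when the caller states that isoform specificity is required.
    """
    if isoform_specific:
        if transcript_id is None or not _ENST.match(transcript_id):
            raise ValueError("isoform-specific markers need a valid ENST ID")
        # The "ensembl:" prefix is an assumption for this sketch.
        return f"ensembl:{transcript_id}"
    if not _ENSG.match(gene_id):
        raise ValueError(f"not a valid ENSG ID: {gene_id!r}")
    return f"ensembl:{gene_id}"


# ENSG00000141510 is the Ensembl gene ID for human TP53.
print(marker_id("ENSG00000141510"))
```

Note that a transcript ID is ignored unless `isoform_specific` is set, which matches the point above that short-read scRNAseq data rarely supports reliable isoform-level claims.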