Documentation for defining cells in PCL

shawntanzk commented 2 years ago

We need documentation on what goes into PCL (as opposed to CL)

E.g., if there isn't a coherent set of properties and the cell type is grouped just by transcriptomics.

dosumis commented 2 years ago

Criteria from Damien GG / FBbt:

A cluster is not necessarily a cell type. Do not create a new term for each cluster in a scRNAseq study!
Be mindful of the distinction between a cell state and a cell type.
- E.g. “latent hemocytes” and “activated hemocytes” - Both clusters should be annotated as “hemocytes” – no new term is needed.
Only explicit claims should be considered.
- No new term for a newly found cell type that is only casually mentioned in a figure legend.
The claimed new cell type should be characterized by more than a mere transcriptional profile.
- No new term if the only thing known about those cells is that they have a high expression of CG12345.
The claim should ideally be backed by more evidence than just the scRNAseq experiment, e.g.
- Trajectory maps to specific developmental origin.
- Direct evidence for function of cell type
- staining confirms the presence of cells expressing a given marker in tissues & identifies additional distinguishing criteria, e.g. distinct location and morphology (any set of distinguishing markers derived from scRNAseq is likely to stain something)
When in doubt, be conservative.

scheuerm commented 2 years ago

See comments inline below


On Wednesday, January 19, 2022 at 9:08 AM, David Osumi-Sutherland wrote:

Criteria from Damien GG / FBbt:

A cluster is not necessarily a cell type. Do not create a new term for each cluster in a scRNAseq study!
Comment: This is the whole point of having a provisional ontology.

Be mindful of the distinction between a cell state and a cell type.
- E.g. “latent hemocytes” and “activated hemocytes” - Both clusters should be annotated as “hemocytes” – no new term is needed.
Comment: Are there any definitions of cell types and cell states that provide guidance about this distinction? We made a proposal in our paper (PMID: 29322913), but I don’t know if there has been general agreement about these distinctions.

Cell types versus cell states - Our intuition is that there is a distinction between discrete cell types that might be generated as a result of programmed differentiation and more subtle changes in cell states experienced by a given cell type in response to changes in its environment. The challenge is to come up with a coherent and consistent approach for making this distinction. Although new cell types and new cell states reflect phenotypic changes that occur through temporal processes, we propose that the distinction relates to the stability and reversibility of the new cellular phenotype. Thus, the generation of a distinct cell type through the process of programmed differentiation is not only stable but also irreversible under normal circumstances. In contrast, a change in cell state is only stable in a certain environment and is reversible with a change in that environment. As an example, the transition from a naïve to memory T cell is an example of a change in cell type through differentiation, in that it reflects a stable and irreversible change (once you’ve experienced antigen, there’s no going back). In contrast, activating a memory T cell in response to antigen exposure would be considered a change in state, in that once the stimulus has been eliminated, the memory T cell would return back to its initial state. Thus, an activated memory T cell would be considered a change in state of a memory T cell rather than a new cell type.

Only explicit claims should be considered.
- No new term for a newly found cell type that is only casually mentioned in a figure legend.

The claimed new cell type should be characterized by more than a mere transcriptional profile.
- No new term if the only thing known about those cells is that they have a high expression of CG12345.
Comment: That’s not exactly what we are doing, but this is the whole point of having a provisional ontology. I agree that these criteria should be used for migrating from PCL to CL.

The claim should ideally be backed by more evidence than just the scRNAseq experiment, e.g.
- Trajectory maps to specific developmental origin.
- Direct evidence for function of cell type
- staining confirms the presence of cells expressing a given marker in tissues & identifies additional distinguishing criteria, e.g. distinct location and morphology (any set of distinguishing markers derived from scRNAseq is likely to stain something)
Comment: Not sure I agree with this for PCL.

When in doubt, be conservative.
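The stability/reversibility distinction proposed in the excerpt above could be sketched roughly as follows. This is only an illustration: the `PhenotypeChange` class, its field names, and the `classify_change` function are hypothetical, and the two boolean flags are a simplification of what would in practice be a curator's judgement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhenotypeChange:
    """A transition between cellular phenotypes (hypothetical model)."""
    name: str
    stable_without_stimulus: bool  # persists once the triggering environment is gone
    reversible: bool               # the cell can return to its initial phenotype


def classify_change(change: PhenotypeChange) -> str:
    """Stable and irreversible => new cell type; otherwise => change of cell state."""
    if change.stable_without_stimulus and not change.reversible:
        return "cell type"
    return "cell state"


# The two T cell examples from the excerpt, encoded with the assumed flags.
naive_to_memory = PhenotypeChange("naive T cell -> memory T cell", True, False)
memory_activation = PhenotypeChange("memory T cell -> activated memory T cell", False, True)

print(naive_to_memory.name, "=>", classify_change(naive_to_memory))
print(memory_activation.name, "=>", classify_change(memory_activation))
```

Under this reading, only the first transition would justify a new term; the second would be recorded as a state of an existing type.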


shawntanzk commented 2 years ago

@scheuerm - just to provide some context, those criteria are a starting point (based off FBbt) for what we discussed in the CL call on what should be in CL (as opposed to remaining in PCL/criteria for cells in PCL)

gouttegd commented 2 years ago

In the last CL call I said that I will give a concrete example of how the guidelines cited above have been used to decide whether to create terms in FBbt for possible new cell types, so here we are.

From the same paper (Cho et al., 2020), two clusters of previously unknown (at least in Drosophila) cells: one accepted as a new cell type (adipohemocytes) on the basis of their description and the experimental verification (not scRNAseq-based) of their existence, one rejected (GST-rich cells) on the basis that the authors themselves do not seem to believe they are a distinct cell type but rather a differentiation state of a pre-existing cell type (prohemocytes).

cmungall commented 2 years ago

These are great. Given the importance of this overall discussion, I think we should have something prominent on

https://obophenotype.github.io/cell-ontology/

We can provide links to various resources and external slide shows like this one (embedding google slides also works well in gh pages imho)

On Tue, Feb 15, 2022 at 12:21 PM Damien Goutte-Gattat wrote:

In the last CL call I said that I will give a concrete example of how the guidelines cited above have been used to decide whether to create terms in FBbt for possible new cell types, so here we are https://docs.google.com/presentation/d/1_IeHRaHk9NYZdHHEoDrBhKFxdQV5OmgGGQm3iaAlUEg/edit?usp=sharing .

From the same paper (Cho et al., 2020 https://doi.org/10.1038/s41467-020-18135-y), two clusters of previously unknown (at least in Drosophila) cells: one accepted as a new cell type (adipohemocytes) on the basis of their description and the experimental verification (not scRNAseq-based) of their existence, one rejected (GST-rich cells) on the basis that the authors themselves do not seem to believe they are a distinct cell type but rather a differentiation state of a pre-existing cell type (prohemocytes).


lubianat commented 2 years ago

Just adding a thought: if different cell states are outside the scope of the Cell Ontology (or PCL) where should they be organized? I would imagine that concepts like "activated memory T cell" would have an id somewhere.

dosumis commented 2 years ago

There is some scope for cell-states in CL - we already have some for activated immune cells https://www.ebi.ac.uk/ols/ontologies/cl/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCL_0000896

However, we shouldn't be adding terms for things like X cell type but 'cycling'. This is a pretty common annotation in scRNAseq data, but I think it requires some standard for combining CL terms with GO terms indicating state (or in this case, cell cycle phase) as part of curation.
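To make the idea of combining a CL term with a GO term more concrete, here is a minimal sketch of what such a composite annotation record might look like. No such standard exists yet as far as this thread goes; the `CellAnnotation` class and its fields are hypothetical, and the identifiers (CL:0000084 for T cell, GO:0007049 for cell cycle) are illustrative picks that should be checked against the ontologies before any real use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CellAnnotation:
    """Annotation of a cluster: one CL term for the type, optional GO term for the state."""
    cell_type: str                # CL identifier for the cell type
    state: Optional[str] = None   # GO identifier for the state, if any

    def label(self) -> str:
        """Render the annotation as 'CL' or 'CL + GO'."""
        if self.state is None:
            return self.cell_type
        return f"{self.cell_type} + {self.state}"


# Two clusters from the same hypothetical dataset: same cell type, different state.
resting = CellAnnotation("CL:0000084")
cycling = CellAnnotation("CL:0000084", state="GO:0007049")

print(resting.label())
print(cycling.label())
print(resting.cell_type == cycling.cell_type)
```

The point of the design is that both clusters resolve to the same CL term, so no 'cycling X' term is needed in CL or PCL.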

shawntanzk commented 2 years ago

note to self: this should be further looked into for CL update paper

shawntanzk commented 2 years ago

@dosumis @gouttegd @scheuerm @cmungall @lubianat @bvarner-ebi + anyone else who is interested or has thoughts about this, here is my proposal for PCL vs CL (taking from discussions + Damien's set of rules above). As always, feedback is appreciated. I would like this to be refined through input from everyone and used as guidance for curators. This will also be added (in its refined form) to the CL paper. Thanks all!

Note: this is not a static thing. It reflects what I think the criteria should be now, given the current state of things and the discussions in multiple meetings. This can of course change with changes in technologies etc.

Criteria for PCL:

It is a cell type and not a cell state
- E.g. “latent hemocytes” and “activated hemocytes” - Both clusters should be annotated as “hemocytes” – no new term is needed.
The claim is explicit
- No new term for a newly found cell type that is only casually mentioned in a figure legend.
Must always come with direct provenance of the dataset

Criteria for CL:

All the criteria for PCL
The cell type is not defined by clustering
- Note: the cell type might begin with being defined by clustering, but if ground truths about it can depend on categorical assertions that are discovered after the fact, it meets this criterion.
Cell type is not defined just by a single dataset and has additional data from other modalities (e.g. structure, location, function)
When in doubt, be conservative -> add to PCL instead

It belongs in PCL:

If the evidence for existence of the cell type comes from (a single?) single cell (transcriptomics?) dataset with no data from any other modality (e.g. morphology, location)
If the reference for ground truth for the cell type is based on single cell transcriptomics clustering. This claim is typically made based on a consensus clustering combining multiple datasets. This is increasingly the case for large brain cell type datasets where many new cell types are being uncovered by single cell transcriptomics and classical methods have not come close to defining an agreed, comprehensive catalog of cell types.

gouttegd commented 2 years ago

I am not sure about the requirement that a cell type must be found in more than a single dataset to be promoted from PCL to CL. If the other criteria are met (the cell type is not defined solely by clustering, there’s more data about it than transcriptional data), I think this should be enough to make it a “real” cell type.

gouttegd commented 2 years ago

And overall, I think it would be clearer to present the guidelines “top-down”, as in:

First, the CL criteria

your cell type passes them? => add it to CL
it doesn’t? => continue with the below criteria for PCL

Second, the PCL criteria

your cell type passes them? => add it to PCL
it doesn’t? Well, nothing to do, at least from the PCL/CL point of view

To “promote” a cell type from PCL to CL, simply re-evaluate the cell type following the same guidelines: if it now passes the CL criteria, then move it from PCL to CL; if it still doesn’t, leave it in PCL.

shawntanzk commented 2 years ago

@gouttegd how about the change I've made now to: "Cell type is not defined just by a single dataset and has additional data from other modalities (e.g. structure, location, function)" - that way it's clear that the other modality, which is a requirement, is the additional dataset that is needed

gouttegd commented 2 years ago

@shawntanzk That sounds good, yes.

shawntanzk commented 2 years ago

I think it would be clearer to present the guidelines “top-down”

Agree with this, have changed it accordingly, think it looks much better now :) was too much in PCL mindset I think lol Thanks!

dosumis commented 2 years ago

I think the top down approach would be nice if we can make it work, but it may be hard to do so. Here's my attempt to make Shawn's PCL inclusion criteria clearer.

It belongs in PCL:

If the evidence for existence of the cell type comes from (a single?) single cell transcriptomics dataset with no data from any other modality (e.g. morphology, location)
If the reference for ground truth for the cell type is based on single cell transcriptomics clustering. This claim is typically made based on a consensus clustering combining multiple datasets. This is increasingly the case for large brain cell type datasets where many new cell types are being uncovered by single cell transcriptomics and classical methods have not come close to defining an agreed, comprehensive catalog of cell types.

(I think it is important to reference transcriptomics here, although it is possible we will include other techniques in future. In Drosophila we have many cell types defined by clustering based on automated scoring of morphology + location; see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4961245/)

shawntanzk commented 2 years ago

Have updated accordingly based on @dosumis comments.

I wonder if PCL is restricted to transcriptomics datasets? Are there other single cell techniques (e.g. FACS) that would be involved? I have almost no knowledge of that type of stuff. I'm guessing people on the immune cell side might have some opinions, maybe @addiehl?

gouttegd commented 2 years ago

On second thought, a problem with the “top-down” approach (consider CL first, then PCL second) is that it does not make it apparent that the PCL criteria are also CL criteria (a claim that does not meet the requirements for inclusion into PCL cannot meet the requirements for inclusion into CL).

As the guidelines are outlined above, a cell “type” could be included in CL even if it is actually more a cell state than a cell type, provided it is defined by more than a transcriptional profile.

So now I think it might be better to present the guidelines in a “bottom-up” fashion instead (sorry!):

Check the criteria for PCL inclusion (e.g. explicit claim, cell type rather than cell state, etc.).

Does the claimed cell type pass those criteria?

No => Don’t go any further, nothing to be added anywhere.
Yes => Consider the additional criteria below for inclusion to CL

Check the criteria for CL inclusion (e.g. a cell type that is not defined solely by a transcriptional profile, etc.).

Does the claimed cell type pass those additional criteria?

No => Add it to PCL.
Yes => Add it to CL.
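The bottom-up flow above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Claim` fields are hypothetical names for the criteria discussed in this thread (which are themselves still under discussion), and each boolean stands in for a curator's judgement rather than anything computable.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Placement(Enum):
    NONE = "nothing to add"
    PCL = "add to PCL"
    CL = "add to CL"


@dataclass(frozen=True)
class Claim:
    """Evidence summary for a claimed cell type (hypothetical field names)."""
    is_type_not_state: bool           # a cell type rather than a cell state
    is_explicit: bool                 # the authors explicitly claim a new cell type
    has_provenance: bool              # direct provenance of the dataset is recorded
    defined_only_by_clustering: bool  # ground truth is the clustering itself
    has_other_modality: bool          # e.g. structure, location, function


def passes_pcl(claim: Claim) -> bool:
    """Step 1: the PCL criteria, which are also prerequisites for CL."""
    return claim.is_type_not_state and claim.is_explicit and claim.has_provenance


def passes_cl_extras(claim: Claim) -> bool:
    """Step 2: the additional criteria required for CL."""
    return (not claim.defined_only_by_clustering) and claim.has_other_modality


def place(claim: Claim) -> Placement:
    if not passes_pcl(claim):
        return Placement.NONE
    return Placement.CL if passes_cl_extras(claim) else Placement.PCL


# A type known only from clustering goes to PCL.
cluster_only = Claim(True, True, True,
                     defined_only_by_clustering=True, has_other_modality=False)
print(place(cluster_only).value)

# "Promotion" is just re-evaluation once new evidence is available.
promoted = replace(cluster_only,
                   defined_only_by_clustering=False, has_other_modality=True)
print(place(promoted).value)
```

Because `place` always checks the PCL criteria first, a cell state with rich multimodal data still ends up as `Placement.NONE`, which is the property the top-down ordering failed to make apparent.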

shawntanzk commented 2 years ago

@gouttegd I think adding the criterion "All the criteria for PCL" to CL and flipping them around works. The "it belongs in PCL" part can be a separate section, as a kind of note/guidance.

gouttegd commented 2 years ago

Follow-up on this.

According to the discussion above, it seems to me that we roughly agree on what the criteria should be. However, we only have the opinions of those who participated in the discussion. Are there other stakeholders who should weigh in before we move forward with those criteria? Should they be discussed again in a CL meeting? (I’d personally say yes, given those criteria will impact both PCL and CL)
Assuming we and all involved stakeholders agree, where should the guidelines go? CL repo? PCL repo? Elsewhere?
For information, once those criteria are set, there is a plan to include them (or at least mention them) in the CL update paper that is currently in preparation.

scheuerm commented 2 years ago

Sorry, I am just now seeing this discussion chain and would like to review and consider in detail before a decision is made. But I will need a week.

patrick-lloyd-ray commented 2 years ago

Sorry I couldn't make it to the discussion. This looks pretty good and I agree with the general guidelines, but I have a question regarding the dataset requirements. Forgive me if this was already discussed on the call.

[For background, I'm thinking mostly about brain cells.] I agree that a single transcriptomic dataset should not be sufficient for inclusion in CL. However, there are many new multimodal techniques that generate more than one dataset for a single experiment (e.g., patch-seq). If we have patch-seq data from a population of cells, would that be sufficient to include those types directly in CL (provided we have morphology, ephys, and transcriptomic data for the types)? Similar remarks would apply to other techniques that generate data in more than one modality (anatomy, connectivity, et al.). Relatedly, if I sequenced a population of cells from one anatomical region, wouldn't I have data about anatomy and transcriptomics? Would I be able to add them directly to CL based on that information or would they still go through PCL first?

shawntanzk commented 2 years ago

If we have patch-seq data from a population of cells, would that be sufficient to include those types directly in CL (provided we have morphology, ephys, and transcriptomic data for the types)?

I think following the above guidelines, that is a yes (in general). Given that patchseq types are based on categorical assertions (ground truth isn't sequencing - eg L5 ET = layer 5 + ET) - we have done this for BDSO (patchseq types are in CL). However, there are some patchseq types which are a little bit trickier where they seem to only be grouped by transcriptomics (eg multiple morphologies + multiple locations in the cortex), in which case probably PCL. I guess the tricky part here is patchseq technically has its ground truth in scRNAseq still, but the cell type described is categorically asserted. We might have to add some additional guidelines for patchseq/multimodal-omics types?

if I sequenced a population of cells from one anatomical region, wouldn't I have data about anatomy and transcriptomics?

That's a good point, I guess we might have to talk a little about what differentiates sibling cell types then? Need to think of a way to carefully phrase this but basic idea would perhaps be: "If sibling cell types can only be differentiated by differences in clustering alone, it belongs in PCL"
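A rough sketch of how the sibling rule might be checked, assuming each cell type is described as a mapping from modality to value. The function names, the `"cluster"` key, and the example records are all hypothetical; they are not taken from BDSO or any existing curation tool.

```python
from typing import Dict, Set


def distinguishing_modalities(a: Dict[str, str], b: Dict[str, str]) -> Set[str]:
    """Modalities recorded for both siblings whose values differ."""
    return {m for m in a.keys() & b.keys() if a[m] != b[m]}


def differ_only_by_clustering(a: Dict[str, str], b: Dict[str, str],
                              clustering_key: str = "cluster") -> bool:
    """True when clustering is the one and only thing separating the siblings."""
    return distinguishing_modalities(a, b) == {clustering_key}


# Hypothetical sibling types: same location and morphology, different clusters.
sib_1 = {"location": "layer 5", "morphology": "pyramidal", "cluster": "c1"}
sib_2 = {"location": "layer 5", "morphology": "pyramidal", "cluster": "c2"}
# A sibling that also differs in location.
sib_3 = {"location": "layer 6", "morphology": "pyramidal", "cluster": "c3"}

print(differ_only_by_clustering(sib_1, sib_2))  # siblings belong in PCL
print(differ_only_by_clustering(sib_1, sib_3))  # another modality separates them
```

Comparing siblings rather than looking at a type in isolation is what addresses the anatomical-region case: a shared location adds no distinguishing information between siblings sampled from the same region.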

patrick-lloyd-ray commented 2 years ago

We might have to add some additional guidelines for patchseq/multimodal-omics types?

Probably a good idea. I'd be happy to help if needed. I'd imagine BDSO group would want to provide input as well.

"If sibling cell types can only be differentiated by differences in clustering alone, it belongs in PCL"

Seems reasonable to me.

shawntanzk commented 2 years ago

Probably a good idea. I'd be happy to help if needed. I'd imagine BDSO group would want to provide input as well.

Good point - I will add it to the agenda for next ontology call

scheuerm commented 2 years ago

One of the criteria proposed for inclusion in PCL is that it has to refer to a cell type rather than a cell state. But there are already several examples of cell states in CL itself (e.g., http://purl.obolibrary.org/obo/CL_0001043 ). This also requires a clear understanding of the differences between types and states (related to transient reversibility). I have my own opinion about these differences, but I don't think there has been broad agreement by the community.
There seems to be an underlying theme that for some reason evidence for the existence of a cell type derived from the clustering of gene expression data is somehow less reliable/compelling than "categorical assertions". First of all, I'm not sure I understand what is meant by categorical assertions or where these come from. Second, I think that clustering of transcriptomic data can be very strong evidence of the existence of a discrete type. In many ways, I think it is more objective and reproducible. I know that CL cell types were defined differently historically, but that doesn't mean that approaches based on current technology advances aren't equally valid.
I agree that evidence should come from multiple experiments, but would be fine if they were multiple independent scRNAseq experiments.

shawntanzk commented 2 years ago

One of the criteria proposed for inclusion in PCL is that it has to refer to a cell type rather than a cell state. But there are already several examples of cell states in CL itself (e.g., http://purl.obolibrary.org/obo/CL_0001043 ). This also requires a clear understanding of the differences between types and states (related to transient reversibility). I have my own opinion about these differences, but I don't think there has been broad agreement by the community.

I think this has been brought up before, and @addiehl did mention the difficulty of this with immune cells, maybe he can comment on that? I know HongKui has just written some stuff about cell types vs states too. It's on my reading list and I will get to it soon. Pretty sure @gouttegd has read it, maybe he can chime in on this too? I think we still need to gatekeep cell states somehow though. But it is a very valid point and 100% needs some discussion here.

There seems to be an underlying theme that for some reason evidence for the existence of a cell type derived from the clustering of gene expression data is somehow less reliable/compelling than "categorical assertions". First of all, I'm not sure I understand what is meant by categorical assertions or where these come from. Second, I think that clustering of transcriptomic data can be very strong evidence of the existence of a discrete type. In many ways, I think it is more objective and reproducible. I know that CL cell types were defined differently historically, but that doesn't mean that approaches based on current technology advances aren't equally valid.

I'll preface this part by saying that I may or may not know what I'm talking about here, it's my thoughts on this which admittedly might be naive.

I think categorical assertions are sort of yes or no assertions - e.g. does the cell type have pyramidal morphology, or does the cell type project extratelencephalically. Clusterings, however, are based on algorithms and are tricky to define with yes or no assertions. I get that things like NS Forest do help with this by allowing some form of categorical assertion (e.g. does the cell type express this set of markers); however, this can change with algorithms and probably applies to a large majority of the cells but not all.

Don't get me wrong, I do agree that clustering can indeed be very strong evidence for a cell type. However, I think the objective and reproducible part might be more on the annotations - we can annotate cells more objectively and reproducibly with clustering, but I'm not sure about reproducibility in getting the same cell types (e.g. if I get another set of data from another tissue, will these cell types show up again, or would clusterings change accordingly). Consensus transcriptomics definitely does help with this, but I'm guessing given more data, clusterings are likely to change as well?

I think perhaps what I am trying to allude to is that it's not that "categorical assertions" are less reliable/compelling than clusterings, but rather it is harder to define the sort of (what I think of as) "socratic form" (ideal perfect form) [forgive my lack of philosophical knowledge, I'm pretty sure I'm misusing this, might be a case of a little knowledge being a dangerous thing >.<] of the cell type with clustering as ground truth. Whereas with more "traditional" categorical assertions, a "socratic form" of cell type is easier to define - all pyramidal neurons must have pyramidal morphology and all extratelencephalic projecting neurons must project extratelencephalically, whereas with transcriptomically defined cell types, if more data is added or the algorithm changes, the clusterings change, and the "essence" of what makes the cell type that cell type changes, which then brings into question what actually defines that cell type.

Practically, I think it's also that terms that are defined by clusterings are limited in their annotation capabilities too -> if a cell type is defined by clusterings in that set of cells (e.g. L5 ET_1 is equivalent_to 'native cell' and ('has exemplar data' value 'RNAseq 070 - CS202002013_70')) then it should not be annotated on other datasets, which questions their purpose in a reference ontology like CL (CL already groups PCL terms which can be used for querying; this includes the patchseq types that BDS calls "subclass nodes", which I assume will be used more for annotations) -> though this might be a highly BDSO-specific thought -> probably where people working outside the brain stuff can comment

Did that all make sense or is it more like confused ramblings?

dosumis commented 2 years ago

Thanks @shawntanzk - very clear. The only bit I'd question is this:

Practically, I think it's also that terms that are defined by clusterings are limited in their annotation capabilities too -> if a cell type is defined by clusterings in that set of cells (e.g. L5 ET_1 is equivalent_to 'native cell' and ('has exemplar data' value 'RNAseq 070 - CS202002013_70')) then it should not be annotated on other datasets.

It absolutely should be annotated on other datasets, based on projection using one of the various algorithms available. That's the entire point of having reference data. We have a similar situation with several thousand neuron types in fly defined only by reference data => morphology/connectomics.
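As a toy illustration of what "projection" onto reference data means, here is a nearest-centroid sketch. It is a deliberately simple stand-in, not one of the actual label-transfer algorithms used in practice, and the type labels and expression values are made up for the example.

```python
import math
from typing import Dict, List, Sequence


def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
    """Mean expression vector of the reference cells of one type."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def project(cell: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Assign a new cell to the reference type with the closest centroid."""
    return min(centroids, key=lambda label: math.dist(cell, centroids[label]))


# Made-up reference data: two types, two genes, two reference cells each.
reference = {
    "type_A": [[5.0, 0.1], [4.8, 0.3]],
    "type_B": [[0.2, 4.9], [0.1, 5.2]],
}
centroids = {label: centroid(rows) for label, rows in reference.items()}

# A cell from a different (hypothetical) dataset is annotated via the reference.
new_cell = [4.5, 0.5]
print(project(new_cell, centroids))
```

The term stays anchored to its reference dataset, while cells from any other dataset can still be annotated with it through the projection step.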

dosumis commented 2 years ago

I think NS Forest on whole brain does give us the possibility of reliable categorical definitions. It will be interesting to see how this develops with the mouse brain data.

shawntanzk commented 2 years ago

It absolutely should be annotated on other datasets, based on projection using one of the various algorithms available.

ah got it, when writing that I was wondering about how we use it given the equivalent class, but projections onto them make sense. Thanks

dosumis commented 2 years ago

The EquivalentClass axiom is partly there to do work for us - translating the taxonomy hierarchy into a class hierarchy.

gouttegd commented 2 years ago

There seems to be an underlying theme that for some reason evidence for the existence of a cell type derived from the clustering of gene expression data is somehow less reliable/compelling than "categorical assertions".

This might come at least in part from my criteria for inclusion in the Drosophila Anatomy Ontology (which were used as a starting point to define the criteria discussed above).

I have indeed taken a very conservative approach when curating fly scRNAseq papers. To give you an idea why, please consider for example this paper that reports a scRNAseq analysis on fly blood cells. The authors report a dozen different clusters corresponding to plasmatocytes (PL-0, PL-2, PL-3, PL-Rel, PL-vir1, PL-robo2, PL-Pcd, …); should those clusters be used as the basis to create subtypes of plasmatocytes in the Drosophila Anatomy Ontology? I’ll argue that none of them actually describes a cell (sub)type.

For example, if we consider the PL-Rel cluster, this is the full extent of what the authors have to say about those cells:

The PL-Rel cluster includes 12.6% of the total hemocyte population, with more than 100 strong markers (Dataset EV2), most of which are involved in the immune response (Fig 2C). The cluster expresses the main transcription factors of the Imd pathway (i.e., Relish, Rel) and of the Toll pathway (i.e., Dorsal, Dl), Cactus (Cact), and the secreted protein PGRP-SA (Govind, 1999; Valanne et al, 2011; Zhai et al, 2018; Dataset EV2). In addition, it expresses proteins associated with the JNK pathway such as Jra and Puc (Martin-Blanco et al, 1998; Zheng et al, 2017). The qPCR assay on resident and circulating hemocytes indicates that PL-Rel hemocytes are present in both compartments (Fig 3A).

To me, PL-Rel is merely a name put on a cluster, not a cell type. It may correspond to an actual cell type, but I’d require more information before adding that to the Drosophila Anatomy Ontology. Importantly, the authors themselves are not saying those clusters are cell types. They only talk about “clusters” or “cell populations”.

It’s not that I don’t trust scRNAseq to find new cell types. But I consider that a scRNAseq cluster is nothing more than a scRNAseq cluster until and unless the authors make the explicit claim (which they can then back up with whatever data/arguments they have) that it corresponds to a new cell type – if they don’t make that claim, I believe curators and ontology editors should not make it for them.

scheuerm commented 2 years ago

@gouttegd so if the author simply claims they are cell types, then you would be ok with labelling them as cell types? The paragraph you quote from the paper about PL-Rel sounds to me like pretty strong evidence that a unique cell phenotype has been identified, which to me is the main requirement for defining a cell type.

scheuerm commented 2 years ago

@shawntanzk "I think categorical assertions are sort of yes or no assertions - eg does the cell type have pyramidal morphology, "

When someone is evaluating an image of a cell to determine if it has a pyramidal shape, they are actually running a comparative analysis algorithm in their mind, comparing against reference pyramids to judge the extent of similarity and using a subjective threshold to assert yes or no. I would argue that the comparative analysis computational algorithms we are using are more objective and reliable. We still use a threshold, but the approach is much more reproducible.

And I would also argue that I can make these types of assertions - if a cell in the human middle temporal gyrus expresses NDNF and SV2C, then it is definitely an Inh L1 LAMP5 NMBR (yes).

Having said that, I would still like to see the results reproduced in more than one experiment.

shawntanzk commented 2 years ago

I would argue that the comparative analysis computational algorithms we are using are more objective and reliable. We still use a threshold, but the approach is much more reproducible.

Agree with this; however, I still think that's more on the annotation side: someone thinks a cell is pyramidal through evaluation with their eyes and annotates that cell as pyramidal. However, a "socratic form" of pyramidal defines the cell type, and we judge by eye that the cell shares enough similarities with the "socratic form" of pyramidal to be defined as pyramidal. An algorithm probably does the detection part better, but the problem again is that I don't quite know what the "socratic form" of a data-driven cell type is, since when the clustering changes, the cell type loses its "essence", whereas regardless of more data or better computational analytical tools, the "socratic form" of pyramidal is still pyramidal. We might be able to better spot pyramidals, or computer tools might be able to pick up pyramidals better, but the form remains stable.

And I would also argue that I can make these types of assertions - if a cell in the human middle temporal gyrus expresses NDNF and SV2C, then it is definitely an Inh L1 LAMP5 NMBR (yes).

I think this is where @dosumis mentioned that if we have the whole brain with NS forest, there's a case to be made that it can move to CL. I think our worry is that clusterings will change with different data coming in. I do get that with location + NS forest, we can technically define a "socratic form" that would ideally differentiate from any other cell like what you mentioned, but I think a worry is that this "socratic form" changes with clustering changes, which questions the "essence" of the cell type. I wonder if it perhaps might have more to do with stability - I think there is a case to be made that as we learn more about a cell type, we might find that its "essence" changes too, but I do think that more traditionally categorically defined cell types are more stable at this point. However, that is also hard to quantify if we want to have criteria for PCL vs CL.

I think in terms of writing this out - this might be a tricky one, cause even with whole brain mapping, technically with our current criteria, it should still go into PCL (data driven). Though one can argue that markers that define the cell can be counted as categorical assertions (so in terms of the BDS dataset - without NS forest, it would belong to PCL; with NS forest, it is technically allowed in CL). But then the question is why aren't BDSO cell types in CL instead? Some workshopping on the writing is defs needed.

PS apologies for the super long writing again >.<

scheuerm commented 2 years ago

@shawntanzk can you define what you mean by "socratic form"?

shawntanzk commented 2 years ago

@shawntanzk can you define what you mean by "socratic form"?

again gonna preface this by saying my philosophical knowledge is insanely limited, so i might be misusing it (why i keep putting it in quotation marks) but I basically think it as an ideal perfect form (which I think philosophically means it doesn't actually exist but that is not super crucial in the way I use it here). So back to our example on pyramidal - there is an ideal perfect pyramidal, and we call something pyramidal if in essence it matches the ideal perfect pyramidal, but all pyramidal cells are slightly different (even exact cell types - slight angle differences, slight length differences, etc.) hence the "socratic form" being an ideal rather than a real thing, but I digress. I think annotation is when we make a judgement on if it matches close enough to the ideal perfect pyramidal - which I agree with you, transcriptomic clustering is probably better at that, but I think that is a separate issue to defining what the ideal perfect form is.

So in our context, if I'm trying to define an ideal perfect form of the cell type that was defined by the cluster, I feel like it's a bit of a challenge. For example, say the ideal cell type of the cluster expresses genes X and Y. However, when I run the clustering with different data, it says what defines that cluster is actually genes A and B, and also that the cluster is not really that cluster but one of slightly different shape. Then what is actually the ideal perfect form of that cell type, and how do we define it?

Hope that helps and not confuses :)
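
To make the stability worry above concrete, here is a minimal sketch that scores how much of a cluster's marker set survives a re-clustering. The choice of Jaccard overlap is an assumption made for illustration, not something proposed in this thread, and the gene names are the same placeholders (X, Y, A, B) used above.

```python
def marker_overlap(markers_run1, markers_run2):
    """Jaccard overlap between two marker sets (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(markers_run1), set(markers_run2)
    if not a and not b:
        # Two empty marker sets are treated as identical by convention.
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder genes from the comment above: run 1 says X and Y, run 2 says A and B.
overlap = marker_overlap({"X", "Y"}, {"A", "B"})
```

An overlap of zero is the case described above, where nothing of the original definition carries over to the new clustering. Any cut-off for "stable enough" would be a policy decision that this sketch does not try to make.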

patrick-lloyd-ray commented 2 years ago

@shawntanzk: I think the term you are looking for is "Platonic form."

shawntanzk commented 2 years ago

@shawntanzk: I think the term you are looking for is "Platonic form."

omg I always thought it was "Socratic form" cause Plato was writing about discussion or something with Socrates ahahhaha. Thanks @patrick-lloyd-ray :D

scheuerm commented 2 years ago

Thanks guys, that helps.

So, our experience that is relevant to Shawn's description is that we clustered human MTG and got our cell types and marker genes, we clustered human M1 and got our cell types and marker genes, we matched cell type clusters between the two brain regions and found that most of the GABAergic and glial cells match, whereas the glutamatergic cell types did not match (which I find quite interesting). For the GABAergic cells that matched between the two brain regions, the minimum NS-Forest marker sets often differed, but the matched cell types in the two brain regions did express both sets of NS-Forest markers.

So I can say that if a cell in the human middle temporal gyrus expresses NDNF and SV2C, then it is definitely an Inh L1 LAMP5 NMBR (yes), and if it's in M1, it expresses genes X and Y.
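
The kind of region-scoped assertion described here could be sketched as a lookup keyed by species and region. This is only an illustration: the table layout and the function name `assign_cell_type` are hypothetical, the only real entry is the MTG rule quoted above, and the M1 markers are deliberately left out because the thread gives them only as "X and Y".

```python
# Hypothetical marker table. Only the MTG entry is taken from the discussion;
# an M1 entry would need its own marker set, which is not given in this thread.
MARKER_SETS = {
    ("human", "middle temporal gyrus"): {
        "Inh L1 LAMP5 NMBR": {"NDNF", "SV2C"},
    },
}


def assign_cell_type(species, region, expressed_genes):
    """Return the cell types whose full marker set is expressed in this region."""
    candidates = MARKER_SETS.get((species, region), {})
    expressed = set(expressed_genes)
    # A type matches only if every one of its markers is expressed.
    return sorted(name for name, markers in candidates.items()
                  if markers <= expressed)


matches = assign_cell_type("human", "middle temporal gyrus",
                           ["NDNF", "SV2C", "OTHER_GENE"])
```

Keying the rule by region makes explicit that the assertion is only claimed within the tissue where the markers were derived; a lookup for a region with no entry returns no match rather than reusing the MTG rule.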

shawntanzk commented 2 years ago

@scheuerm I agree that that case is one that can be considered for entry to CL (similar to all the BDS work I guess), however, I guess with the knowledge that whole brain work is upcoming, it might make practical sense to keep it in PCL first - we will have a whole new set of NS forest (right?) and that might be a more stable one that will be used moving forward. I think we might have to relook at NS forest equivalence again? thoughts on this @dosumis? I know we don't have equivalence for many CL classes, so an argument can be made that that isn't needed, but I guess when that is the only differentiating factor, we should consider looking into that again. Anyway we might be getting too into specific dataset questions now.

I think then pulling it back to the question of criteria for CL vs PCL - how do we formalise something like that? Do we want to accept data driven cell types into CL if they can be formalised with categorical assertions that can be data driven too?

A practical question that I have been having (not fully about this topic, but I'm putting it here cause it's related) is how CL would deal with the huge numbers of cell types coming in - it would make the ontology insanely difficult to load and work with etc. This shouldn't affect the criteria, but just an open thought/question.

gouttegd commented 2 years ago

so if the author simply claims they are cell types, then you would be ok with labelling them as cell types?

I would at least consider it, though in this particular case I would probably still be reluctant to do it.

The paragraph you quote from the paper about PL-Rel sounds to me like pretty strong evidence that a unique cell phenotype has been identified

I disagree. All they know about cells in the PL-Rel cluster is that they express some combination of genes. How is that a phenotype?

For example, “they express the main transcription factors of the Imd pathway and of the Toll pathway”. Great, so what? Do we have any evidence that the Imd pathway and the Toll pathway are activated in those cells? Any read-out on the genes that are known to be downstream of those pathways?

This is exactly what makes me wary of scRNAseq papers (and I say that as my main job is to curate scRNAseq papers): authors of such papers are sometimes very quick to jump to conclusions on the sole basis of a transcriptomic profile. I’ve seen some papers in which the authors found a cluster whose cells start expressing a glucose transporter upon some conditions, and from that they will conclude that upon those conditions those cells switch to a glucose-enhanced metabolism, even without ever looking at the metabolism of those cells…

To give another example and since we have been talking about brain cells previously: just because a cluster is found to express a gene known to encode a sugar receptor, does not mean that you can conclude that you have found a new type of sugar-sensing neurons. You would first need to show that those cells are actually responsive to sugar. Until you do that, all you have found is a population of cells expressing a gene.

scheuerm commented 2 years ago

Based on the current criteria for migration of terms from PCL to CL being considered, we will need to anticipate many cell types (maybe ~95% of brain cell types) only residing in PCL and never migrating to CL. One concern is that if most cell types are only present in PCL, CL may become irrelevant.

shawntanzk commented 2 years ago

From discussion we had in BDS ontology group:

Having a product with PCL + full CL would be useful -> does this live in CL or PCL?
Clear path for PCL terms to go into CL needs to be noted in CL -> this helps PCL stay relevant, and clear guidelines are good anyway. @tgbugs & @scheuerm might have more to add to this.

tgbugs commented 2 years ago

Could this be an opportunity to start standardizing a process for how the more experimentally derived cell types can make their way into CL proper? There are going to be many e.g. transcriptomic types that it would be quite helpful to have stable identifiers for, and the earlier we can get those types properly identified, the easier it will be to track them as they evolve as more data is acquired from additional modalities.

Providing a way to load various provisional cell type ontologies alongside CL, as well as guidance for how to create a provisional ontology and how to request review of provisional types for promotion (maybe pick a different word?) could be a direct outcome here?

addiehl commented 2 years ago

I do not think that the Cell Ontology will become irrelevant, because it is being used and expanded for many strong projects like HuBMAP that rely on more than just clustering to identify cell types, and because the less granular, but well defined cell types in CL represent the anchor points for PCL cell types to hang off of.

Speaking as an outsider, I see the risk of the proliferation of PCL cell types as being that some, perhaps many, of the potential additions to PCL will be latently redundant with each other, because different analysis algorithms are likely to propose different combinations of marker genes to uniquely identify particular cell types, but for many cell types there may be independent combinations of marker genes that can uniquely identify them. It seems to me that NSForest applied locally is going to pick somewhat different and perhaps shorter lists of markers than NSForest applied across the whole brain. Thus, it will be necessary to identify and reconcile different sets of markers for the same cell type before we can be sure that some or many PCL classes can be safely promoted.
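
One way to picture this reconciliation step is a check that flags two candidate classes as possibly redundant when each cluster expresses the other's marker set, which mirrors the MTG/M1 observation reported earlier in this thread. The sketch below is an assumption-laden illustration: `CandidateClass`, `possibly_redundant`, the labels and the gene names are all hypothetical, and a real comparison would work on expression data rather than plain gene sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateClass:
    label: str
    markers: frozenset    # minimal marker set proposed by one analysis
    expressed: frozenset  # all genes detected as expressed in that cluster


def possibly_redundant(a, b):
    """Flag a pair for review when each cluster expresses the other's markers."""
    return a.markers <= b.expressed and b.markers <= a.expressed


# Hypothetical example: two analyses pick different minimal markers
# for what may be the same cell type.
local = CandidateClass("type from local analysis",
                       frozenset({"X", "Y"}),
                       frozenset({"X", "Y", "A", "B"}))
whole = CandidateClass("type from whole-brain analysis",
                       frozenset({"A", "B"}),
                       frozenset({"X", "Y", "A", "B", "C"}))
flagged = possibly_redundant(local, whole)
```

A positive flag only marks the pair for human review before promotion; it does not by itself show that the two classes are the same cell type.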

addiehl commented 2 years ago

Furthermore, in an anatomically rich structure such as the brain, with a wide range of morphologies of cell types represented, we would really like to have image correlates for clustered cell types.

addiehl commented 2 years ago

[addressing earlier parts of the discussion] The quite limited number of activated T cell types in CL were added per the annotation needs of data curators. We have not proactively added additional activated T cell types because (basically) I (and others) have been opposed to adding cell states. However, with cell clustering data and cell trajectory data, it becomes harder to apply a bright line between cell type and cell state, as an activated state of an identifiable cell type may in fact represent an in-between differentiation step.

For stable cell types, such as neurons or 'endothelial cell[s] of high endothelial venule', cycling between activation states and resting states presumably occurs without change in the fundamental cell type. So perhaps CL should only represent the base cell types, but of course some may argue that we need to have representations of the activated forms somewhere, so data can be appropriately annotated. I am not sure that CL is the worst place to keep such representations, as then at least there will be a connection to the base cell type, but we may want to continue to add such representations only as requested.

cmungall commented 2 years ago

I think it is easiest if dependencies follow subClassOf, and we avoid reciprocal dependencies. So PCL can import some or all of CL. But I can imagine people wanting many possible combinations, e.g. with uberon. The base approach should make this mixing and matching easier.

We should be clear about the migration procedure. I assume that once a PCL class "matures" the PCL ID is obsoleted and replaced_by a CL ID. CL will not "adopt" PCL IDs.
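
The assumed migration procedure can be sketched as follows. This is a toy in-memory model rather than how PCL or CL are actually edited; the `Term` class, the `migrate_to_cl` function and both identifiers in the usage lines are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Term:
    term_id: str
    label: str
    obsolete: bool = False
    replaced_by: Optional[str] = None


def migrate_to_cl(pcl_term, new_cl_id):
    """Obsolete a matured PCL term and point it at a newly minted CL ID."""
    if not pcl_term.term_id.startswith("PCL:"):
        raise ValueError("only PCL terms can be migrated")
    if not new_cl_id.startswith("CL:"):
        # CL does not adopt the PCL ID; the replacement must be a CL ID.
        raise ValueError("replacement must be a CL ID")
    if pcl_term.obsolete:
        raise ValueError("term has already been obsoleted")
    cl_term = Term(term_id=new_cl_id, label=pcl_term.label)
    pcl_term.obsolete = True
    pcl_term.replaced_by = new_cl_id
    return cl_term


# Placeholder identifiers, not real ontology terms.
pcl = Term("PCL:0000000", "example provisional cell type")
cl = migrate_to_cl(pcl, "CL:9999999")
```

Keeping the obsoleted PCL term with a pointer to its replacement means existing annotations against the PCL ID can still be resolved to the CL class, and the dependency stays one-way: CL never needs to reference PCL.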


emquardokus commented 2 years ago

@shawntanzk First, it would be good to articulate how PCL was established: what were the criteria for why it exists at all? Second, who has so far contributed to PCL? Third, establish general criteria for PCL vs CL, as you've mentioned.

shawntanzk commented 2 years ago

@emquardokus, fortunately I've already written out the first and second point for the CL update manuscript :)

PCL is an ontology created to record claims of potential new cell types derived from single cell data analysis, where there is no additional evidence that supports the addition of a corresponding classically defined cell type to the Cell Ontology (CL) (Bakken et al. 2017). PCL hence provides a place for terms that might not be ready for CL. PCL was first created to hold cell type terms for scRNAseq experiments done on the middle temporal gyrus (MTG) (Bakken et al. 2017; Hodge et al. 2019). It has now expanded to also include cell types from the Brain Data Standards Ontology/BICCN (BRAIN Initiative Cell Census Network (BICCN) 2021; Tan et al. 2021), and now uses the ODK (Matentzoglu et al. 2022b). PCL terms are classified using parent terms from CL, and PCL therefore acts as a fully integrable extension to CL.

Tagging @scheuerm who started PCL and might also be able to give better context to anything I missed out in what I wrote (PS happy to edit that for CL manuscript and all too :))
