nolanlab / citrus

Citrus Development Code
GNU General Public License v3.0

Identifying Cell Type of CITRUS Output #94

Closed chethanjjj closed 8 years ago

chethanjjj commented 8 years ago

One of the challenges we have come across with the Citrus output is identifying the cell type from the fluorescence profile, specifically how to decide whether a cluster is positive or negative for a marker. What method did you use to identify the cell types? I read through your supplemental material and see that you discuss the F1 score and how you use it to compare manually identified cell clusters to your automatically identified clusters. Is this your method?

rbruggner commented 8 years ago

The F1 score was used to score how well the clusters produced by hierarchical clustering matched populations identified by manual gating, so I don't think it will be helpful for defining the positive / negative status of a marker.
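For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score between one cluster and one manually gated population could be computed, treating each as a set of event (cell) IDs. The function name and the toy IDs are illustrative assumptions, not code from Citrus or from the paper.

```python
def f1_score(cluster_events, gated_events):
    """F1 between a cluster and a manually gated population (sets of event IDs)."""
    cluster, gated = set(cluster_events), set(gated_events)
    true_pos = len(cluster & gated)        # events in both the cluster and the gate
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(cluster)    # fraction of the cluster inside the gate
    recall = true_pos / len(gated)         # fraction of the gate captured by the cluster
    return 2 * precision * recall / (precision + recall)


# Toy event IDs, purely illustrative.
toy_cluster = {1, 2, 3, 4}
toy_gate = {3, 4, 5, 6, 7, 8}
toy_f1 = f1_score(toy_cluster, toy_gate)
print(toy_f1)
```

The score only says how well cluster membership overlaps a gate; it carries no information about whether a marker is high or low in the cluster.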

Unfortunately, I don't have a good answer for you with respect to calling a population positive or negative for a marker. In my work, many of the markers used to identify typical immunological phenotypes displayed bimodal behavior in identified clusters - or just appeared to have no bias in the cluster at all. I basically ended up looking at the marker distribution and making a call by eye to determine if I thought the marker was important in distinguishing the cluster.

Going forward, I suspect many markers will not be bimodal and will therefore be much more difficult to judge. In theory, what you'd really like to know is whether a particular marker is necessary for identifying a relevant cellular population. If you used Citrus to identify a population of interest (i.e., a population that was predictive of an endpoint of interest), you could rerun Citrus and "remove" markers that appeared to be irrelevant from the clustering channels, then see if you could still identify that predictive cluster. If removing a particular marker resulted in significantly worse results, you might conclude that the removed marker was important for identifying that predictive cluster.
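The marker-removal idea can be sketched as a leave-one-marker-out loop. This is only an outline under stated assumptions: `score_with` is a hypothetical callable that you would write yourself to rerun the analysis on a given list of clustering markers and return some quality score (for example, cross-validated accuracy of the predictive model). It is not a Citrus function, and `toy_score` is a stand-in used only to make the sketch runnable.

```python
def marker_ablation(markers, score_with):
    """Return, for each marker, the drop in score when that marker is left out.

    `score_with` is a user-supplied (hypothetical) function that reruns the
    analysis using only the given clustering markers and returns a number
    where higher is better.
    """
    baseline = score_with(list(markers))
    drops = {}
    for marker in markers:
        reduced = [m for m in markers if m != marker]
        drops[marker] = baseline - score_with(reduced)
    return drops


def toy_score(markers):
    # Stand-in scorer: pretends the result degrades only when CD4 is removed.
    return 0.9 if "CD4" in markers else 0.6


toy_drops = marker_ablation(["CD3", "CD4", "CD45"], toy_score)
print(toy_drops)
```

A large drop for a marker suggests it mattered for finding the predictive cluster; a drop near zero suggests it did not, at least given the other markers still present.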

Hope that helps a little....

chethanjjj commented 8 years ago

Ahh I see. Interesting, that's very similar to my approach (paragraph 2). We've also found that when we control the number of markers used for clustering and the MinimumClusterSize, we are able to tease out certain populations found in other papers (e.g. CD3+CD4+ and CD3+CD8+ populations). Thanks for getting back to me, Dr. Bruggner!

chethanjjj commented 8 years ago

Hi Dr. Bruggner, I'm trying to recapitulate your F1 score experiment. With most clustering algorithms, the events in each cluster are mutually exclusive, but with hclust this is not the case. Did you try to account for this? I see in Table S2 of your supplemental material that, for each of the FlowCAP-I datasets, you set a minimum and maximum cluster size. Did you do this so you could isolate the clusters with mutually exclusive events and remove clusters that contain multiple populations?

rbruggner commented 8 years ago

I did not try to account for the fact that cells belong to more than one cluster in the scoring - this was by design. However, comparing the precise F1 scores from a clustering algorithm that is trying to assign each cell to exactly one cluster with those from hierarchical clustering, which assigns each cell to multiple clusters, is not a fair comparison.

I used the F1 score to show that Citrus could re-identify manually gated populations at some point in the clustering hierarchy, but not to show that it identified those manually gated populations only. In the paper, I use the term clustering sensitivity because I want to make the claim that hierarchical clustering can identify manually gated populations, but absolutely, it does identify "false positive" populations during the process and the F1 score does not capture this false positive rate. For Citrus, I'm ok with the false positives because I just let the regularized classification models weed out the false positives when creating the predictive models.
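To make the "somewhere in the hierarchy" point concrete, here is a small sketch of one way the scoring could be reproduced. It assumes the hierarchy is available as a merge list in which leaves are numbered 0..n-1 and the i-th merge creates node n+i (the convention SciPy's `linkage` uses; R's `hclust` stores merges differently). All function names and the toy data are illustrative, not taken from Citrus.

```python
def clusters_from_merges(n_events, merges, min_size=1):
    """Every node of the hierarchy is a candidate cluster (a set of event IDs)."""
    members = {i: frozenset([i]) for i in range(n_events)}
    for step, (left, right) in enumerate(merges):
        members[n_events + step] = members[left] | members[right]
    # Keep only clusters with at least min_size events.
    return {cid: ev for cid, ev in members.items() if len(ev) >= min_size}


def f1(cluster, gated):
    true_pos = len(cluster & gated)
    # 2*TP / (|cluster| + |gated|) is algebraically equal to the usual F1.
    return 2 * true_pos / (len(cluster) + len(gated)) if true_pos else 0.0


def best_f1(clusters, gated):
    """Best F1 achieved by any cluster in the hierarchy for one gated population."""
    return max(f1(events, frozenset(gated)) for events in clusters.values())


# Toy hierarchy over 6 events: (0,1)->6, (2,3)->7, (6,7)->8, (4,5)->9, (8,9)->10
toy_merges = [(0, 1), (2, 3), (6, 7), (4, 5), (8, 9)]
toy_clusters = clusters_from_merges(6, toy_merges, min_size=2)
toy_gates = {"pop_A": {0, 1, 2, 3}, "pop_B": {4, 5}}

toy_best = {name: best_f1(toy_clusters, gate) for name, gate in toy_gates.items()}
n_clusters_with_event0 = sum(0 in events for events in toy_clusters.values())
print(toy_best, n_clusters_with_event0)
```

In this toy case both gated populations are matched by some node, yet event 0 sits in three nested clusters and most nodes match no gate at all. Taking the maximum per gated population rewards recovery but never penalizes those extra clusters, which is why the resulting numbers are not comparable to F1 scores from algorithms that partition the events.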

To understand this logic, we need a little bit of backstory: the flow cytometry informatics community has consistently focused on automated re-identification of populations that have been identified using manual gating. While that is one objective, it's not the objective of Citrus. The objective of Citrus is to identify any population that's predictive of some outcome variable, and to do that, it doesn't necessarily need to identify the same populations that we identify using manual gating. However, I was unable to get my paper through review without addressing the issue of "does citrus/hclustering identify manually gated populations" (which I again think is a different goal). To address this, I evaluated hierarchical clustering's ability to re-identify populations in the FlowCAP datasets. I think the results show that hierarchical clustering can re-identify those populations somewhere in the clustering hierarchy, but of course, it certainly identifies plenty of false-positive populations (Figure 3b). Again, hierarchical clustering does not assign each cell to precisely one cluster, and the scoring metric does not account for this, but that is not the objective of Citrus.