markowetzlab / scDNAseq-workflow

calling absolute copy numbers on single cell DNA sequencing data - stagging repo for public release
4 stars 2 forks source link

pseudodiploid/haploid cells as output #27

Closed Urja25 closed 1 month ago

Urja25 commented 2 months ago

Hi,

When I ran scAbsolute with the default ploidy window I noticed that a lot of cells (~50%) had copy number of 1 across the genome, which is not something we expect to see in the samples (blood cells, should be mainly diploid). Therefore, I adjusted the ploidy window to be between 1.2-2.9 and this resulted in a much lower proportion of cells with CN across the genome as 1, yet, about 10-15% of such "haploid-looking" cells remain. I am wondering why this must be the case since the rpc values are good and reads in the cell are also sufficient for CN calling. At first, I noticed this with just my sample, however, we are observing the same phenomenon in fibroblasts for example. A bit strange since we dont see this when running HMMCopy.

Thanks a lot!

leachim commented 2 months ago

Hey, that does seem a bit odd. In the first place, I'd always run scAbsolute with the setting of minPloidy at 1.1 or 1.2, as it's set by default. The reason is that you wouldn't typically see cells with a ploidy of 1, as they shouldn't survive with so many genome losses, and the pipeline selects the minimum ploidy by default - if you set it to one and it is a truly diploid cell (without X chromosome) it will predict a ploidy of one which is obviously wrong. If you have samples from a female, it also sometimes helps to include the X chromosome for ploidy estimations (but copy number calls on the X chromosome tend to be less reliable). This is not a problem for maxPloidy, because scAbsolute tends to prefer lower solutions if plausible. For the remaining haploid looking cells, can you send a screenshot or describe a bit more in detail what they look like? scAbsolute shouldn't call anything with a ploidy less than the minimum ploidy, so I assume the predicted ploidy is somewhere between 1 and 2?

Urja25 commented 2 months ago

Also, the ploidy parameter you see on the heatmap is actually the "ploidy.mod" feature in the QDNAseq object which I realized does not represent the ploidy value, rather the average CNA across genome per cell if I am not mistaken. So please ignore that!

leachim commented 2 months ago

Okay, so a few things here. I think first you should run the cell cycle classifier to exclude cycling cells (cells in S-phase). It should be fairly straightforward using the tutorials in this repo to do that. I think some of the cells are probably just cycling, by the looks of it. If you use a cell line this can be quite a large number of your cells (> 20%), see the manuscript for examples. And then some of the cells just look like the sequencing has failed, e.g. the very last picture and number 5. I don't think the methods are currently reliable enough to identify a single weird cell as a monster cell, it's much more likely it's just a sequencing/sample processing artefact. The waviness you see on the first plot is just GC content, I think there is an option to plot a corrected image with the plotCopynumber function in the package. The second manuscript scUnique has an integrated copy number caller that will give you better calls and deal with that better than scAbsolute, but it depends what you want to achieve with your analysis.

leachim commented 2 months ago

Also for the filtering I think I have one tutorial just for that, have a look at that for some inspiration. If you have very low coverage (rpc < 25 at 1MB resolution, but for that the ploidy estimate has to be correct, because rpc depends on the ploidy estimate), you could run the analysis for all cells with rpc > 25, identify the ploidy peaks and then rerun the analysis with very constrained ploidy estimates and potentially a more sensitive segmentation. That's a bit of a hack, but it might give you more cells to work with. Depends a bit on how many cells fall foul of the filter....

leachim commented 2 months ago

Oh, and I made a mistake, I meant male sample is easier with one X chromosome, because you have a step between X and the other chromosomes in normal samples. So might be worth keeping that in depending on whether you care more about accurate ploidy or accurate copy number calls on the X chromosome. I would not keep the Y chromosome in, it's just very unreliable and usually not too informative either.

leachim commented 2 months ago

Having noticed that there are still some cells that may have a ploidy < 2 or > 2, which could represent polyploid, or pseudiploid cells respectively, I set a ploidy threshold of 1.8-2 within this dataset such that I only retain "diploid" cells. This is obviously not a perfect method since the cells with ploidy >2 could represent cells of interest. However, I currently don't know how to distinguish G2 cells from pathologically polyploid cells. The heatmap and single cell plots are as follows:

I think that's too strict. I would set it to median(ploidy) +- 0.5, otherwise you might exclude some perfectly fine cells.

leachim commented 2 months ago

But by the looks of it, you still have quite a lot of outliers.

leachim commented 2 months ago

You could try to make the cell cycle classifier a little bit stricter, but looking at the distribution that might mean excluding a fair number of good cells, too. Maybe try first to run the other qc methods (qc_gini and qc_mapd) that are described in the vignette. I have given you access to the scUnique part of the pipeline, if you run that, you should also get a better phylogeny. Unfortunately, you have to try a little bit with the thresholds, there is no good automatic way to fix the thresholds for good/bad cells.

Urja25 commented 2 months ago

Thanks, I am going to attempt scUnique as well. For now, I am deleting my comments with heatmaps.

leachim commented 2 months ago

Okay, let me know how that goes. The output should be a bit more informative, especially as you can see whether the noisy cells have any sort of structure. Hard to see that in the other plots.

Urja25 commented 1 month ago

Hi Michael. So I re-ran scAbsolute for another sample (min ploidy 1.2, max ploidy 5) without considering sex chromosomes (at 1000kb). All cells passed with the first run of scAbsolute. I then ran the qc script, having changed the thresholds for gini and mapd arbitrarily (the ones used in the original qc script are too stringent for our scDNA method). Here is the code:

` LFS01CP_12_5 <- readRDS("/home/u855h/chromothripsis/urja/scDNAseq-workflow2/results/1000/LFS01CP_12_5_1000.rds")

df = Biobase::pData(LFS01CP_12_5) df df <- df %>% dplyr::mutate(UID = "LFS01CP", SLX = "LFS01CP")

1. number of reads extreme outliers

reads_cutoff = quantile(df$used.reads, c(0.025, 0.975)) # by default we remove the top and bottom 2.5% of cells with very high or low read number p_coverage = ggplot(data=df %>% dplyr::select(rpc, used.reads, UID, SLX)) + geom_histogram(aes(x=used.reads, y=..density.., color=interaction(UID, SLX)), bins = 100, position="identity", fill="white") + geom_density(aes(x=used.reads, color=interaction(UID, SLX))) + geom_vline(xintercept = reads_cutoff, color="red") + theme_pubclean() issue_coverage = colnames(LFS01CP_12_5)[(Biobase::pData(LFS01CP_12_5)[["used.reads"]] < reads_cutoff[[1]]) | (Biobase::pData(LFS01CP_12_5)[["used.reads"]] > reads_cutoff[[2]])] p_coverage if(any(Biobase::pData(LFS01CP_12_5)[["rpc"]] < 25)){ warning("Some cells have an rpc values of < 25!\nThis might indicate that you are under powered to detect rCNAs and copy number alterations at this bin size.") print(Biobase::pData(LFS01CP_12_5)[["name"]][Biobase::pData(LFS01CP_12_5)[["rpc"]] < 25]) }

manually add cells that should be excluded

issue_manual = c() exclude_cells = base::union(issue_coverage, issue_manual) stopifnot(all(is.character(exclude_cells))) include = setdiff(colnames(LFS01CP_12_5), exclude_cells)

LFS01CP_12_5 = LFS01CP_12_5[, include] df = df %>% dplyr::filter(!(name %in% base::union(issue_coverage, issue_manual))) stopifnot(dim(df)[[1]] == dim(LFS01CP_12_5)[[2]]) print(paste0("ISSUES: ", base::union(issue_coverage, issue_manual)))

df df$mapd summary(df$mapd) df$gini_normalized summary(df$gini_normalized)

df <- df %>% dplyr::filter(!is.na(gini_normalized))

PARAMETERS ====

cutoff_replicating = 1.0 cutoff_replicating_iqr = 1.5 cutoff_replicating_hard = NULL

MAPD

mapd_cutoff = 0.1 mapd_density_control = FALSE

Gini

gini_norm_cutoff = 0.2 gini_density_control = FALSE

alpha

alpha_cutoff = 1.5 alpha_hard_cutoff = 0.05 # 0.05: about 99 percentile of data for 500 kb print(paste0("Number of cells: ", dim(LFS01CP_12_5)[[2]]))

Identify replicating and outlier cells ====

2. S phase cells

df = predict_replicating(df %>% dplyr::mutate(UID2 = UID, SLX2=SLX), cutoff_value=cutoff_replicating, iqr_value = cutoff_replicating_iqr, hard_threshold = cutoff_replicating_hard) fig_cellcycle = ggplot(data = df) + geom_quasirandom(aes(x=SLX, y=cycling_activity, color=replicating, name=name), size=0.8, alpha=1.0) + geom_hline(aes(yintercept=cycling_cutoff_sd)) + geom_hline(aes(yintercept=cycling_median), linetype="dashed") + facet_wrap(~UID, scales = "free_x") + theme_pubclean() fig_cellcycle

issue_sphase = df$name[df$replicating]

df$cycling = df$name %in% issue_sphase table(df$cycling) prop.table(table(df$cycling, df$UID), margin = 2)

3. QC control (after removal of cycling cells)

iq = qc_alpha( qc_gini_norm( qc_mapd(df %>% dplyr::filter(!replicating), cutoff_value = mapd_cutoff, densityControl=mapd_density_control), cutoff_value = gini_norm_cutoff, densityControl = gini_density_control), cutoff_value=alpha_cutoff, hard_cutoff = alpha_hard_cutoff) iq$outlier = iq$dmapd.outlier | iq$dgini.outlier | iq$alpha.outlier aq = dplyr::left_join(df, iq, by="name", suffix=c("", ".y")) %>% dplyr::mutate(outlier = case_when( replicating ~ "replicating", name %in% iq$name[iq$outlier] ~ "outlier", TRUE ~ "pass"))

alpha_value = 0.7 scale_values = c(alpha_value, 0.1, alpha_value, alpha_value)

4. Ploidy (cells with known wrong ploidy can be removed)

issue_ploidy = base::union(iq$name[iq$ploidy < 1.1], iq$name[iq$ploidy > 8.0])`

where:

summary(df$mapd) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.07655 0.20317 0.29992 0.30909 0.40495 0.58653 summary(df$gini_normalized) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.06329 0.12960 0.18764 0.18809 0.24773 0.54297 17

table(df$cycling) FALSE TRUE 5414 1553 prop.table(table(df$cycling, df$UID), margin = 2) LFS01CP FALSE 0.777092 TRUE 0.222908

table(df$remove) FALSE TRUE 3028 3939

print(prop.table(table(df$keep, df$UID), margin=2))
LFS01CP FALSE 0.7882876 TRUE 0.2117124

print(table(df$keep, df$UID))
LFS01CP FALSE 5492 TRUE 1475

image

image

image

image

image

I am now running scUnique for this sample. In the mean time, could you help me make sense of the QC I am observing?

Thanks a ton!

leachim commented 1 month ago

Looking at your ploidy plot, it doesn't seem correct to put the max ploidy cutoff at 5. I'd keep it at 8 and see how that changes things for the right part of the distribution. It's a bit surprising that you don't see any peak for the non-diploid population, but I would need to know more about the biology of the cells to make sense of it. More generally, you need to be fine with the QC part before you progress to running scUnique, since it should use only the cells that passed the QC analysis. Otherwise you can't trust the results. Most important, I'd recommend to split up the QC into two parts. Separate the cell populations based on ploidy and run it separately for the diploid and non-diploid cells, e.g. cells with ploidy > 3 based on your last plot. This way, the QC plots should look much better. The issue you observe stems from the fact that you mix two very different cell populations and you have many more diploid cells, so the fitting isn't right. redo the analysis with this split and the standard parameters and see if it makes more sense. Actually, may I use the last plot to add a point to the FAQ about this? It seems like an obvious issue people might run into.

Urja25 commented 1 month ago

Is it possible to email you regarding the biology of the samples? I would like to have this information protected. I cannot seem to find your email address.

leachim commented 1 month ago

You can just find my email on this page, if you look for the first author on the scAbsolute paper. https://bsse.ethz.ch/cbg/group/people.html#s