saeyslab / CytoNorm

R library to normalize cytometry data

nClus parameter not working #14

Open emmanuelaaaaa opened 4 years ago

emmanuelaaaaa commented 4 years ago

Hello again :),

I have been running into a weird issue: I specify the number of clusters as e.g. 20, FlowSOM runs with nClus=20 and the CV plot looks OK, but the training only uses 10 clusters, so it prints "Processing cluster 1..." up to 10. The same happens in the actual normalisation, which also seems to use only 10 clusters. Any idea what's happening there?

Many thanks and best wishes, Emma

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.10 (Final)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
[1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1
[5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1 LC_PAPER=en_GB.ISO-8859-1 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] flowCore_1.52.1 FlowSOM_1.18.0 igraph_1.2.5 dplyr_1.0.0 CytoNorm_0.0.5 optparse_1.6.6

loaded via a namespace (and not attached):
[1] Biobase_2.46.0 splines_3.6.3 jsonlite_1.6.1 ConsensusClusterPlus_1.50.0 R.utils_2.9.2
[6] ellipse_0.4.2 gtools_3.8.2 RcppParallel_5.0.1 stats4_3.6.3 latticeExtra_0.6-29
[11] RBGL_1.62.1 flowWorkspace_3.34.1 yaml_2.2.1 robustbase_0.93-6 pillar_1.4.4
[16] lattice_0.20-41 glue_1.3.2 digest_0.6.25 RColorBrewer_1.1-2 colorspace_1.4-1
[21] ggcyto_1.14.1 Matrix_1.2-18 R.oo_1.23.0 plyr_1.8.6 pcaPP_1.9-73
[26] XML_3.99-0.3 pkgconfig_2.0.3 pheatmap_1.0.12 tsne_0.1-3 fda_5.1.4
[31] zlibbioc_1.32.0 purrr_0.3.4 corpcor_1.6.9 mvtnorm_1.1-1 scales_1.1.1
[36] jpeg_0.1-8.1 getopt_1.20.3 openCyto_1.24.0 flowStats_3.44.0 tibble_3.0.1
[41] generics_0.0.2 ggplot2_3.3.1 ellipsis_0.3.1 flowViz_1.50.0 BiocGenerics_0.32.0
[46] hexbin_1.28.1 mnormt_1.5-6 magrittr_1.5 crayon_1.3.4 IDPmisc_1.1.20
[51] mclust_5.4.6 ks_1.11.7 R.methodsS3_1.8.0 MASS_7.3-51.6 graph_1.64.0
[56] tools_3.6.3 data.table_1.12.8 ncdfFlow_2.32.0 flowClust_3.24.0 lifecycle_0.2.0
[61] matrixStats_0.56.0 stringr_1.4.0 munsell_0.5.0 cluster_2.1.0 compiler_3.6.3
[66] rlang_0.4.6 grid_3.6.3 base64enc_0.1-3 gtable_0.3.0 rrcov_1.5-2
[71] R6_2.4.1 gridExtra_2.3 clue_0.3-57 CytoML_1.12.1 KernSmooth_2.23-17
[76] Rgraphviz_2.30.0 stringi_1.4.6 parallel_3.6.3 Rcpp_1.0.4.6 vctrs_0.3.0
[81] png_0.1-7 DEoptimR_1.0-8 tidyselect_1.1.0

emmanuelaaaaa commented 4 years ago

Hello, I have found what the issue was, so I thought I'd update here too. CytoNorm writes a tmp folder containing the FlowSOM clustering of the training data from prepareFlowSOM. I was running it in the same directory with different parameters (nClus). Even though I re-ran prepareFlowSOM each time with the new nClus, CytoNorm.train found the tmp directory already there and replaced the fsom object I had just computed with the previously saved one:

    if (!file.exists(file.path(outputDir, "CytoNorm_FlowSOM.RDS"))) {
...
    } else {
        fsom <- readRDS(file.path(outputDir, "CytoNorm_FlowSOM.RDS"))
        warning("Reusing previously saved FlowSOM result.")
    }

Easy fix: I now switch to a separate subdirectory, Norm_nClus#, every time I run the CytoNorm.train step.
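The per-run subdirectory workaround could be scripted roughly as follows. This is a minimal sketch in base R only: the helper `fresh_output_dir` and the `Norm_nClus<n>` folder naming are hypothetical, and it assumes the returned path is then used as the `outputDir` seen in the snippet above. It also assumes that deleting a stale `CytoNorm_FlowSOM.RDS` is what you want; skip that step if you intend to reuse the saved result.

```r
# Hypothetical helper: one output directory per nClus value, so a cached
# FlowSOM result from a run with different parameters is never picked up.
fresh_output_dir <- function(base_dir, n_clus) {
  out_dir <- file.path(base_dir, paste0("Norm_nClus", n_clus))
  cached <- file.path(out_dir, "CytoNorm_FlowSOM.RDS")

  # Remove a leftover cache file, e.g. from an interrupted earlier run.
  if (file.exists(cached)) {
    message("Removing stale cache: ", cached)
    unlink(cached)
  }

  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  out_dir
}

# Demo in a temporary location (assumption: any writable base directory works).
base_dir <- tempfile("cytonorm_demo_")
dir_5 <- fresh_output_dir(base_dir, 5)
dir_20 <- fresh_output_dir(base_dir, 20)
```

Each training run would then get its own `outputDir`, so a saved FlowSOM object from a different nClus cannot be reused by accident.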

There is still one thing that I don't fully understand and that looks a bit suspicious. Even though I'm training and fitting with different numbers of clusters, I get exactly the same warnings, with exactly the same proportions of cells that are far from their cluster centers. For example, with nClus=5 I get:

There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  887 cells (2.65%) seem far from their cluster centers.
2: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  2382 cells (2.73%) seem far from their cluster centers.
3: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  1021 cells (6.28%) seem far from their cluster centers.
4: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  4241 cells (4.58%) seem far from their cluster centers.
5: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  3813 cells (9.64%) seem far from their cluster centers.
6: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  3816 cells (24.13%) seem far from their cluster centers.
7: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  671 cells (2.97%) seem far from their cluster centers.
8: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  2111 cells (7.73%) seem far from their cluster centers.
9: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  857 cells (2.19%) seem far from their cluster centers.
10: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  1370 cells (6.58%) seem far from their cluster centers.

... And exactly the same with nClus=20:

There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  887 cells (2.65%) seem far from their cluster centers.
2: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  2382 cells (2.73%) seem far from their cluster centers.
3: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  1021 cells (6.28%) seem far from their cluster centers.
4: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  4241 cells (4.58%) seem far from their cluster centers.
5: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  3813 cells (9.64%) seem far from their cluster centers.
6: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  3816 cells (24.13%) seem far from their cluster centers.
7: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  671 cells (2.97%) seem far from their cluster centers.
8: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  2111 cells (7.73%) seem far from their cluster centers.
9: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  857 cells (2.19%) seem far from their cluster centers.
10: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  1370 cells (6.58%) seem far from their cluster centers.

I admit that this might be a coincidence with just the first cluster being the same but I was wondering if you have any ideas on how to explore further. Thanks, Emma

SofieVG commented 4 years ago

Hi Emma,

I think this is because the underlying number of clusters in the FlowSOM tree does not change, so the mapping of the cells will be similar. The metaclustering (set by nClus) only happens afterwards (clustering the clusters), whereas this outlier detection is done at the individual cluster level (100 clusters by default if you did not change these parameters). If the seed is fixed, I would indeed expect this to be the same every time.
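To make the two-level structure concrete, here is a toy sketch in base R. It is not FlowSOM: k-means stands in for the first-level SOM clustering, hierarchical clustering of the centers stands in for the metaclustering, and the "far from center" rule (median + 4 MAD of the distances within each first-level cluster) is an assumption chosen for illustration. All names are hypothetical.

```r
# Toy two-level clustering: first-level nodes, then metaclusters of nodes.
run_pipeline <- function(x, grid_nodes, n_clus, seed = 1) {
  set.seed(seed)

  # First level (stand-in for the SOM grid): does NOT depend on n_clus.
  km <- kmeans(x, centers = grid_nodes, iter.max = 100)
  node <- km$cluster

  # Distance of each cell to the center of its first-level cluster.
  d <- sqrt(rowSums((x - km$centers[node, , drop = FALSE])^2))

  # Flag cells that are far from their center, per first-level cluster.
  thr <- tapply(d, node, function(v) median(v) + 4 * mad(v))
  n_far <- sum(d > thr[as.character(node)])

  # Second level: group the first-level centers into n_clus metaclusters.
  meta <- cutree(hclust(dist(km$centers)), k = n_clus)[node]

  list(n_far = n_far, n_meta = length(unique(meta)))
}

set.seed(42)
x <- matrix(rnorm(2000 * 2), ncol = 2)

res_5 <- run_pipeline(x, grid_nodes = 16, n_clus = 5)
res_10 <- run_pipeline(x, grid_nodes = 16, n_clus = 10)
```

With the same seed and the same number of first-level clusters, `res_5$n_far` and `res_10$n_far` are identical. Only the grouping of those clusters into metaclusters differs, which matches the identical warnings for nClus=5 and nClus=20.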

All the best, Sofie

tomashhurst commented 4 years ago

@emmanuelaaaaa that tmp folder thing is a subtle trap, so well done for noticing it! Always worth checking to see if it's still there, which might happen if CytoNorm gets interrupted.

In terms of the later warning you mention:

There were 50 or more warnings (use warnings() to see the first 50)
Warning messages:
1: In FlowSOM::NewData(fsom$FlowSOM, ff) :
  887 cells (2.65%) seem far from their cluster centers.

It would be the same each time because, as @SofieVG said, the first level of clustering generates the same number of clusters (~100), and the metaclustering then groups these into 5 or 20 metaclusters, etc. One reason for the warning itself might be that your data is very variable between batches, so the clusters are capturing cells that are actually quite spread out. You could try increasing the number of first-level clusters (by increasing the 'grid size': xdim = 10 and ydim = 10 results in 10 x 10 = 100 clusters) to capture this. If your data has small batch effects, then this is more likely because you are capturing cells from different populations in each first-level cluster, and the solution would again be to try an increased grid size.
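The grid-size arithmetic can be sketched as below. This is a hedged illustration in base R: `som_settings` is a hypothetical helper, the parameter names `xdim`, `ydim` and `nClus` are taken from this thread, and how the resulting list is handed to CytoNorm or FlowSOM is deliberately left out because it is not shown here. The check that nClus is smaller than the number of first-level clusters is an assumption about what makes sense for metaclustering.

```r
# Hypothetical helper: collect grid settings and report the number of
# first-level clusters (xdim * ydim) that the metaclustering will group.
som_settings <- function(xdim, ydim, n_clus) {
  n_nodes <- xdim * ydim
  if (n_clus >= n_nodes) {
    stop("nClus should be smaller than xdim * ydim (", n_nodes, ")")
  }
  list(
    params = list(xdim = xdim, ydim = ydim, nClus = n_clus),
    n_nodes = n_nodes
  )
}

default_grid <- som_settings(xdim = 10, ydim = 10, n_clus = 20)  # 100 first-level clusters
larger_grid <- som_settings(xdim = 15, ydim = 15, n_clus = 20)   # 225 first-level clusters
```

Changing nClus alone leaves `n_nodes` untouched. Only a different xdim or ydim changes the first-level clusters, and with them the "far from their cluster centers" counts.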