markowetzlab / CNsignatures

This is data and code from our first paper on copy number signatures (Macintyre et al., Nat Gen, 2018).
https://www.nature.com/articles/s41588-018-0179-8
MIT License

generateSignatures error #1

Open annahoge opened 4 years ago

annahoge commented 4 years ago

Hello, and thank you for your tool!

I am able to run CNSignatures with default signatures for a dataset of 74 ultra low-pass WGS (albeit only when I change the distribution from Gaussian to Poisson for the "changepoint" feature mixture model--otherwise I get an error).

When I try to generate my own signatures, however, the step generateSignatures gives the error "Error in rowSums(t_df) : 'x' must be an array of at least two dimensions". Additionally, chooseNumberSignatures uses >5,000% CPU even though it looks to me like it is set to use one core.

The steps I used were:

cn_features <- extractCopynumberFeatures(list_of_segment_tables)
generated_components <- fitMixtureModels(cn_features)
sample_by_generated_component_matrix <- generateSampleByComponentMatrix(cn_features, generated_components)
number_signatures <- chooseNumberSignatures(sample_by_generated_component_matrix)
chosen_num_signatures <- 7
component_by_signature <- generateSignatures(sample_by_generated_component_matrix, chosen_num_signatures)

Here is what my chooseNumberSignatures plot looks like:

[Image: chooseNumberSignatures plot, "thursday_run_100_iter"]

Do you have any insight into how to get generateSignatures to work? And how to make chooseNumberSignatures use less CPU? Please let me know what other information would be helpful to you.

Thank you so much, Anna

Martingales commented 4 years ago

Hi Anna,

Thanks for your interest in our work! Let me try to unpack your situation with a few questions:

1) You have 74 shallow WGS (sWGS). Are they ovarian cancer? => If so, I suggest you use the pre-defined 7 ovarian copy number signatures.

To calculate the exposures you would need to run three functions:

cnFeats =  extractCopynumberFeatures(list_of_segment_tables)
SxCMat = generateSampleByComponentMatrix(cnFeats)
expMat = quantifySignatures(SxCMat)

This assumes that your working directory is the folder of the CNsignatures package. If not, you would need to specify the allComponents option of the function generateSampleByComponentMatrix. You can find the pre-defined components under data/component_parameters.rds. Same for the quantifySignatures function where you would need to supply the pre-defined signatures with the option component_by_signature. You can find them under data/feat_sig_mat.rds.

=> If not, derivation of new signatures could be done given a few constraints.
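
For the pre-defined-signature route in point 1, the three calls could be wrapped as below when the working directory is not the package folder. This is only a sketch and has not been tested against the package: `pkg_dir` and both helper names are hypothetical, the components are passed as the second argument as in the original post, and `component_by_signature` is the option name given above.

```r
# Hypothetical helper: locate the pre-defined files inside a local clone
# of the CNsignatures repository (pkg_dir is an assumed path).
predefined_paths <- function(pkg_dir) {
  list(
    components = file.path(pkg_dir, "data", "component_parameters.rds"),
    signatures = file.path(pkg_dir, "data", "feat_sig_mat.rds")
  )
}

# Hypothetical wrapper around the three calls listed above. It assumes the
# CNsignatures functions have already been loaded into the session.
quantify_with_predefined <- function(list_of_segment_tables, pkg_dir) {
  p <- predefined_paths(pkg_dir)
  stopifnot(file.exists(p$components), file.exists(p$signatures))
  cnFeats <- extractCopynumberFeatures(list_of_segment_tables)
  # Components are passed as the second argument, as in the original post.
  SxCMat <- generateSampleByComponentMatrix(cnFeats, readRDS(p$components))
  quantifySignatures(SxCMat, component_by_signature = readRDS(p$signatures))
}
```

Checking that both files exist before any feature extraction runs means a wrong `pkg_dir` fails immediately rather than after the slow steps.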

2) What does your copy number look like? Which tool do you use to generate the segmentation? Do you use unrounded copy number?

=> For sWGS, we currently recommend QDNAseq: http://bioconductor.org/packages/release/bioc/html/QDNAseq.html

=> Since you mention using Poisson models for the changepoint distribution, I guess you use rounded copy number (i.e. integers: 1, 2, 3, ...). Unrounded copy number (i.e. floats: 1.23, 2.49, ...) is important to us because rounding hides interesting information: 1.51 and 2.49 would both be rounded to 2, masking potential subclonal gains or losses.
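
The rounding effect can be seen directly in base R. The two segment values below are the ones quoted above; everything else is illustrative.

```r
# Unrounded (float) copy numbers for two hypothetical segments.
unrounded <- c(1.51, 2.49)

# After rounding, both segments collapse to the same integer state (2).
rounded <- round(unrounded)
print(rounded)

# The signed distance from the nearest integer is what rounding discards;
# it is the part that could hint at a subclonal gain or loss.
discarded <- unrounded - rounded
print(discarded)
```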

Let me know how you get along.

annahoge commented 4 years ago

Hi Ruben,

Thank you so much for your prompt and thoughtful reply!

Our samples are from prostate cancer. I am using ichorCNA off-target to call copy number (a targeted sequencing panel was used for these tumors and ichorCNA off-target treats the off-target reads like sWGS; https://github.com/GavinHaLab/ichorCNA_offtarget).

Thank you for your note about rounded vs. unrounded copy number. I was previously using rounded copy number, but switching to unrounded copy number does indeed fix the changepoint distribution problem I was having. CNSignatures with the pre-defined 7 signatures now runs as expected.

When trying to derive new signatures, however, I am still getting the same rowSums error as before. This time I ran with chosen_num_signatures <- 6 (I don't fully understand how to identify the "point of stability in the cophenetic, dispersion and silhouette coefficients", but picked 6 as it looks to be the "maximum sparsity achievable above the null model for the basis matrix"--https://www.nature.com/articles/s41588-018-0179-8/figures/8). Here is my chooseNumberSignatures plot this time:

[Image: chooseNumberSignatures plot, "Screen Shot 2020-05-08 at 11 43 53 AM"]

When I try to inspect my component_by_signature variable that I am passing in to generateSignatures, I see:

component_by_signature
Object of class: NMFfitX1
Method: brunet
Runs: 1000
RNG: 10407L, 441882131L, 319300664L, -879503143L, 1614850758L, 121294479L, -1792824444L
Total timing:
   user  system elapsed
954.070  14.329 969.234
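
The message itself is a generic base R error: `rowSums()` raises it whenever its argument is neither a data frame nor an array with at least two dimensions. The sketch below reproduces it with a plain vector and adds a hypothetical guard, `check_matrix_input`, for inspecting an object before it reaches code that calls `rowSums()`. Whether a non-matrix object such as the NMF fit shown above is the cause here is an assumption; the thread does not confirm it. If it is, the NMF package's `basis()` accessor is the usual way to pull the basis matrix out of a fit object.

```r
# A plain numeric vector has no dim attribute, so rowSums() rejects it.
not_a_matrix <- c(a = 1, b = 2, c = 3)
msg <- tryCatch(rowSums(not_a_matrix),
                error = function(e) conditionMessage(e))
print(msg)

# Hypothetical guard (not part of CNsignatures): TRUE when rowSums()
# would accept the object, FALSE with a hint otherwise.
check_matrix_input <- function(x) {
  if (is.data.frame(x) || (is.array(x) && length(dim(x)) >= 2L)) {
    return(TRUE)
  }
  message("Object of class '", class(x)[1], "' is not a matrix; ",
          "extract a matrix from it before calling rowSums().")
  FALSE
}

check_matrix_input(not_a_matrix)            # FALSE, with a hint
check_matrix_input(matrix(1:6, nrow = 2))   # TRUE
```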

Thank you so much for your help, Anna

Martingales commented 4 years ago

Hi Anna,

As far as I see it, there are two challenges in your analysis: Prostate as a cancer type and the computational aspect of deriving signatures. At this point it might be important to discuss your sample cohort because most downstream problems are alleviated once you have a nice grip on your samples.

Here are a few recommendations to tackle both challenges:

1) The presence of chromosomal instability (CIN). Prostate cancer samples, as far as I am aware, do suffer from CIN but to a lesser degree than ovarian cancer. Do all of your samples show clear CIN? What do your copy number profiles look like? We currently use 20 CNAs per sample as a cutoff for detectable CIN. We will publish a preprint soon which will explain how we derive this threshold. Choosing only samples with 20 or more CNAs gives you a chance to avoid noise, because in the end you are interested in mutational processes resulting in CNAs. Samples with only a few CNAs might carry spurious CNAs without a proper mutational process being active in the background. This could skew your analysis.
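
The 20-CNA cutoff can be applied as a simple pre-filter on the list of segment tables. The sketch below is an illustration only: the thread does not define how a CNA is counted, so the rule used here (a segment whose copy number differs from a diploid baseline of 2 by more than 0.5) is an assumption, as are the `segVal` column name, the function names, and the toy data.

```r
# Assumed counting rule: a segment is a CNA when its copy number differs
# from the baseline by more than tol. Not the authors' definition.
count_cnas <- function(seg, baseline = 2, tol = 0.5) {
  sum(abs(seg$segVal - baseline) > tol)
}

# Keep only samples with at least min_cnas CNAs under the rule above.
filter_by_cna_count <- function(segment_tables, min_cnas = 20L) {
  n_cnas <- vapply(segment_tables, count_cnas, integer(1))
  segment_tables[n_cnas >= min_cnas]
}

# Toy input: one sample with 25 altered segments, one with a single one.
toy <- list(
  high_cin = data.frame(segVal = rep(c(1.1, 3.4, 2.0), times = c(15, 10, 5))),
  low_cin  = data.frame(segVal = c(2.0, 2.1, 3.2, 1.9))
)

kept <- filter_by_cna_count(toy)
print(names(kept))
```

Only `high_cin` survives the filter; `low_cin` has a single segment outside the tolerance band.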

2) Segmentation algorithms. I have no experience with ichorCNA. Previously, we have used CopywriteR for segmentations based on off-target reads with generally good results compared to shallow WGS. If you are interested, @lm687 might be able to help you.

But, and this is a huge problem, segmentation algorithms differ wildly in their results. This again has downstream effects on signature generation. The problem is becoming so important that our lab is developing its own segmentation algorithm for shallow WGS. That doesn't help you right now, but it should give you an indication of how important proper segmentation is.

3) Choosing signatures. To give you more practical advice regarding your chooseNumberSignatures plot: 6 seems like a sensible choice. 5 may also be a good choice.

As you already mentioned, and as the paper briefly describes, a good start is to look at the sparseness plot and see at which factorisation (K) the randomness overtakes the observed sparseness: that is the dashed red line versus the solid red line. This gives you an upper bound on how many signatures you might have. The intuition behind this plot is that if the basis matrix (the signature definitions) carries more sparseness than expected by chance, then we capture biological signal. At K = 6 the observed sparseness is close to, but still above, that of the random matrices, so 5 might be a good choice as well. Given the other plots, there is little difference between 5 and 6 signatures. To me both solutions would make sense from a computational point of view.
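
The upper-bound rule described above can be written down as a small helper. This is a sketch under stated assumptions: the function name is hypothetical, the inputs are the basis sparseness values for the observed and the randomised data at each K, and the numbers are made up for illustration rather than read from the plot.

```r
# Return the largest K before randomised sparseness first overtakes the
# observed sparseness; NA if the observed value is never above the null.
max_k_above_null <- function(k, observed, randomised) {
  stopifnot(length(k) == length(observed), length(k) == length(randomised))
  ord <- order(k)
  k <- k[ord]
  above <- (observed > randomised)[ord]
  first_fail <- which(!above)[1]
  if (is.na(first_fail)) return(max(k))       # never overtaken
  if (first_fail == 1L) return(NA_integer_)   # never above the null
  k[first_fail - 1L]
}

# Made-up sparseness values for K = 3..10 (illustration only).
k <- 3:10
observed   <- c(0.80, 0.78, 0.74, 0.70, 0.62, 0.58, 0.55, 0.50)
randomised <- c(0.60, 0.62, 0.65, 0.68, 0.69, 0.70, 0.71, 0.72)

upper_bound <- max_k_above_null(k, observed, randomised)
print(upper_bound)
```

With these toy values the observed curve stays above the randomised one up to K = 6, so 6 is the upper bound. The final choice at or below that bound still depends on the cophenetic, dispersion and silhouette panels.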

The NMF per se is a computational exercise; it will always give you an answer. Whether it is biologically useful and tells you something about prostate cancer biology depends on the data quality and the interpretation of the results.

4) The rowSums error. I don't have your code or any example data to reproduce the error, so I'm afraid I cannot help you with this at the moment.

Hope that helps.