parklab / MuSiCal

A comprehensive toolkit for mutational signature analysis

Challenges with Model Validation #61

Closed: laurelhiatt closed this issue 2 years ago

laurelhiatt commented 2 years ago
[Screenshot, 2022-06-22: validation output repeatedly printing "Extracting signatures for n_components = 2" along with a warning]

Hello! Thank you for sharing MuSiCal. I was able to extract de novo signatures, which was really exciting, but I am now stuck at the step of the example full pipeline where parameter optimization with in silico validation is meant to take place. I would love your help so that I can match to the right number of signatures. Thank you for your time.

laurelhiatt commented 2 years ago

This code ran for several days until I stopped it, repeatedly printing this error and the n_components = 2 message.

Hu-JIN commented 2 years ago

Hi. Could you show the code used for the de novo extraction and signature assignment on a grid of thresholds, before the validation? How many de novo signatures were extracted from your dataset? Was it 2? The message you showed is a warning which will always be there when de novo extraction is performed for a fixed number of signatures, which is always the case during validation. So the warning message should not be the problem.

laurelhiatt commented 2 years ago

Oh! Hm. I guess I thought the warning message was a concern because the code seemed to be looping: it printed

"Extracting signatures for n_components = 2........."

over and over for about 50 hours. The original signature extraction took about 7 hours, I think. But maybe I didn't give it enough time?

```python
model = musical.DenovoSig(
    x,
    min_n_components=1,   # Minimum number of signatures to test
    max_n_components=20,  # Maximum number of signatures to test
    init='random',        # Initialization method
    method='mvnmf',       # mvnmf or nmf
    n_replicates=20,      # Number of mvnmf/nmf replicates to run per n_components
    ncpu=10,              # Number of CPUs to use
    max_iter=100000,      # Maximum number of iterations for each mvnmf/nmf run
    bootstrap=True,       # Whether or not to bootstrap X for each run
    tol=1e-8,             # Tolerance for claiming convergence of mvnmf/nmf
    verbose=1,            # Verbosity of output
    normalize_X=False     # Whether or not to L1 normalize each sample in X before mvnmf/nmf
)
model.fit()
```

This gave me two de novo signatures with `fig = musical.sigplot_bar(model.W)`.

And then:

```python
thresh_grid = np.array([
    0.0001, 0.0002, 0.0005,
    0.001, 0.002, 0.005,
    0.01, 0.02, 0.05,
    0.1, 0.2, 0.5,
    1., 2., 5.
])
```
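As a side note, a 1-2-5 spaced grid like this can also be built programmatically rather than typed out. A minimal numpy sketch producing the same 15 values (not from the original thread, just for reference):

```python
import numpy as np

# Build the 1-2-5 threshold grid: mantissas [1, 2, 5]
# scaled by powers of ten from 1e-4 up to 1e0.
mantissas = np.array([1.0, 2.0, 5.0])
decades = 10.0 ** np.arange(-4, 1)  # 1e-4, 1e-3, 1e-2, 1e-1, 1e0
thresh_grid = np.outer(decades, mantissas).ravel()

print(thresh_grid.size)  # 15 values, from 0.0001 up to 5.0
```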

```python
W_catalog = catalog.W
print(W_catalog.shape[1])

model.assign_grid(
    W_catalog,
    method_assign='likelihood_bidirectional',  # Method for performing matching and refitting
    thresh_match_grid=thresh_grid,  # Grid of thresholds for matching
    thresh_refit_grid=thresh_grid,  # Grid of thresholds for refitting
    thresh_new_sig=0.0,    # De novo signatures with reconstructed cosine similarity below this threshold will be considered novel
    connected_sigs=False,  # Whether or not to force connected signatures to co-occur
    clean_W_s=True         # An optional intermediate step to avoid overfitting to small backgrounds in de novo signatures for 96-channel SBS signatures
)
```

I went through these steps, and then ran the validate grid from the screenshot I sent you. Should the validate grid take 10x longer than the de novo extraction? Maybe I need to throw it back on our cluster…

Thank you,

Laurel


Hu-JIN commented 2 years ago

Hi Laurel. Thanks for the additional information. The message "Extracting signatures for n_components = 2........." appearing over and over again does not mean that it is stuck in an infinite loop. Validation essentially redoes the de novo extraction with n_components = 2 (in your case) once for each grid point, so the same message is printed again and again.

The code you sent looks reasonable. In this case, thresh_grid has 15 values, so the entire 2-dimensional grid has 15 × 15 = 225 points. Some redundant grid points will be removed, but let's say we have 225 grid points. From your screenshot, each de novo extraction within validation takes ~350 seconds, so I estimate it will take approximately 350 × 225 = 78,750 seconds ≈ 22 hours to finish. Did you let it run for 50 hours and it still did not finish? If so, that is weird to me. Could you send over the entire output for me to take a look? I just want to see how many grid points have been analyzed. Did you run it on a cluster and request 10 CPUs?
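The estimate above can be sketched quickly (the ~350 s per extraction is just read off the screenshot, and real runs prune some redundant grid points, so treat this as a rough upper bound):

```python
# Back-of-the-envelope estimate of the validation runtime,
# assuming no redundant grid points are pruned.
n_thresh = 15                         # values in thresh_grid
n_grid_points = n_thresh * n_thresh   # 2-D grid of (match, refit) thresholds
sec_per_extraction = 350              # ~350 s per de novo extraction (from screenshot)

total_sec = n_grid_points * sec_per_extraction
print(n_grid_points)             # 225
print(total_sec)                 # 78750
print(round(total_sec / 3600))   # ~22 hours
```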

laurelhiatt commented 2 years ago

Yes, just over 50 hours. If it works for you, I might try to re-run it over the weekend and get back to you. I'm wondering if it was an issue with our cluster being unstable because we were close to storage capacity as a university, but that should be resolved now. I don't want to take up any more of your time if it was a computational issue on our end; if the issue repeats, I can let you know?

Thank you, Laurel


Hu-JIN commented 2 years ago

Sounds good. Let me know if it is still not resolved!

laurelhiatt commented 2 years ago

Okay so it turns out Slurm does not like me specifically, and I had a couple failed runs over the weekend, but it wasn’t a code issue, it was a cluster issue. Compromises were made.

Now I have output (hurray!!) and I was wondering if I could email you separately to check on my interpretation of the output and ask about suggested visualization strategies? If that is overstaying my welcome that is totally fair.


Hu-JIN commented 2 years ago

Glad it worked! I'll be happy to take a look, although I cannot guarantee how much input I can provide. You can find my email here: https://compbio.hms.harvard.edu/people/hu-jin. The edu one works.

Hu-JIN commented 2 years ago

Closing this issue now.