seb-mueller / chlamy_locus_map

Small RNA Locus Map for Chlamydomonas reinhardtii
GNU General Public License v3.0

Running MCA #15

Open nmatthews323 opened 5 years ago

nmatthews323 commented 5 years ago

Creating a new issue for MCA running.

I have separated the MCA analysis (which works out how many clusters and dimensions to use, and takes ages) from actually running the final MCA and clustering with those settings, which is relatively quick and is mostly a presentation and tidying exercise.

Once I have the correct annotated gr object to use, I will run the analysis script.

nmatthews323 commented 5 years ago

Hope you had a good Christmas, and have a nice NYE.

I've worked through running the MCA and doing the analysis to determine the number of dimensions from the MCA to include and the number of clusters to compute. There are now three scripts. One does the stability analysis etc., which takes a while (~2 hours). A separate script takes the analysis output and plots it. Finally, there is the core MCA script, which performs and plots the final MCA and HCPC based on the settings determined by the previous analysis. You'll find all the latest outputs and figures in "LociRun2018_multi200_gap100_90c7213_MCAOutputs_05c5bb5", including the heatmaps and chromosome tracks.

I've updated my original analysis script to use the code used in the Arabidopsis paper. This uses two metrics to determine the number of dimensions: variance explained for each additional dimension, and stability plots. The number of clusters is determined by stability plots, observed vs expected sum of squares error, and comparison of clusterings using NMI. I don't understand the stats behind these in depth, but it's using the same approach (and code) as was used for the Arabidopsis paper.

I think 6 clusters and 7 dimensions look like the appropriate settings:
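For readers unfamiliar with NMI (normalised mutual information), here is a minimal sketch of how two cluster assignments of the same loci can be compared. It is not the code from the Arabidopsis paper. The function names are hypothetical, and normalising by the arithmetic mean of the two entropies is an assumption; other normalisations exist and the analysis script may well use a different one.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a cluster assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two assignments of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (nij / n) * log(nij * n / (count_a[x] * count_b[y]))
        for (x, y), nij in joint.items()
    )


def nmi(a, b):
    """NMI, normalised by the mean of the two entropies (an assumption)."""
    if len(a) != len(b):
        raise ValueError("both assignments must cover the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        # Both assignments put everything in one cluster: treat as identical.
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)


if __name__ == "__main__":
    # Toy labels, purely illustrative (not data from this project).
    run_1 = [0, 0, 1, 1, 2, 2]
    run_2 = [1, 1, 0, 0, 2, 2]  # same partition, different label names
    print(nmi(run_1, run_2))
```

NMI only looks at which items are grouped together, so renaming the clusters does not change the score. That is what makes it usable for comparing clusterings from repeated or perturbed runs.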

In this one dimensions 6-10 all look reasonable: VarianceDimensions.pdf

In stability plot 6 clusters 7 dimensions looks nice and stable: StabilityPlotSmall.pdf

6 clusters looks pretty good here: SumSquares.pdf

And here... NMIPlots.pdf

Probably be good to have a chat through on the phone as well, but generally looks niceeee. I'll be back in the office properly on the 2nd Jan but then off skiing on the 12th.

nmatthews323 commented 5 years ago

Hi Seb, hope you're doing well!

Have you managed to take a look at these diagnostic plots? If you're ok with them I can run the final MCA and start making some nice plots.

seb-mueller commented 5 years ago

Just came round to look at it (the first couple of weeks with a newborn are more time-consuming than I thought...). Anyway, that's really good work, and having looked at the plots I agree with your verdict: there is a nice elbow at 7-8 dimensions, and choosing 7 or 8 doesn't affect the subsequent cluster stabilities, so I agree with going with 7 dimensions. The 6-cluster solution also seems legit to me since it comes up consistently in most plots! Only the 3-cluster solution seems like a bit of an alternative, but looking at featureMatrix_gr, it appears all 6 clusters have quite a few clearly distinctive features, so I'd be happy to roll with this! If it's not much effort, I'd be interested in what the 3-cluster solution might look like. Happy to chat on the phone anytime about how to proceed, maybe this week?
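As a rough illustration of the "variance explained for each additional dimension" view behind the elbow mentioned above, the sketch below turns a list of eigenvalues into per-dimension and cumulative percentages. The function name and the eigenvalues are made up for illustration; they are not values from this MCA, and this is not the plotting code used for VarianceDimensions.pdf.

```python
def variance_explained(eigenvalues):
    """Return (percent per dimension, cumulative percent) for eigenvalues."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    percent = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in percent:
        running += p
        cumulative.append(running)
    return percent, cumulative


if __name__ == "__main__":
    # Hypothetical eigenvalues, NOT taken from the chlamy_locus_map analysis.
    example = [0.30, 0.22, 0.15, 0.10, 0.08, 0.06, 0.05, 0.02, 0.01, 0.01]
    percent, cumulative = variance_explained(example)
    for dim, (p, c) in enumerate(zip(percent, cumulative), start=1):
        print(f"dim {dim:2d}: {p:5.1f}% (cumulative {c:5.1f}%)")
```

An elbow is the point after which each extra dimension adds little. Reading it off is a judgment call, which is presumably why it is paired with the stability plots here.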

nmatthews323 commented 5 years ago

Hi Seb! I'm sorry, I didn't realise you had your new kid, congratulations!!!! Boy or girl?

I can chat tomorrow if that works, but also no pressure at all to work on this, these first few months are important!

I'll compute 3 clusters for comparison, should be pretty quick.

nmatthews323 commented 5 years ago

The main one looks very sparse; I think it's being determined by a relatively small number of factors (this one only selects "significant" factors): featureMatrix_gr3.pdf. This one looks more interesting, particularly cluster 2 with its association of miRNAs, DCL3 and AGO3 dependence, phasing and strand bias: featureMatrixb_gr3.pdf

I think I'd personally probably still go with 6 clusters, but we have a script from the Arabidopsis paper to compute the "cluster hierarchy", which gives us somewhere to put discussion of the relationship between the clusters... cluster_hierarchy.pdf
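A simple way to see how a 6-cluster and a 3-cluster solution relate is to cross-tabulate the two assignments. This is only a sketch and not the "cluster hierarchy" script from the Arabidopsis paper. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def cross_tabulate(labels_a, labels_b):
    """Contingency table of two cluster assignments over the same loci."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same loci")
    counts = Counter(zip(labels_a, labels_b))
    rows = sorted(set(labels_a))
    cols = sorted(set(labels_b))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


def dominant_parent(fine, coarse):
    """For each fine cluster: the coarse cluster holding most of its loci,
    and the fraction of the fine cluster that falls there."""
    rows, cols, table = cross_tabulate(fine, coarse)
    result = {}
    for r, row in zip(rows, table):
        best = max(range(len(cols)), key=row.__getitem__)
        result[r] = (cols[best], row[best] / sum(row))
    return result


if __name__ == "__main__":
    # Toy assignments for eight loci, purely illustrative.
    six = [1, 1, 2, 3, 4, 5, 6, 6]
    three = [1, 1, 1, 2, 2, 3, 3, 3]
    for fine_cluster, (parent, frac) in dominant_parent(six, three).items():
        print(f"cluster {fine_cluster} -> {parent} ({frac:.0%} of its loci)")
```

If each of the 6 clusters falls almost entirely within one of the 3, the finer solution is essentially a refinement of the coarser one. A cluster split across several parents would suggest the two solutions cut the data differently.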

seb-mueller commented 5 years ago

I wasn't quite sure who I'd told, so sorry for the lack of a heads-up ;) It's a girl, Frida, and all of us are doing just fine :)

Anyway, thanks for running this and I agree with you, I'd also go for the 6 cluster solution. Today is a bit tight for me, are you about in the afternoon (anytime) for a quick chat?

nmatthews323 commented 5 years ago

I'm really sorry I completely dropped off the radar; the week filled up and I completely forgot about us having a chat.

Are you free tomorrow (Monday) at all? I'm free 11am-2pm and 3pm-5pm.

seb-mueller commented 5 years ago

No worries at all. I'll call you around 4pm then, speak to you later!

nmatthews323 commented 5 years ago

Great, speak to you soon!

