grambank / grambank-analysed

3 stars 0 forks source link

tree plots of features with most phylo effect #60

Closed HedvigS closed 2 years ago

HedvigS commented 2 years ago

By request from @RustyGray I made tree plots of the 5 features that have the highest scores for phylo-effect in the dual INLA model. The tree plots feature ancestral state reconstruction, just "plain" ML-reconstruction (ape::ace( method = "ML", type = "discrete", model = "ARD"))

The script for doing this is here.

The trees have circa 1400 tips (missing value tips were pruned), I've put in one example below for GB090 which is the feature with the highest mean phylo effect in the dual model. I must admit that I was surprised to see the feature values so spread out and not mot clustered.

What do you think of showing these tree plots of the most phylo effected features @SamPassmore and @rdinnager ? I know that INLA doesn't do ASR technically and if so it's not the same as the ape::ace() function call I made for these plots. Is there some other better way of going about this if we do do it?

1 = purple 0 = orange

most_signal__GB090_S-ArgPfxV

QuentinAtkinson commented 2 years ago

I think these plots are good for us to visualise and check things, but I’m not a fan of putting them in the paper. They emphasise the single MCC tree, which includes massive uncertainty not captured in the plot, we aren’t actually doing ancestral state reconstruction, and given we have another paper that explicitly focusses on the tree, I’d rather not emphasise it here.

SamPassmore commented 2 years ago

Hi Hedvig,

There are lots of options when it comes to Ancestral State Reconstruction, I guess it depends on how intensively we want to estimate it. Bayestraits would be possibly the most intensive estimation approach. Or what you have done is probably the least intensive.

I guess the main question is what do we want to know or show with this plot? As it stands it is pretty hard to figure out what is going on. And I am not sure if there is much we can do to show a simple pattern using a tree of this size, regradless of the feature. If a tree is a necessary plot, it might be nicer to find a small clade that can exemplify whatever point you want to make about inheritance.

A side note: I thought there was another project which was looking at ancestral state reconstruction using features that were "more phylogenetic". Is something like this going to step on those toes?

SamPassmore commented 2 years ago

Oh I see Quentin replied while I was writing - I also agree with the point about tree uncertainty.

rdinnager commented 2 years ago

Even if this figure doesn't go in the paper it's still a good sanity check. But something doesn't seem right with this. I'd like to have a go at this if you don't mind? I can do it tomorrow and I'll read over my section of the manuscript then as well. And INLA can do ancestral reconstruction, without too much trouble, I can write some code to do that so at least the aces are based on the same model that the signal is calculated from.

On Tue, 5 July 2022, 6:49 pm SamPassmore, @.***> wrote:

Oh I see Quentin replied while I was writing - I also agree with the point about tree uncertainty.

— Reply to this email directly, view it on GitHub https://github.com/grambank/grambank-analysed/issues/60#issuecomment-1175577773, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNLQOOTZY7AR4WVPLU22G3VSS3ZBANCNFSM52W2NDWQ . You are receiving this because you were mentioned.Message ID: @.***>

SimonGreenhill commented 2 years ago

^ what @rdinnager said -- the pies are all identical, so there's something messed up in the plotting. I like the idea of making the model and the ace consistent.

HedvigS commented 2 years ago

Hello everyone,

I'll try and answer each matter in turn.

@SamPassmore said There are lots of options when it comes to Ancestral State Reconstruction, I guess it depends on how intensively we want to estimate it. Bayestraits would be possibly the most intensive estimation approach. Or what you have done is probably the least intensive.

Indeed. Originally this script was used to visualise the way the caper::phylo.d() estimate of phylo signal is done and therefore it was with Felsenstein's least squares ("pic" in ape). Russell wanted ML here so I shifted to that. We could do SCM or other approaches too, but since we're not putting a lot to weight on it and we're in a hurry I just went with what Russell wanted. If we can, I'd like to avoid using BayesTraits for this project.

@QuentinAtkinson said: I think these plots are good for us to visualise and check things, but I’m not a fan of putting them in the paper. They emphasise the single MCC tree, which includes massive uncertainty not captured in the plot, we aren’t actually doing ancestral state reconstruction, and given we have another paper that explicitly focusses on the tree, I’d rather not emphasise it here.

Understood, noted. Yes, indeed if we wanted to do ASR properly we should at least sample the posterior trees. This might blow up beyond the scope of this paper and unnecessarily step on toes. I think they're my toes right though? I'm the only person doing ASR with grambank features right now, right? I'm okay with this, totally. The question is different enough from what I'm doing :)

@rdinnager said: Even if this figure doesn't go in the paper it's still a good sanity check. But something doesn't seem right with this. I'd like to have a go at this if you don't mind? I can do it tomorrow and I'll read over my section of the manuscript then as well

Yes sure! I also thought it was very odd, I was quite confused myself. I originally had this code set up for the phylo.d viz so using pic instead of ML. Something seems to be odd, I'm also confused at the pies.

I remember us talking about using the ancestral states that INLA already predicts, that would indeed be even more elegant! I'd love that. I don't think that we currently save the necessary information in this script for us to work that out, , right? So if you modify either INLA_all_models.R or strip_inla() to get at the asr that would be super cool.

Before doing that, perhaps we should first make sure we're all on the same page about doing ASR at all. As Quentin pointed out, it'd be better to do on the posteriors and if we want to do it proper then... it will take a bit of time to get all of our ducks in line.

HedvigS commented 2 years ago

^ what @rdinnager said -- the pies are all identical, so there's something messed up in the plotting. I like the idea of making the model and the ace consistent.

I think something went wrong in the way I had written the plot script when I shifted from pic to ml.

HedvigS commented 2 years ago

I'm going to make one change in INLA_all_models.R so that it just runs the dual, it'll be quicker for sure :)

HedvigS commented 2 years ago

Okay done. There is now a script called "INLA_dual_model.R" which works exactly like the other one with the CLI arguments etc but it only runs the dual model. I've updated the makefile accordingly as well.

rdinnager commented 2 years ago

Hi @HedvigS . Since we only want to do this for a few features, I think it is easier to just rerun the model for those features. We can use a warm restart and use the posterior means for the hyperparameters as starting values, so it should converge pretty quickly. Getting the information for ancestral reconstruction would require changing the formula and the data structure, so it doesn't seem worth it to redo the model for every feature, at least for the time being.

rdinnager commented 2 years ago

So I've taken a look at the script for this, and the weirdness can be explained simply by the ordering of the tips and the data getting out of sync, an issue I encounter all the time. I've fixed that and I just rerun with the same method of ASR. The top five are below. I'm convinced these are indeed features with pretty high phylogenetic signal based on these. However, if we want to include any of these in the paper (even in the Supplement), I think we should use INLA to calculate the ancestral states estimates, which I would be happy to do. But if we just want to have a sanity check, then I'm already satisfied. But if anyone thinks we should go further at this point, let us know.

most_signal__GB081_VInfix

most_signal__GB090_S-ArgPfxV

most_signal__GB092_A-ArgPfxV

most_signal__GB198_NUMGender

most_signal__GB431_POSSPfxPosd

And just for comparison, below are the two lowest phylogenetic signal features:

most_signal__GB166_PaucalBound

most_signal__GB319_TrialFree

HedvigS commented 2 years ago

Thank you @rdinnager ! That makes so much sense. i'm sorry and grateful :)

HedvigS commented 2 years ago

I'll leave it to @QuentinAtkinson and @RustyGray to decide if we want these plots in the paper. If we do, we have to re-run the dual INLA on 113 features and use the INLA asr I think, I agree with @rdinnager there.

HedvigS commented 2 years ago

In the meantime could I get the code where you've corrected my plot please @rdinnager :)

HedvigS commented 2 years ago

Thank you @rdinnager for the code in #61 ! Much appreciated.

What do we need to save from inla-objects to do the asr?

rdinnager commented 2 years ago

Thank you @rdinnager for the code in #61 ! Much appreciated.

What do we need to save from inla-objects to do the asr?

We would need to change the fitting setup as well. We would need to add some rows to the dataset, one for each ancestral node. These would have NA values instead of zeroes or ones. Then make sure that we use control.predictor=list(compute=TRUE) in the inla() call. After that the predictions for the ancestral nodes will be in fit$summary.fitted.values.

HedvigS commented 2 years ago

Thank you @rdinnager for the code in #61 ! Much appreciated. What do we need to save from inla-objects to do the asr?

We would need to change the fitting setup as well. We would need to add some rows to the dataset, one for each ancestral node. These would have NA values instead of zeroes or ones. Then make sure that we use control.predictor=list(compute=TRUE) in the inla() call. After that the predictions for the ancestral nodes will be in fit$summary.fitted.values.

Okay, sounds feasible. Maybe we should do it for the dual and trial models only?

QuentinAtkinson commented 2 years ago

The new plots look reassuring. Interesting that the low phylo signal traits have very biased distributions of 0's vs 1's. Might warrant a plot of relative proportion of 1's against phylo signal just to check that isn't all we're recovering...which I doubt. Good to check all this but I still worry about us focusing in the paper on ancestral state reconstructiion on the phylogeny at all - it really draws attention to specific relationships in the tree and specific trait history, when the paper is mostly about broad patterns. Could get us bogged down with reviewers.

HedvigS commented 2 years ago

The new plots look reassuring. Interesting that the low phylo signal traits have very biased distributions of 0's vs 1's. Might warrant a plot of relative proportion of 1's against phylo signal just to check that isn't all we're recovering...which I doubt.

I was thinking the same thing. I'll make a scatterplot matrix if for nothing else just for us to look at.

HedvigS commented 2 years ago

I've made scatterplot belows (just for internal perusing, not for publication). It shows the proportion of 1s per feature compared to the mean_phylogenetic and mean_spatial effects. I became a little bit concerend as I saw that the feature with the most spatial effect was GB319 which is one of the features with the fewest presences. However, it appears the overall pattern is different.

Surprising, but relieving. We could compare more than the means if someone wants that.

Code is here. splom_dual

HedvigS commented 2 years ago

I've inserted tree plots for the Grambank features with the highest phylo effect and maps for the three features with the highest spatial effect.

For now the ASR is ML, but I'd be very excited to explore INLA ASR with @rdinnager if he has time for that :D

QuentinAtkinson commented 2 years ago

This is interesting. No major concerns. Phylo signal is clearly lower for very rare (in terms of 1's but not 0's) traits. This may be due to Grambank feature selection allowing some quite rare traits only if they are not isolated to a particular lineage - this could create an artefact where very rare traits never have high phylo signal. I don't see this as a major issue, it just describes the data we have created.

HedvigS commented 2 years ago

I've made scatterplot belows (just for internal perusing, not for publication). It shows the proportion of 1s per feature compared to the mean_phylogenetic and mean_spatial effects. I became a little bit concerend as I saw that the feature with the most spatial effect was GB319 which is one of the features with the fewest presences. However, it appears the overall pattern is different.

Surprising, but relieving. We could compare more than the means if someone wants that.

Code is here. splom_dual

@RustyGray do you want this plot in the ms supplementary?

QuentinAtkinson commented 2 years ago

Hedvig, I think this was really just a sanity check and it probably doesn't need to go in the paper.

HedvigS commented 2 years ago

@QuentinAtkinson yes I agree :)

HedvigS commented 2 years ago

I'm refining the tree plots a bit further. We're adding some clade labels to help orient the reader inside the tree.

I decided to add labels for: Otomanguean, UA, IE, Austroasiatic, Tibeto-Burman, Austronesian, Atlantic-Congo and Afroasiatic.

I'm having some techincaly hiccups with the image generator though, sometimes not all labels show up. In this PNG for example, the clade label for austronesian is missing.

most_signal_phylo_1_GB090_S-ArgPfxV

rdinnager commented 2 years ago

I like the labels. But if this is going in the paper, can we not use pie charts? Instead, we can colour the edges according to the reconstructed probability of the subtending node. I can help with that if you agree? I can't stand pie charts, and for only two states they are unnecessary, and they are all overlapping anyway. Just colouring the edges would also take the attention away from the nodes and thus avoid emphasizing ancestral reconstruction, which is really just being done here for visualisation purposes. Also, I am happy to work on the INLA asr for this, but I won't have time until next week.

HedvigS commented 2 years ago

I fixed the problem with the non-occurring clade labels.

sure, we can do that. how would that work with different proportions?

HedvigS commented 2 years ago

Okay, let's stick with the ML asr now and we can think about INLA asr next week :)

HedvigS commented 2 years ago

I've just finished fiddling with clade labels etc so I'm leaving this script alone for today

RustyGray commented 2 years ago

Yes, agreed. Thanks, R.

Russell Gray, FRSNZ Director, Max Planck Institute for Evolutionary Anthropology Head of the Department of Linguistic and Cultural Evolution Departmental administrator: @.*** https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/russell-gray/ https://scholar.google.com/citations?hl=en&user=sksPd1cAAAAJ

On 8. Jul 2022, at 17:44, Russell Dinnage @.***> wrote:

I like the labels. But if this is going in the paper, can we not use pie charts? Instead, we can colour the edges according to the reconstructed probability of the subtending node. I can help with that if you agree? I can't stand pie charts, and for only two states they are unnecessary, and they are all overlapping anyway. Just colouring the edges would also take the attention away from the nodes and thus avoid emphasizing ancestral reconstruction, which is really just being done here for visualisation purposes. Also, I am happy to work on the INLA asr for this, but I won't have time until next week.

— Reply to this email directly, view it on GitHub https://github.com/grambank/grambank-analysed/issues/60#issuecomment-1179131762, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEETOPFD3P3BUGW7WAH2GG3VTBEF5ANCNFSM52W2NDWQ. You are receiving this because you were mentioned.

HedvigS commented 2 years ago

My understanding: Russell is positive to edge colouring instead of pie charts and is also approving of doing INLA ASR next week.

RustyGray commented 2 years ago

Yes

On 8. Jul 2022, at 18:23, Hedvig Skirgård @.***> wrote:

My understanding: Russell is positive to edge colouring instead of pie charts and is also approving of doing INLA ASR next week.

— Reply to this email directly, view it on GitHub https://github.com/grambank/grambank-analysed/issues/60#issuecomment-1179166317, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEETOPAJQIGGVLCDEOHTJJTVTBIZRANCNFSM52W2NDWQ. You are receiving this because you were mentioned.

rdinnager commented 2 years ago

Okay, I am going to work on this today. Just a few caveats to keep in mind. The INLA ASR is essentially predictions made at the internal nodes. When doing this, it doesn't really make sense to not also make predictions at the terminal nodes. This then also makes it a plot that can show how well the data fits, if we also plot the observed data. This I think is a good thing, but there is an issue with the dual model. If we use the dual model to do the predictions (which makes sense since we are showing the top phylogenetic features in the dual model), then our prediction will be conditional on the spatial effects. This means if the spatial effect is reasonably strong, it is possible the fit of the observed data to the predictions will be affected. I am going try with the dual model and see how it looks, but we might also consider using the phylo only model for the ASR, just for visualization.

HedvigS commented 2 years ago

@rdinnager I understand.

I think that's better though, that we use the INLA ASR from the dual model, since those are also the main results we're presenting. I understand what you mean that for the viz it's a bit funny if there's the dual model behind it but a tree plot, but I think it's worth it for the "elegance" of using the same approach throughout.

HedvigS commented 2 years ago

By the way, I changed some things around in our INLA runs so that now we're running the dual and trial model only and they can be run with one out of 3 different spatial precision matrices. This is the relevant script.

If you call it

Rscript spatiophylogenetic_modelling/analysis/INLA_multi_models.R "real" 1 113

then it runs on the 113 real features (not simulated) and with kappa_2_sigma_1.15 precision matrix. if you want another precision matrix that can be the 4th CLI argument.

HedvigS commented 2 years ago

Also by the way, @rdinnager . right now I've changed the plotting script so that it writes out the entire feature "name" (i.e. "question") instead of just the abbreviation. Unfortunately, I've run into a problem with the plotting margins, as you can see below the margins currently cut off the top part of the title.

I haven't yet solved this and I wanted to make you aware of this as you're tweaking the plotting script etc so you don't get surprised when they are spit out like this. It should be an easy fix, I just haven't found it yet

most_signal_phylo_2_GB092

rdinnager commented 2 years ago

most_signal_phylo_2_GB090

Okay, so here is what I have come up with for a new plot. The edges are coloured according to the nodes as reconstructed by inla (where the predictions are on the probability scale, that is, the probability of a 1 for the feature). I've just used the classic viridis colourscale but this could easily be tweaked (for example the {scico} package has a lot of nice colour-blind friendly scientific palettes). I've plotted the observed feature in a rather simple way, we just have a single solid black circle if the language is 1 for the feature, otherwise it is blank, and so the zero is inferred from absence. I have also plotted missing data as grey circles, since inla also made predictions for these missing tips while I was doing the ASR, so we might as well include that too I thought. Anything else can be tweaked too of course. Let me know what you think of this way of plotting, and any requested changes.

@HedvigS I will send a PR with this code, so that you can tweak the plots if I am unavailable, and so you can run it for all the desired features (I've only run it on the one, because the prediction procedure in INLA is pretty slow for this model).

BTW the legend doubles as a scale for the branch-lengths, its length is equal to 75 units of branch-length. This can also be changed to any desired length. I don't know what the branch-length units represent on this tree, but we could add the units to the legend label too if we want.

HedvigS commented 2 years ago

hi @rdinnager !

What do you think of regging the code such that we don't need to run INLA again, but that what we spit out the first time we run INLA can go into building the ASR plots?

rdinnager commented 2 years ago

That would make sense. I'm happy to do it but my daughter is out of school this week, so I don't have a lot of time. But next week I should have plenty of time! If you are in a hurry, you could base the modifications off of what I've done here: https://github.com/grambank/grambank-analysed/blob/bef714dce365f17cf5bd85ee8867d81358c118f9/R_grambank/spatiophylogenetic_modelling/analysis/featurewise/INLA_results_plots_ASR_maps.R#L147-L189

HedvigS commented 2 years ago

@rdinnager understood!

I changed strip_inla before to save the summary fitted values. So I guess what's needed now is just reading that into the asr plot script, right? I think I can do that :D

rdinnager commented 2 years ago

The model fitting code will also have to be changed to add missing data for ancestral nodes (lines 153-162 above), so that INLA will make predictions for them. Also, make sure this argument is in there: control.predictor=list(link = 1, compute = TRUE), otherwise the predictions won't be computed.

HedvigS commented 2 years ago

The model fitting code will also have to be changed to add missing data for ancestral nodes (lines 153-162 above), so that INLA will make predictions for them. Also, make sure this argument is in there: control.predictor=list(link = 1, compute = TRUE), otherwise the predictions won't be computed.

Ah, true. Good point. Is this going to be slowing down the process significantly? Because then it might be better to just do a small subset like you had set up already :)

rdinnager commented 2 years ago

Yes, it will slow things down considerably, so definitely more efficient to just rerun the models on a smaller subset. So shall we leave the code as is then?

HedvigS commented 2 years ago

Yes, it will slow things down considerably, so definitely more efficient to just rerun the models on a smaller subset. So shall we leave the code as is then?

Okay, yes let's do that. Thanks for clarifying.

I changed around a bit in the qs files, right now the script that runs all the analysis only runs the dual and trial models. This means that currently the ASR plot script pulls out the wrong things from the qs-objects. I started rewriting it to address that, and I found myself making a for-loop and then I thought... perhaps I should not.. make things more complicated in your script ^^.

Here's a nextcloud folder with the output from the main analysis . We're interested in the ones that have kappa_2_sigma_1.15_pcprior0.1, this is the main spatial params settings we're using and the default prior.

Would you have time this week to help rejig the script so that the lapplys pull out the right objects?

HedvigS commented 2 years ago

I think I fixed it... running now :D

HedvigS commented 2 years ago

@rdinnager I've done some more work on this today. It's all running well, thanks for the code you wrote. I've split the script up so that the thing that takes time (running the INLA with ASR) outputs some files that other scripts can access. Otherwise as soon as I want to change one parameter of the plot I have to wait... a long time :D

HedvigS commented 2 years ago

While I'm waiting for the results from my most recent runs to make these ASR-plots ^^, can I ask you something @rdinnager and @SamPassmore ?

The way we're estimating ancestral states here, it's using the dual process model, i.e. both spatial and phylogenetic data on relatedness. What... does that mean for the ancestral states.. . What @rdinnager has done is insert empty slots for the ancestral nodes in the tree right? But what spatial information does the model get for them? It's nothing, right?

SamPassmore commented 2 years ago

I would weight RD's opinion over my own, but my understanding is that the ancestral state reconstructions work in the same way as the missing values in extant languages within the response.

In that case the missing data is imputed from the models posterior predictive distribution. That is to say, we have a data generating process in the form of the spatial and phylogenetic relationships, with their posterior parameters distributions estimated from the observed data. Given what we know the spatial and phylogenetic relationships of the missing data (i.e. ancestral nodes or otherwise) we can generate an estimate for the missing data from the posterior predictive distribution.

We don't know the spatial locations of the ancestral nodes, however. So I am not sure what happens there - possibly these are also imputed?

But to answer the question more directly: Yes, the ancestral nodes estimates are influenced by the spatial effects in the model. You can see RD mentioning that here: https://github.com/grambank/grambank-analysed/issues/60#issuecomment-1181692737

HedvigS commented 2 years ago

Yes that makes sense. Anyway, I think it's really neat and if you two do want to do a paper on language data and INLA in future I think a lot of linguists would be really excited by this!!