rcedgar / muscle

Multiple sequence and structure alignment with top benchmark scores scalable to thousands of sequences. Generates replicate alignments, enabling assessment of downstream analyses such as trees and predicted structures.
https://drive5.com/muscle
GNU General Public License v3.0

Resampling Clarification #58

Closed mcale6 closed 1 year ago

mcale6 commented 1 year ago

I get a MSA with a diversified ensemble. The output is in .efa format, which I want to resample from.

However the column width of the resampled MSA is different from the length of the ensemble MSAs. Why?

I tried varying the parameters (cc, gap fraction), but the resampled MSA always has fewer residues (about half as many residues as the average sequence length in the MSA).

rcedgar commented 1 year ago

I don't understand what you mean by "column width" or "length" of an MSA. In general, MSAs in an ensemble will have different numbers of columns because the gap placement varies. In the simplest case where there are no gaps anywhere, all MSAs in the ensemble will be identical and (obviously) will therefore have the same number of columns (call this L). In this case, the resampled MSA will also have L columns, but will differ from the (identical) ensemble MSAs because the column order is randomized, with some columns appearing twice and others not appearing at all.

In more typical cases where ensembles are useful, there are lots of gaps. Here, the resampling process starts by discarding all columns with >50% gaps, because the user probably wants to do tree estimation, and most tree estimation algorithms don't handle gaps very well. The number of columns in the resampled MSA is then set to U = |D|, where D is the set of distinct columns. Two columns are the same if they contain the same letters from the same positions in each sequence; otherwise two columns are distinct, even if they contain the same letter from each sequence. The resampled MSA is constructed by selecting columns from D at random, with replacement, U times.

mcale6 commented 1 year ago

Thank you.

I was thinking that one samples columns from the ensemble with a fixed length L (= max number of columns over the ensemble MSAs), where at each column position y one column is randomly chosen from among the columns at position y across the ensemble.

However, this would not capture the variation in the input data and would not account for the differing numbers of columns across the ensemble. Also, I would lose information about conserved columns.

rcedgar commented 1 year ago

Sorry, my description was not correct. The sampling from D is weighted by the frequency of the column in the ensemble, with the net result that it works in a similar way as you suggest it should (except for discarding gappy columns first). I use K = median number of columns with CC >= 0.5; your L = max number of columns would also be reasonable. You can change the column quality threshold using the -minconf option, e.g. -minconf 0.5 sets the default. The idea behind this design is to generate a resampled ensemble which works well with typical tree software without further processing.

mcale6 commented 1 year ago

I see, thanks for clarification!

Just to be sure: the column at position y in the resampled MSA comes from position y of one of the ensemble MSAs (where a column with higher cc is more likely to be sampled). Position y cannot be sampled again, but the whole MSA (without position y) can?

rcedgar commented 1 year ago

No, the re-sampling is with replacement, as with Felsenstein bootstrap, for the same reason -- this is standard in bootstrap estimation. Unfortunately, my original explanation above was confusing/wrong, because I forgot how it worked and looked at the code without thinking very carefully about it. Here is (hopefully) a better explanation.

Input: All MSAs in the ensemble. Output: One re-sampled MSA.

  1. Discard gappy columns.

  2. Build the output MSA by selecting K columns from the remaining columns, with uniform probability, with replacement.

This has the effect of preferentially sampling higher-confidence columns, because these occur more often in the ensemble. Hence, the mean accuracy of columns in the re-sampled MSA will tend to be higher than in Felsenstein bootstrap samples of any single MSA.
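To see why uniform sampling from the pooled ensemble columns is equivalent to frequency-weighted sampling of distinct columns, here is a small sketch with a made-up toy ensemble (the column labels and counts are purely illustrative, not from MUSCLE):

```python
import random
from collections import Counter

# Hypothetical ensemble of 4 tiny MSAs, each listed as its columns.
# "A" is a column that appears in every MSA (high confidence); "D" and
# "E" each appear in only one MSA (low confidence).
ensemble = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["A", "B", "C"],
    ["A", "E", "C"],
]

# Pool all columns, then sample uniformly with replacement.
pool = [col for msa in ensemble for col in msa]

random.seed(0)
draws = Counter(random.choice(pool) for _ in range(100_000))

# "A" occurs 4x as often as "D" in the pool, so it is drawn roughly
# 4x as often -- uniform sampling of the pool IS frequency weighting.
print(draws["A"] / draws["D"])  # approximately 4
```

So no explicit weighting step is needed; the ensemble frequencies do the weighting automatically.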

The value of K is somewhat arbitrary; I use the number of columns in the MSA with the highest overall confidence.

Larger K should give more accurate results in downstream analysis, at the expense of larger input data which may increase the computational cost.

This procedure is actually very simple; you could implement it yourself in a few lines of Python with an ensemble as input and hack it any way you prefer. The C++ source code is much more complicated because it attempts to optimize the sampling to be very fast, for high-throughput applications.
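The two steps above can be sketched in a few lines of Python. This is a hedged illustration of the procedure as described in this thread, not the MUSCLE implementation; in particular, the default for K here (mean number of surviving columns per ensemble MSA) is a stand-in for the confidence-based choice described above, and `resample_msa` is a hypothetical helper name:

```python
import random

def resample_msa(ensemble, max_gap_frac=0.5, k=None, seed=None):
    """Sketch of ensemble column resampling (not the MUSCLE source).

    ensemble: list of MSAs; each MSA is a list of aligned sequences
    (strings of equal length within one MSA). Returns one resampled
    MSA as a list of strings.
    """
    rng = random.Random(seed)
    pool = []
    for msa in ensemble:
        nseq = len(msa)
        ncols = len(msa[0])
        for i in range(ncols):
            col = tuple(seq[i] for seq in msa)
            # Step 1: discard gappy columns (> max_gap_frac gaps).
            if col.count("-") / nseq <= max_gap_frac:
                pool.append(col)
    if k is None:
        # Stand-in for K; MUSCLE uses a confidence-based choice instead.
        k = round(len(pool) / len(ensemble))
    # Step 2: select K columns uniformly, with replacement, from the
    # pooled ensemble columns (this implicitly weights by frequency).
    cols = [rng.choice(pool) for _ in range(k)]
    # Transpose the chosen columns back into row sequences.
    return ["".join(col[s] for col in cols) for s in range(len(ensemble[0]))]

# Toy ensemble: two alternative alignments of the same three sequences.
ens = [
    ["AC-GT", "ACAGT", "AC-GT"],
    ["ACG-T", "ACGAT", "ACG-T"],
]
msa = resample_msa(ens, seed=1)
print(len(msa), len(msa[0]))  # 3 sequences; width K = 4 here
```

In this toy example, the two columns containing 2/3 gaps are discarded in step 1, leaving a pool of 8 columns, so the default K is 4 and the resampled MSA is gap-free, which is the property the thread says matters for tree estimation software.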