novigit / broCode

self made scripts
3 stars 2 forks source link

alignment_pruner.pl --chi2_prune parameter #1

Open skeiii opened 3 days ago

skeiii commented 3 days ago

Hi, I have been constructing an Alphaproteobacteria tree recently and using the alignment_pruner.pl to trim the 20% most heterogeneity sites.

According to the script,

Use the chi2 statistic to choose columns to prune. This will first order
all of the columns by the chi2 statistic by comparing the chi2
statistic for the alignment with and without each column. Then the option
specifies how to remove columns:

    half    Remove half of the sites (starting with the most biased).
    f#      Remove sites until # fraction of sites remains, half can be
            specified as f0.5
    n#      Remove sites until only # number of sequences show significant bias.
    min     Remove sites until a minimum of sequences show significant bias.
    plot    Will print statistics to the screen suitable for plotting. It will
            contain 4 columns: idx, number of biased sequences, chi2 delta for
            this column and the names of the biased sequences.

And therefore, I use: alignment_pruner_broCode.pl --file input.faa --chi2_prune f0.8 > pruned20_Alpa117.fasta

However, as compared to my original input file, instead of remain 80% of the sites, it only left with 20%. Original file: 18823 f0.8: 3765

I am wondering, am I using the wrong command for trimming the most compositional heterogeneity sites, or the description of the text in f# is not correct. To trim 20% of the sites, it should be f0.2.

novigit commented 2 days ago

Hi! You're absolutely right, it seems like the description was wrong. It seems to actually remove the 80% most heterogeneous sites. Like you deduced, to get rid of the 20% most heterogeneous sites, you need to run with f0.2.

I was not the original author of this script though, and feel weary of changing it without his knowledge. It's also in Perl, which is not really my fortee.

You can also check whether the total CHI2 score has reduced by applying ./alignment_pruner.pl --file input.faa --chi2_test | grep TOTAL on the original alignment and the pruned alignment. It should have been reduced.

On another note, you may also be interested in this script, if you want to study compositional bias problems: https://github.com/Dalhousie-ICG/icg-shared-scripts/blob/main/visualize_compositional_bias.py

It is able to quantify per taxa which amino acids are enriched or depleted relative to the mean, and visualize it neatly in a bar chart, that is aligned to a phylogenetic tree.