moshi4 / pyMSAviz

MSA (Multiple Sequence Alignment) visualization Python package for sequence analysis
https://moshi4.github.io/pyMSAviz
MIT License
80 stars, 14 forks

msa without gaps #1

Closed: cx994 closed this issue 1 year ago

cx994 commented 1 year ago

I think it is necessary to provide a function to draw msa without gaps

moshi4 commented 1 year ago

MSA without gaps cannot be called MSA because it lacks alignment information, right? As the package name suggests, pyMSAviz is a tool to visualize MSA, so I do not plan to implement any function to handle non-MSAs.

Sorry if I have misunderstood the meaning of your proposal.

cx994 commented 1 year ago

Sorry, I may not have made it clear~ As shown in the figure below, some sites contain gaps in every amino acid sequence.

[Image: Snipaste_2022-11-16_21-31-56]

So is it possible to omit these sites but keep the position information to get a more concise MSA visualization? I think it would preserve the valid information and reduce drawing time.

moshi4 commented 1 year ago

Are you saying that if there is a gap-only position in the MSA, you want to determine that position as unnecessary and exclude it from the visualization?

Personally, I don't quite understand the effectiveness of the proposed functionality, as it seems to me that there are very few cases (or there shouldn't be any) where a gap-only position is included in the alignment results.

Could you please tell me the following to help me understand?

If I have misunderstood something, I am sorry.

cx994 commented 1 year ago

moshi4 commented 1 year ago

I have spent some time thinking about how to handle this issue.

Excluding gap-only positions one by one is not realistic, because it would also shift the xticklabels and the plot would no longer represent the alignment properly. Personally, I think it would be reasonable to add an option that automatically excludes regions containing only gaps from the visualization, on a per MSA wrap block basis.

Below is a visualization demo using an experimental implementation (it adds an ignore_all_gaps option).

from pymsaviz import MsaViz
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

gap_num = 50
test_msa = MultipleSeqAlignment(
    [
        SeqRecord(Seq("M-AT----ALLCRGRI" + "-" * gap_num + "AITFR---RGRI--"), id="01"),
        SeqRecord(Seq("M-TI-------TRGVI" + "-" * gap_num + "AITFR---RGRI--"), id="02"),
    ]
)
mv = MsaViz(test_msa, wrap_length=30, show_grid=True)
mv.set_plot_params(ignore_all_gaps=True, ticks_interval=5) # <= Newly added!!
fig = mv.plotfig()

Option: ignore_all_gaps=False => Gap-only MSA wrap blocks exist

[Image: ignore_gaps_false]

Option: ignore_all_gaps=True => No gap-only MSA wrap blocks

[Image: ignore_gaps_true]
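For readers who want to see the wrap-block idea in isolation, here is a minimal sketch in plain Python. It is not the pyMSAviz implementation: the function name gap_only_blocks and its parameters are hypothetical, and it only reports which wrap blocks consist entirely of gaps, using the same test sequences as the demo above.

```python
def gap_only_blocks(seqs, wrap_length, gap_chars="-"):
    """Return 0-based, half-open (start, end) ranges of gap-only wrap blocks.

    Hypothetical helper for illustration only; not part of pyMSAviz.
    """
    if not seqs:
        return []
    aln_len = len(seqs[0])
    if any(len(s) != aln_len for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    blocks = []
    for start in range(0, aln_len, wrap_length):
        end = min(start + wrap_length, aln_len)
        # A block is gap-only when every character of every sequence is a gap.
        if all(c in gap_chars for s in seqs for c in s[start:end]):
            blocks.append((start, end))
    return blocks


gap_num = 50
seqs = [
    "M-AT----ALLCRGRI" + "-" * gap_num + "AITFR---RGRI--",
    "M-TI-------TRGVI" + "-" * gap_num + "AITFR---RGRI--",
]
skipped = gap_only_blocks(seqs, wrap_length=30)
print(skipped)  # [(30, 60)]
```

With wrap_length=30 the 80-column test alignment splits into three blocks, and only the middle one is gap-only. Skipping whole blocks leaves the position labels of the remaining blocks unchanged, which is the reason for working per block rather than per column.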

I think this is a realistic and easy implementation. What do you think?

Also, this is just a question out of personal interest, but which situations or tools generate a sparse MSA? I don't see it with common multiple alignment tools like MUSCLE or MAFFT, so could you tell me for reference?

cx994 commented 1 year ago

Oh, great! I think it will solve my problem to some extent. I've tried to exclude gap-only positions one by one, but found it really cumbersome if I want to keep the true xticklabels~

Besides, I don't quite understand why there are sparse MSA results. But in the results downloaded from the database below, most of the MSA files are sparse!

TreeFam database

All in all, thank you for your kind help! I will continue to think about how to solve this problem in my spare time :)
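The column-by-column exclusion mentioned in this comment can be sketched as follows. This is only an illustration in plain Python, assuming the alignment is available as a list of equal-length strings. The function name drop_gap_only_columns is hypothetical, and the returned position list is one possible way to keep the original 1-based site numbers for use as tick labels.

```python
def drop_gap_only_columns(seqs, gap_chars="-"):
    """Remove columns that are gaps in every sequence.

    Returns the trimmed sequences and the original 1-based positions of
    the columns that were kept. Hypothetical helper, not part of pyMSAviz.
    """
    if not seqs:
        return [], []
    aln_len = len(seqs[0])
    if any(len(s) != aln_len for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    # Keep a column if at least one sequence has a non-gap character there.
    kept = [i for i in range(aln_len) if any(s[i] not in gap_chars for s in seqs)]
    trimmed = ["".join(s[i] for i in kept) for s in seqs]
    positions = [i + 1 for i in kept]
    return trimmed, positions


trimmed, positions = drop_gap_only_columns(["MA--T-", "M---TI"])
print(trimmed)    # ['MAT-', 'M-TI']
print(positions)  # [1, 2, 5, 6]
```

The positions list shows why this approach is awkward to plot: the labels are no longer evenly spaced, so every kept column needs its own label lookup.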

cx994 commented 1 year ago

To add, I think there is a convenient way:

moshi4 commented 1 year ago

I did some checking on TreeFam.

Your MSA is based on extracting some data from the MSA of 400 TRK genes, correct? If so, it is not surprising that gap-only positions are included.

If you are interested only in the extracted gene sequences, I suggest you remove the gaps from the extracted sequences yourself and align them again with mafft or muscle. You will get more accurate alignment results that way. If you don't necessarily need to rely on the TreeFam alignment results, it seems to me that people generally process their data that way. Also, if you do that, you will not have the problem you presented here.
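As a rough sketch of the first step suggested here, the snippet below strips gap characters from extracted aligned sequences and produces unaligned FASTA text that can be handed to an aligner. The function name to_unaligned_fasta and the (name, sequence) input format are assumptions made for illustration.

```python
def to_unaligned_fasta(records, gap_chars="-."):
    """Strip gap characters and return FASTA-formatted text.

    `records` is an iterable of (name, aligned_sequence) pairs.
    Sequences that contain only gaps are skipped.
    Hypothetical helper for illustration only.
    """
    lines = []
    for name, aligned in records:
        residues = "".join(c for c in aligned if c not in gap_chars)
        if residues:
            lines.append(f">{name}")
            lines.append(residues)
    return "".join(f"{line}\n" for line in lines)


records = [
    ("01", "M-AT----ALLCRGRI"),
    ("02", "M-TI-------TRGVI"),
]
fasta = to_unaligned_fasta(records)
print(fasta)
```

The resulting text can be written to a file and realigned, for example with a command along the lines of `mafft --auto unaligned.fasta > realigned.fasta`. The file names are placeholders, and the options best suited to a given dataset should be checked against the aligner's own documentation.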

These are my personal opinions. It may be superfluous, but I hope it will be helpful.

moshi4 commented 1 year ago

Gap-only sites are essentially never produced in an MSA in normal operation. Even if a gap-only site were to exist for some reason, it would not be meaningful for data analysis and should be removed in a preprocessing step before visualization.

Therefore, I do not plan to implement processing for gap-only sites in pyMSAviz.