j3xugit / RaptorX-3DModeling

Note that the current version does not include a search of very large metagenome data. For some proteins, metagenome data is important. We will update this as soon as possible.
GNU General Public License v3.0
99 stars 27 forks

Meff Calculation for a3m format #14

Open Maryam-Haghani opened 1 year ago

Maryam-Haghani commented 1 year ago

Hi!

I am confused about how your code calculates Meff for files in the a3m format, where 'gaps aligned to inserts' are omitted. Specifically, the code appears to treat matches (uppercase characters) and inserts (lowercase characters) the same way, which results in a higher Meff value for the file.

To illustrate the issue, consider the following example using two sequences in the a3m format:

Sequence 1: HCTTKFCDYKAAGAEEYAQQEVVKRSYGKAFKLSISALFVTPKTAGAQVV
Sequence 2: HCTTKFCDYgKAAGAEEYAQQEVVKRSYGKAFKLSISALFVTPKTAGAQVV

At position 10 of the second sequence, there is a lowercase 'g', which is an insert. Because no 'gap aligned to insert' is added at the corresponding position of the first sequence, every subsequent residue of the second sequence is shifted one position to the right relative to the first. A position-by-position comparison then counts these shifted residues as dissimilar, even though the two sequences are identical apart from the insert. As a result, the number of dissimilarities increases, leading to an inflated Meff value for the MSA file.
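
To make this concrete, here is a minimal Python sketch of the effect. It is not the repository's code. The helper names (`strip_inserts`, `identity`, `meff`) are hypothetical, and the Meff definition is an assumption on my side: a commonly used convention in which each sequence is weighted by 1/n, where n is the number of sequences sharing at least 80% identity with it. The sketch compares the two sequences above position by position, first as they appear in the file and then after removing the lowercase insert.

```python
# Illustrative sketch only; names and the 0.8 threshold are assumptions,
# not taken from RaptorX-3DModeling.

SEQ1 = "HCTTKFCDYKAAGAEEYAQQEVVKRSYGKAFKLSISALFVTPKTAGAQVV"
SEQ2 = "HCTTKFCDYgKAAGAEEYAQQEVVKRSYGKAFKLSISALFVTPKTAGAQVV"


def strip_inserts(row):
    """Drop lowercase insert characters so only match columns remain."""
    return "".join(c for c in row if not c.islower())


def identity(a, b):
    """Fraction of identical positions, compared index by index
    over the length of the shorter row."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def meff(rows, threshold=0.8):
    """Assumed Meff definition: sum over rows of 1 / n_i, where n_i is the
    number of rows (including the row itself) with identity >= threshold."""
    total = 0.0
    for a in rows:
        n_similar = sum(identity(a, b) >= threshold for b in rows)
        total += 1.0 / n_similar
    return total


raw = [SEQ1, SEQ2]                           # inserts treated like matches
cleaned = [strip_inserts(r) for r in raw]    # inserts removed before comparing

print("identity (raw):    ", identity(*raw))
print("identity (cleaned):", identity(*cleaned))
print("Meff (raw):    ", meff(raw))       # 2.0: the rows look unrelated
print("Meff (cleaned):", meff(cleaned))   # 1.0: the rows are redundant
```

Under these assumptions, the raw comparison falls well below the 80% threshold, so the two sequences are counted as two independent sequences. After removing the insert they are identical and together contribute a Meff of 1.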

Could you kindly explain the rationale behind this?