refresh-bio / FAMSA

Algorithm for ultra-scale multiple sequence alignments (3M protein sequences in 5 minutes and 24 GB of RAM)
GNU General Public License v3.0
150 stars 25 forks source link

Remove gaps automatically if the input is a MSA #23

Closed apcamargo closed 2 years ago

apcamargo commented 3 years ago

Other aligners, such as MAFFT, can take an aligned FASTA as input and then realign the sequences within it. Because FAMSA doesn't remove gaps before the alignment, using a MSA FASTA as input will result in a measingless output.

sachdved commented 2 years ago

would something like this work?

size_t IOService::loadProfile(const std::string& file_name, std::vector<CGappedSequence>& test)
{
    istream* in;
    ifstream infile;

    if (execution_params.input_file_name=="STDIN"){
      in=&cin;
    }
    else {
      infile.open(execution_params.input_file_name, ios_base::in);
      if (!infile.good())
        return 0;
      in = &infile;
    }

    string s;
    string id, seq;
    string reduced_sequence;    

    while (in->good())
      {
        getline(*in, s);
        while (!s.empty() && (s[s.length()-1]=='\n' || s[s.length()-1]=='\r'))
          s.pop_back();
        if (s.empty())
          continue;
        if (s[0]=='>')
          {
        if (!id.empty() && !seq.empty())
          {

            for (auto &ch : seq){
              if (ch!='-')
            reduced_sequence.push_back(ch);
            }
            test.emplace_back(id, seq);
            reduced_sequence.clear();
            seq.clear();
          }
        id=s;
          }
        else
          seq+=s;
      }
    if (!id.empty() && !seq.empty()){
      test.push_back(CGappedSequence(id, seq));
    }

    ext_profile = new CProfile(test[0], &params);
    for (auto gs; range(test.begin()+1, test.end()))
      {
        ext_profile.AppendRawSequence(gs);

      }
    ext_profile.CalculateCounterScores();

    return test.size();

}
agudys commented 2 years ago

@sachdved It should work, but pushing back single characters to the string without preallocating is not the fastest way of doing it. I would post a patch for the issue this week.

agudys commented 2 years ago

@sachdved Looked again - after reading the first line, the reduced_sequence would be preallocated, so pushing back characters in the following lines would be constant-time. Though, doing that in place with std::remove should be still a bit faster :)

sachdved commented 2 years ago

I'll try adding in the std::remove portion!

There's actually one additional piece of code I forgot to include! What I showed is a way of taking the aligned sequences and making them CSequence objects, but I also have to make them CGappedSequence objects. This led me to defining an additional method in the CGappedSequence class. Hopefully, this also looks good?

CGappedSequence::CGappedSequence(const string& _id, const string& seq) : id(_id)
{
  string reduced_seq;
  for (auto &ch : seq)
    {
    if (ch!='-')
      {
    reduced_seq.push_back(ch);
      }
    }
  size_t length=reduced_seq.length();
  symbols.resize(length);
  uppercase.resize(length);

  for (size_t i=0; i<length; ++i)
    {
      char c=reduced_seq[i];
      if(c>'Z')
    {
      c-=32;
      uppercase[i]=false;
    }
      else
    uppercase[i]=true;

      char *q = find(mapping_table, mapping_table+25, c);
      if (q==mapping_table+25)
    symbols[i]=(symbol_t) UNKNOWN_SYMBOL;
      else
    symbols[i]=(symbol_t) (q-mapping_table);
    }
  size=symbols.size();
  gapped_size=size;
  n_gaps.resize(size+1,0);
  int place=0;
  int counter=0;
  for (auto &ch : seq)
    {
      if (ch=='-')
    {
      counter++;
    }
      else {
      n_gaps.at(place)=counter;
      counter=0;
      place++;
    }
  }
InitialiseDPS();

}
agudys commented 2 years ago

@apcamargo @sachdved I have just submitted the patch for the issue. Let me know if it works as expected. Adam