Closed apcamargo closed 2 years ago
would something like this work?
size_t IOService::loadProfile(const std::string& file_name, std::vector<CGappedSequence>& test)
{
istream* in;
ifstream infile;
if (execution_params.input_file_name=="STDIN"){
in=&cin;
}
else {
infile.open(execution_params.input_file_name, ios_base::in);
if (!infile.good())
return 0;
in = &infile;
}
string s;
string id, seq;
string reduced_sequence;
while (in->good())
{
getline(*in, s);
while (!s.empty() && (s[s.length()-1]=='\n' || s[s.length()-1]=='\r'))
s.pop_back();
if (s.empty())
continue;
if (s[0]=='>')
{
if (!id.empty() && !seq.empty())
{
for (auto &ch : seq){
if (ch!='-')
reduced_sequence.push_back(ch);
}
test.emplace_back(id, seq);
reduced_sequence.clear();
seq.clear();
}
id=s;
}
else
seq+=s;
}
if (!id.empty() && !seq.empty()){
test.push_back(CGappedSequence(id, seq));
}
ext_profile = new CProfile(test[0], ¶ms);
for (auto gs; range(test.begin()+1, test.end()))
{
ext_profile.AppendRawSequence(gs);
}
ext_profile.CalculateCounterScores();
return test.size();
}
@sachdved It should work, but pushing back single characters to the string without preallocating is not the fastest way of doing it. I would post a patch for the issue this week.
@sachdved Looked again - after reading the first line, the reduced_sequence
would be preallocated, so pushing back characters in the following lines would be constant-time. Though, doing that in place with std::remove
should be still a bit faster :)
I'll try adding in the std::remove portion!
There's actually one additional piece of code I forgot to include! What I showed is a way of taking the aligned sequences and making them CSequence objects, but I also have to make them CGappedSequence objects. This led me to defining an additional method in the CGappedSequence class. Hopefully, this also looks good?
CGappedSequence::CGappedSequence(const string& _id, const string& seq) : id(_id)
{
string reduced_seq;
for (auto &ch : seq)
{
if (ch!='-')
{
reduced_seq.push_back(ch);
}
}
size_t length=reduced_seq.length();
symbols.resize(length);
uppercase.resize(length);
for (size_t i=0; i<length; ++i)
{
char c=reduced_seq[i];
if(c>'Z')
{
c-=32;
uppercase[i]=false;
}
else
uppercase[i]=true;
char *q = find(mapping_table, mapping_table+25, c);
if (q==mapping_table+25)
symbols[i]=(symbol_t) UNKNOWN_SYMBOL;
else
symbols[i]=(symbol_t) (q-mapping_table);
}
size=symbols.size();
gapped_size=size;
n_gaps.resize(size+1,0);
int place=0;
int counter=0;
for (auto &ch : seq)
{
if (ch=='-')
{
counter++;
}
else {
n_gaps.at(place)=counter;
counter=0;
place++;
}
}
InitialiseDPS();
}
@apcamargo @sachdved I have just submitted the patch for the issue. Let me know if it works as expected. Adam
Other aligners, such as MAFFT, can take an aligned FASTA as input and then realign the sequences within it. Because FAMSA doesn't remove gaps before the alignment, using a MSA FASTA as input will result in a measingless output.