Open edgardomortiz opened 4 years ago
Implementing this during initial clustering is quite an undertaking. However, would it be sufficient for your purpose to do semi-global alignment in the reverse complement step and check the identity based on this semi-global alignment? This should have the same outcome any is much easier to implement.
Yes, that would be awesome!
To comment further on this, it is actually nice to have an option to fuse clusters which are highly identical when comparing seq1 sense vs seq2 sense rather then only comparing seq1 and reverse complement seq2 in the detect_reverse_complement script in the consensus.py module.
I was getting separate clusters because the primer regions might be ambiguous due to low coverage and low sequencing quality/higher degree of mutations. Thus, I would get multiple clusters which were highly similar when comparing sense vs sense (95%+ identity, only differing in primer regions). This is something which is easy to add (I did it a bit ugly here but it works):
def detect_reverse_complements(centers, rc_identity_threshold):
filtered_centers = []
already_removed = set()
for i, (nr_reads_in_cl, c_id, seq, reads_path) in enumerate(centers):
if type(reads_path) != list:
all_reads = [reads_path]
else:
all_reads = reads_path
merged_cluster_id = c_id
merged_nr_reads = nr_reads_in_cl
if c_id in already_removed:
print("has already been merged, skipping")
continue
elif i == len(centers) - 1: # last sequence and it is not in already removed
filtered_centers.append( [merged_nr_reads, c_id, seq, all_reads ] )
else:
for j, (nr_reads_in_cl2, c_id2, seq2, reads_path) in enumerate(centers[i+1:]):
#check for reverse complement matches
seq2_rc = reverse_complement(seq2)
seq_aln_rc, seq2_rc_aln, cigar_string_rc, cigar_tuples_rc, alignment_score_rc = parasail_alignment(seq, seq2_rc)
seq_aln, seq2_aln, cigar_string, cigar_tuples, alignment_score = parasail_alignment(seq, seq2)
#Determine alignment identity between complement seq1 and reverse complement seq2
nr_mismatching_pos_rc = len([1 for n1, n2 in zip(seq_aln_rc, seq2_rc_aln) if n1 != n2])
total_pos_rc = len(seq_aln_rc)
aln_identity_rc = (total_pos_rc - nr_mismatching_pos_rc) / float(total_pos_rc)
#Determine alignment identity between complement seq1 and complement seq2
nr_mismatching_pos = len([1 for n1, n2 in zip(seq_aln, seq2_aln) if n1 != n2])
total_pos = len(seq_aln)
aln_identity = (total_pos - nr_mismatching_pos) / float(total_pos)
print("Complement and Reverse Complement Identity seq 1 and 2: " + str(aln_identity_rc))
print("Complement and Complement Identity seq 1 and 2: " + str(aln_identity))
if aln_identity_rc >= rc_identity_threshold:
print("Detected alignment identidy above threshold for complement vs reverse complement. Keeping center with the most read support and adding rc reads to supporting reads.")
merged_nr_reads += nr_reads_in_cl2
already_removed.add(c_id2)
if type(reads_path) != list:
all_reads.append(reads_path)
else:
for rp in reads_path:
all_reads.append(rp)
if aln_identity >= rc_identity_threshold:
print("Detected alignment identidy above threshold for complement vs complement. Keeping center with the most read support and adding rc reads to supporting reads.")
merged_nr_reads += nr_reads_in_cl2
already_removed.add(c_id2)
if type(reads_path) != list:
all_reads.append(reads_path)
else:
for rp in reads_path:
all_reads.append(rp)
filtered_centers.append( [merged_nr_reads, c_id, seq, all_reads] )
print(len(filtered_centers), "consensus formed.")
return filtered_centers
Hello Kristoffer,
I ran
NGSpeciesID
with the following command:And everything worked nicely but in some cases it creates two separate consensus sequences in opposite directions that should have been joined by
--rc_identity_threshold 0.85
. For example in the picture, the overlapping region of both consensuses has a similarity of 98.7%Would it be too hard to implement reverse complement during the initial clustering steps to avoid this kind of result? This would be very useful for the case of fragmented amplicons (and it would save time in general, even with full length amplicons, by avoiding the separate
spoa
centering of the forward and reverse clusters just to be merged afterwards by--rc_identity_threshold
).