Closed: nikosdarzentas closed this issue 5 years ago
Dear @nikosdarzentas,
I understand your request. Swarm can report clusters with many generations (I have examples with 50+ generations). A high number of generations suggests that the amplicons at the end of these long branches are very distant from the cluster seed. When you actually do the pairwise comparison, though, you observe that the distances are usually much smaller. The branches are not straight arrows; they zigzag through amplicon-space, ending up not so far from where they started.
The breaking phase built into swarm also implies that the risk of over-grouping is low, and that amplicons at the end of branches typically have low abundances (i.e., they weigh almost nothing in the final clusters).
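To make the zigzag argument concrete, here is a small hypothetical illustration (the sequences and the `levenshtein` helper are mine, not part of swarm): a chain of amplicons in which each member differs from its parent by exactly one mutation, yet after three generations the branch tip is still only one difference away from the seed.

```python
def levenshtein(a, b):
    """Edit distance between two strings (iterative dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                   # deletion
                               current[j - 1] + 1,                # insertion
                               previous[j - 1] + (ca != cb)))     # substitution
        previous = current
    return previous[-1]

# A hypothetical d = 1 chain: seed -> generation 1 -> 2 -> 3
chain = ["acgtacgt",   # seed
         "ccgtacgt",   # substitution at position 0
         "ccgtaacgt",  # insertion of an extra "a"
         "acgtaacgt"]  # substitution back to "a" at position 0

# Each generation is exactly one difference from its parent...
for parent, child in zip(chain, chain[1:]):
    assert levenshtein(parent, child) == 1

# ...but the tip of the branch is only one difference from the seed:
# the branch zigzagged instead of moving away in a straight line.
print(levenshtein(chain[0], chain[-1]))  # -> 1
```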
Introducing a growth limit will have three possible outcomes for these branch tips:
For these reasons, I am reluctant to introduce an option that sets a growth limit. Such an option would lean toward an arbitrary global clustering threshold, a concept we are fighting against.
Unless you can present me with examples justifying a growth limit?
Thank you for your comprehensive reply @frederic-mahe - I appreciate your time.
Our current main interest is the rearranged antigen receptors of adaptive immunity, e.g. immunoglobulins, and most of the time we're focusing on the most informative part of such rearrangements, the CDR3 junction. These junctions are just tens of nucleotides long, and a handful of nucleotide differences in specific places can make a big difference in analysis and interpretation (you could very quickly end up looking at different clonal populations). Also, our abundance distributions are not always easy to understand or predict, which means you cannot always rely on them to help 'swarm' figure things out. For these and other reasons, it's wise (if not necessary) to keep an eye on how far cluster members have strayed.
I understand what you're trying to do with 'swarm'; I praised it in my initial post. It's also inviting to use a tool with just a handful of options: simplicity does matter. But frankly, it's tricky for us to let 'swarm' loose on our data, and I imagine other users would also like some safeguards and direct control when dealing with special sequences and systems.
You could argue that users could pre- or post-process data and results (e.g. 'hack' abundances beforehand, or work on the internal network/graph structure afterwards); however, I believe there is room for a few additional options for more advanced or particular uses.
Many thanks.
@nikosdarzentas it is probably far too late, but here is a python version of swarm (d = 1) that I use for prototyping. This one is modified to include a -m INTEGER parameter you can use to limit the growth of clusters (1 generation from the seed, 2 generations from the seed, etc.). I archive it here in case someone else needs such a feature. It is written for python 2.7, and you will need biopython (used for fasta parsing). Using -m 0 means unlimited clustering (default behavior).
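The script expects usearch-style abundance annotations in the fasta headers (">id;size=N") and input sorted by decreasing abundance; the abundance is recovered by splitting the identifier. A minimal sketch of that convention (my own example headers, mirroring the test data further down):

```python
# Hypothetical example headers following the ">id;size=N" convention
# that parse_input_file() in the script below relies on.
headers = ["s1;size=9", "s2;size=8", "s3;size=7", "s4;size=6"]

# Recover the abundance of each amplicon from its identifier.
abundances = [int(h.split(";size=")[1]) for h in headers]
print(abundances)  # -> [9, 8, 7, 6]

# The script assumes amplicons are sorted by decreasing abundance.
assert abundances == sorted(abundances, reverse=True)
```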
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Find clusters of nearly identical sequences in giant barcoding
projects.
"""

from __future__ import print_function

__author__ = "Frédéric Mahé <frederic.mahe@cirad.fr>"
__date__ = "2019/03/04"
__version__ = "$Revision: 9.0"

import os
import sys
from Bio import SeqIO
from operator import itemgetter
from optparse import OptionParser

#*****************************************************************************#
#                                                                             #
#                                  Functions                                  #
#                                                                             #
#*****************************************************************************#


def option_parse():
    """Parse arguments from command line."""
    desc = """Find clusters of nearly identical amplicons in giant
    amplicon-based environmental or clinical projects. Amplicons are
    expected to be sorted by decreasing abundance."""
    parser = OptionParser(usage="usage: %prog --input_file filename",
                          description=desc,
                          version="%prog version 9.0")
    parser.add_option("-i", "--input_file",
                      metavar="<FILENAME>",
                      action="store",
                      dest="input_file",
                      help="set <FILENAME> as input fasta file.")
    parser.add_option("-m", "--max_gen",
                      action="store",
                      type="int",
                      dest="max_gen",
                      default=0,
                      help="Max number of generations when building clusters.")
    (options, args) = parser.parse_args()
    return options.input_file, options.max_gen


def parse_input_file(input_file):
    """Build a list of amplicons."""
    input_format = "fasta"
    amplicons = dict()
    order = list()
    with open(input_file, "rb") as input_file:
        for record in SeqIO.parse(input_file, input_format):
            seq = str(record.seq).lower()  # Convert all sequences to lowercase
            # Store 0) amplicon_id, 1) amplicon abundance, 2) amplicon
            # status, 3) swarm mass, 4) swarm seed id
            amplicons[seq] = [record.id,
                              int(record.id.split(";size=")[1]),
                              True,
                              0,
                              ""]
            order.append(seq)
    return amplicons, order


def output_swarms(input_file, order, amplicons, swarms, d):
    """Write swarms to a file."""
    # Create a new file name
    extension = "_" + str(d) + ".swarms_2_prototype"
    output_file = os.path.splitext(os.path.abspath(input_file))[0] + extension
    with open(output_file, "w") as output_file:
        for seed in order:
            seed_id = amplicons[seed][0]
            if seed_id in swarms:
                print(" ".join(swarms[seed_id]), file=output_file)
    return


def produce_microvariants(seq):
    """Produce all possible micro-variants, without repetition.

    The original sequence must be made exclusively of lowercase
    nucleotides (acgt). Micro-variants have one difference
    (substitution, insertion, or deletion) with the original sequence.
    """
    nucleotides = ("a", "c", "g", "t")
    seq = list(seq)
    length = len(seq)
    microvariants = list()
    # Insertions (insert once, then shift)
    tmp = [""] + seq[:]
    for i in xrange(0, length, 1):
        for nuc in nucleotides:
            if tmp[i + 1] != nuc:
                tmp[i] = nuc
                microvariants.append("".join(tmp))
        tmp[i] = tmp[i + 1]
    # Insertions at the last position
    for nuc in nucleotides:
        tmp = seq[:] + [nuc]
        microvariants.append("".join(tmp))
    # Mutations and deletions
    for i in xrange(0, length, 1):
        tmp = seq[:]
        initial = tmp[i]
        for nuc in nucleotides:
            if initial != nuc:  # Avoid useless mutations
                tmp[i] = nuc
                microvariants.append("".join(tmp))
        tmp[i] = initial  # Restore the initial sequence
        # Deletions
        try:
            if tmp[i] != tmp[i + 1]:  # Delete only at the end of homopolymers
                del tmp[i]
                microvariants.append("".join(tmp))
        except IndexError:  # Deletion at the last position
            del tmp[i]
            microvariants.append("".join(tmp))
    return microvariants


def main():
    """Load and parse input fasta file and clusterize it."""
    # Parse command line options.
    input_file, max_gen = option_parse()
    # Build a dict of amplicons and a list to preserve input order
    # (assuming decreasing abundance)
    amplicons, order = parse_input_file(input_file)
    # Store each swarm created in a dictionary: swarm[seed] = list(amplicons)
    swarms = dict()

    # Phase 1: d = 1 clustering ----------------------------------------------

    # Start swarming
    for seed in order:
        seed_id, seed_abundance, seed_status = amplicons[seed][0:3]
        if not seed_status:  # Skip amplicons already swarmed
            continue
        # Seed id and status
        swarm = [seed_id]
        amplicons[seed][2] = False  # could be seed_status = False
        amplicons[seed][4] = seed_id  # point to itself
        generation = 0
        # Create micro-variants
        microvariants = produce_microvariants(seed)
        # Which of these microvariants are in our dataset?
        hits = [(microvariant, amplicons[microvariant][1])
                for microvariant in microvariants
                if microvariant in amplicons
                and amplicons[microvariant][2]]  # No need to check
                                                 # abundance here
        # Isolated seed? Close the swarm
        if not hits:
            swarms[swarm[0]] = swarm
            continue
        # Sort by decreasing abundance
        hits.sort(key=itemgetter(1, 0), reverse=True)
        # Add them to the swarm and update their status
        swarm.extend([amplicons[hit[0]][0] for hit in hits])
        for hit in hits:
            hit_seq = hit[0]
            amplicons[hit_seq][2] = False
            amplicons[hit_seq][4] = seed_id  # point to swarm seed
        # Limit to m generations
        generation += 1
        if generation >= max_gen and max_gen > 0:
            swarms[swarm[0]] = swarm  # Memorize the swarm
            continue
        # Work on subseeds (also save a list of hits along the way)
        all_hits = [(seed, seed_abundance)]
        all_subseeds = [hits]
        all_hits.extend(hits)
        for subseeds in all_subseeds:
            nextseeds = list()
            for subseed in subseeds:
                subseed_seq, subseed_abundance = subseed[0], subseed[1]
                # Update subseed status
                amplicons[subseed_seq][2] = False
                # Produce all microvariants of subseed
                microvariants = produce_microvariants(subseed_seq)
                # Which of these microvariants are in our dataset?
                # (discard hits with a higher abundance value)
                hits = [(microvariant, amplicons[microvariant][1])
                        for microvariant in microvariants
                        if microvariant in amplicons
                        and amplicons[microvariant][2]
                        and amplicons[microvariant][1] <= subseed_abundance]
                if not hits:  # Subseed has no son
                    continue
                # Sort by decreasing abundance
                hits.sort(key=itemgetter(1, 0), reverse=True)
                nextseeds.extend(hits)  # Hits in nextseeds are not globally
                                        # sorted at that point
                # Add hits to the swarm and update their status
                swarm.extend([amplicons[hit[0]][0] for hit in hits])
                all_hits.extend(hits)
                for hit in hits:
                    hit_seq = hit[0]
                    amplicons[hit_seq][2] = False
                    amplicons[hit_seq][4] = seed_id  # point to swarm seed
            if not nextseeds:  # No new subseeds, end of the all_subseeds list
                swarms[swarm[0]] = swarm  # Memorize the swarm
                break
            # Limit to m generations
            generation += 1
            if generation >= max_gen and max_gen > 0:
                swarms[swarm[0]] = swarm  # Memorize the swarm
                break
            # Sort hits in nextseeds globally (each layer of
            # microvariants is sorted by decreasing abundance)
            nextseeds.sort(key=itemgetter(1, 0), reverse=True)
            all_subseeds.append(nextseeds)

    # Output swarms (d = 1)
    output_swarms(input_file, order, amplicons, swarms, 1)
    return


#*****************************************************************************#
#                                                                             #
#                                     Body                                    #
#                                                                             #
#*****************************************************************************#

if __name__ == '__main__':
    main()
    sys.exit(0)
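As an aside, the number of micro-variants grows only linearly with sequence length: for a sequence of length L over {a, c, g, t} with no homopolymer runs there are 3L substitutions, 3L + 4 distinct insertions, and L distinct deletions, i.e. 7L + 4 in total (each homopolymer run longer than one removes one deletion). A quick brute-force check of those counts, independent of the script above (so it also runs under Python 3):

```python
def one_edit_variants(seq, alphabet="acgt"):
    """Brute-force set of all strings at edit distance exactly 1 from seq."""
    variants = set()
    for i in range(len(seq)):
        for x in alphabet:                       # substitutions
            if x != seq[i]:
                variants.add(seq[:i] + x + seq[i + 1:])
        variants.add(seq[:i] + seq[i + 1:])      # deletions
    for i in range(len(seq) + 1):                # insertions
        for x in alphabet:
            variants.add(seq[:i] + x + seq[i:])
    return variants

print(len(one_edit_variants("acgt")))   # no homopolymers: 7 * 4 + 4 = 32
print(len(one_edit_variants("aacgt")))  # the "aa" run removes one deletion: 38
```

These counts should match what produce_microvariants() generates, and they explain why the d = 1 search stays fast: clustering around a seed only requires looking up 7L + 4 candidate strings in a hash table, instead of comparing the seed against every other amplicon.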
Tests:

# Test the new max generation option
# 0: 1 OTU
# 1: 4 OTUs
# 2: 2 OTUs
# 3: 1 OTU
for m in 0 1 2 3 ; do
    printf ">s1;size=9\nA\n>s2;size=8\nAA\n>s3;size=7\nAAA\n>s4;size=6\nAAAA\n" > tmp.fas
    python2.7 swarm.py -i tmp.fas -m ${m}
    cat tmp_1.*
    rm tmp.*
done

# Test the new max generation option (add a second OTU)
# 0: 2 OTUs
# 1: 5 OTUs
# 2: 3 OTUs
# 3: 2 OTUs
for m in 0 1 2 3 ; do
    printf ">s1;size=9\nA\n>s2;size=8\nAA\n>s3;size=7\nAAA\n>s4;size=6\nAAAA\n>s5;size=6\nGGGG\n" > tmp.fas
    python2.7 swarm.py -i tmp.fas -m ${m}
    cat tmp_1.*
    rm tmp.*
done
I will close that issue, but feel free to ask questions or open a new issue if need be.
@frederic-mahe - sorry, I just saw this! Many thanks. Indeed I found other indirect ways to do what I want, but this is good to know.
I would suggest an option to control the maximum number of generations for growth. The idea of this process being guided by the data themselves and assisted by the breaking phase is definitely great, but there are cases where the 'branches' become too long and the intra-cluster variability is too large.
I actually ended up using the internal structure output to do what I wanted - still very useful since swarm with d=1 is very fast and very 'clean' (as far as I can tell).
Congrats on the work so far, thank you.