torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

control number of generations, for the iterative growth process #68

Closed nikosdarzentas closed 5 years ago

nikosdarzentas commented 8 years ago

I would suggest an option to control the maximum number of generations for growth. The idea of this process being guided by the data themselves and assisted by the breaking phase is definitely great, but there are cases where the 'branches' become too long and the intra-cluster variability is too large.

I actually ended up using the internal structure output to do what I wanted - still very useful since swarm with d=1 is very fast and very 'clean' (as far as I can tell).

Congrats on the work so far, thank you.

frederic-mahe commented 8 years ago

Dear @nikosdarzentas,

I understand your request. Swarm can report clusters with many generations (I have examples with 50+ generations). A high number of generations suggests that the amplicons at the end of these long branches are very distant from the cluster seed. However, when you actually do the pairwise comparisons, you observe that the distances are usually much smaller. The branches are not straight arrows, but rather zigzag through amplicon-space, ending up not so far from where they started.
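That zigzag effect can be illustrated with a quick sketch (the sequences and the `levenshtein` helper below are made up for the example, not taken from swarm): each step in a chain differs by exactly one edit, yet the steps partly cancel out, so the tip ends up close to the seed.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# A 3-step chain: an insertion, then a substitution, then a deletion.
chain = ["acgtacgt", "aacgtacgt", "aacgtacgc", "acgtacgc"]

# Every consecutive pair is exactly one difference apart...
print([levenshtein(x, y) for x, y in zip(chain, chain[1:])])  # [1, 1, 1]
# ...but the tip is only 1 edit away from the seed, not 3.
print(levenshtein(chain[0], chain[-1]))                       # 1
```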

The breaking phase built into swarm also means that the risk of over-grouping is low, and that amplicons at the end of branches typically have low abundances (i.e. they weigh almost nothing in the final clusters).

Introducing a growth limit will have three possible outcomes for these branch tips:

For these reasons, I am reluctant to introduce an option to set a growth limit. Such an option would lean heavily towards an arbitrary global clustering threshold, a concept we are fighting against.

Unless you can present me with examples justifying a growth limit?

nikosdarzentas commented 8 years ago

Thank you for your comprehensive reply @frederic-mahe - I appreciate your time.

Our current main interest is rearranged antigen receptors of adaptive immunity, e.g. immunoglobulins, and most of the time we're focusing on the most informative junction of such rearrangements, the CDR3. These junctions are just tens of nucleotides long, and a handful of nucleotide differences in specific places can make a big difference in analysis and interpretation (you could very quickly end up looking at different clonal populations). Also, our abundance distributions are not always easy to understand or predict, which means you cannot always rely on them to help 'swarm' figure things out. For these and other reasons, it's wise (if not necessary) to keep an eye on how far cluster members have drifted.

I understand what you're trying to do with 'swarm', I praised it in my initial post. It's also inviting to use a tool when you see just a handful of options - simplicity does matter. But frankly it's tricky for us to let 'swarm' loose on our data, and I would imagine other users would like to have some safeguards and direct control in case of special sequences and systems.

You could argue that users could pre-/post-process data and results (e.g. 'hack' abundances beforehand, or work on the internal network/graph structure afterwards); however, I believe there is room for a few additional options for more advanced or particular uses.
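That post-processing route can be sketched as follows. This is a minimal example, assuming swarm 2's internal-structure output (`-i`), whose five tab-separated columns I take to be amplicon A, amplicon B, number of differences, cluster number, and number of steps from the cluster seed to amplicon B; verify against your swarm version, and note that `trim_clusters` is a made-up helper name:

```python
from collections import defaultdict

def trim_clusters(struct_path, max_steps):
    """Keep only amplicons within max_steps generations of their seed.

    Assumes the five-column internal-structure format described above.
    Singleton clusters produce no lines in that file, so they are not
    listed in the result and must be handled separately.
    """
    kept = defaultdict(set)
    with open(struct_path) as handle:
        for line in handle:
            a, b, _diffs, cluster, steps = line.rstrip("\n").split("\t")
            # 'b' is one generation further from the seed than 'a', so
            # keeping the whole link is safe when 'b' is within range.
            if int(steps) <= max_steps:
                kept[cluster].update((a, b))
    return kept
```

Feeding the trimmed clusters back into a downstream table is left out; the point is only that a generation cap can be applied after the fact.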

Many thanks.

frederic-mahe commented 5 years ago

@nikosdarzentas it is probably far too late, but here is a python version of swarm (d = 1) I use for prototyping. This one is modified to include a -m INTEGER parameter you can use to limit the growth of clusters (1 generation from the seed, 2 generations from the seed, etc). I archive it here, in case someone else needs such a feature.

This is written for Python 2.7 and you will need Biopython (used for fasta parsing). Using -m 0 means unlimited clustering (the default behavior).

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
    Find clusters of nearly identical sequences in giant barcoding
    projects.
"""

from __future__ import print_function

__author__ = "Frédéric Mahé <frederic.mahe@cirad.fr>"
__date__ = "2019/03/04"
__version__ = "$Revision: 9.0"

import os
import sys
from Bio import SeqIO
from operator import itemgetter
from optparse import OptionParser

#*****************************************************************************#
#                                                                             #
#                                  Functions                                  #
#                                                                             #
#*****************************************************************************#

def option_parse():
    """Parse arguments from command line."""
    desc = """Find clusters of nearly identical amplicons in giant
    amplicon-based environmental or clinical projects. Amplicons are
    expected to be sorted by decreasing abundance."""

    parser = OptionParser(usage="usage: %prog --input_file filename",
                          description=desc,
                          version="%prog version 9.0")

    parser.add_option("-i", "--input_file",
                      metavar="<FILENAME>",
                      action="store",
                      dest="input_file",
                      help="set <FILENAME> as input fasta file.")

    parser.add_option("-m", "--max_gen",
                      action="store",
                      type="int",
                      dest="max_gen",
                      default=0,
                      help="Max number of generations when building cluster.")

    (options, args) = parser.parse_args()
    return options.input_file, options.max_gen

def parse_input_file(input_file):
    """Build a list of amplicons."""
    input_format = "fasta"
    amplicons = dict()
    order = list()
    with open(input_file, "rb") as input_file:
        for record in SeqIO.parse(input_file, input_format):
            seq = str(record.seq).lower()  # Convert all sequences to lowercase.
            # Store 0) amplicon_id, 1) amplicon abundance, 2) amplicon
            # status, 3) swarm mass, 4) swarm seed id
            amplicons[seq] = [record.id,
                              int(record.id.split(";size=")[1]),
                              True,
                              0,
                              ""]
            order.append(seq)
    return amplicons, order

def output_swarms(input_file, order, amplicons, swarms, d):
    """Write swarms to a file."""
    # Create new file name
    extension = "_" + str(d) + ".swarms_2_prototype"
    output_file = os.path.splitext(os.path.abspath(input_file))[0] + extension
    with open(output_file, "w") as output_file:
        for seed in order:
            seed_id = amplicons[seed][0]
            if seed_id in swarms:
                print(" ".join(swarms[seed_id]), file=output_file)
    return

def produce_microvariants(seq):
    """Produce all possible micro-variants, without repetition.

    The original sequence must be made exclusively of lowercase
    nucleotides (acgt). Micro-variants have one difference
    (substitution, insertion, or deletion) with the original sequence.
    """
    nucleotides = ("a", "c", "g", "t")
    seq = list(seq)
    length = len(seq)
    microvariants = list()
    # Insertions (insert once, then shift)
    tmp = [""] + seq[:]
    for i in xrange(0, length, 1):
        for nuc in nucleotides:
            if tmp[i+1] != nuc:  # Compare values, not identities
                tmp[i] = nuc
                microvariants.append("".join(tmp))
        tmp[i] = tmp[i+1]
    # Insertions at the last position
    for nuc in nucleotides:
        tmp = seq[:] + [nuc]
        microvariants.append("".join(tmp))
    # Mutations and deletions
    for i in xrange(0, length, 1):
        tmp = seq[:]
        initial = tmp[i]
        for nuc in nucleotides:
            if initial != nuc:  # Avoid useless mutations
                tmp[i] = nuc
                microvariants.append("".join(tmp))
        tmp[i] = initial  # Restore the initial sequence
        # Deletions
        try:
            if tmp[i] != tmp[i+1]:  # delete at the end of homopolymers
                del tmp[i]
                microvariants.append("".join(tmp))
        except IndexError:  # Deletion at the last position
            del tmp[i]
            microvariants.append("".join(tmp))
    return microvariants

def main():
    """Load and parse input fasta file and clusterize it."""
    # Parse command line options.
    input_file, max_gen = option_parse()

    # Build a dict of amplicons and a list to preserve input order
    # (assuming decreasing abundance)
    amplicons, order = parse_input_file(input_file)

    # Store each swarm created in a dictionary: swarm[seed] = list(amplicons)
    swarms = dict()

    # Phase 1: d = 1 clustering -----------------------------------------------

    # Start swarming
    for seed in order:

        seed_id, seed_abundance, seed_status = amplicons[seed][0:3]
        if not seed_status:  # Skip amplicons already swarmed
            continue

        # Seed id and status
        swarm = [seed_id]
        amplicons[seed][2] = False  # could be seed_status = False
        amplicons[seed][4] = seed_id  # point to itself
        generation = 0

        # Create micro-variants
        microvariants = produce_microvariants(seed)

        # Which of these microvariants are in our dataset?
        hits = [(microvariant, amplicons[microvariant][1])
                for microvariant in microvariants
                if microvariant in amplicons
                and amplicons[microvariant][2]]  # No need to check
                                                 # abundance here

        # Isolated seed? close the swarm
        if not hits:
            swarms[swarm[0]] = swarm
            continue

        # Sort by decreasing abundance
        hits.sort(key=itemgetter(1, 0), reverse=True)

        # Add them to the swarm and update their status
        swarm.extend([amplicons[hit[0]][0] for hit in hits])
        for hit in hits:
            hit_seq = hit[0]
            amplicons[hit_seq][2] = False
            amplicons[hit_seq][4] = seed_id  # point to swarm seed

        # limit to m generations
        generation += 1
        if generation >= max_gen and max_gen > 0:
            swarms[swarm[0]] = swarm  # Memorize the swarm
            continue

        # Work on subseeds (also save a list of hits along the way)
        all_hits = [(seed, seed_abundance)]
        all_subseeds = [hits]
        all_hits.extend(hits)
        for subseeds in all_subseeds:
            nextseeds = list()
            for subseed in subseeds:
                subseed_seq, subseed_abundance = subseed[0], subseed[1]
                # Update subseed status
                amplicons[subseed_seq][2] = False
                # Produce all microvariants of subseed
                microvariants = produce_microvariants(subseed_seq)

                # Which of these microvariants are in our dataset?
                # (discard hits with higher abundance value)
                hits = [(microvariant, amplicons[microvariant][1])
                        for microvariant in microvariants
                        if microvariant in amplicons
                        and amplicons[microvariant][2]
                        and amplicons[microvariant][1] <= subseed_abundance]

                if not hits:  # subseed has no son
                    continue

                # Sort by decreasing abundance
                hits.sort(key=itemgetter(1, 0), reverse=True)
                nextseeds.extend(hits)  # Hits in nextseeds are not globally
                                        # sorted at that point

                # Add hits to the swarm and update their status
                swarm.extend([amplicons[hit[0]][0] for hit in hits])
                all_hits.extend(hits)
                for hit in hits:
                    hit_seq = hit[0]
                    amplicons[hit_seq][2] = False
                    amplicons[hit_seq][4] = seed_id  # point to swarm seed

            if not nextseeds:  # No new subseeds, end of the all_subseeds list
                swarms[swarm[0]] = swarm  # Memorize the swarm
                break

            # limit to m generations
            generation += 1
            if generation >= max_gen and max_gen > 0:
                swarms[swarm[0]] = swarm  # Memorize the swarm
                break

            # Hits in nextseeds are globally sorted (each layer of
            # microvariant is sorted by decreasing abundance)
            nextseeds.sort(key=itemgetter(1, 0), reverse=True)
            all_subseeds.append(nextseeds)

    # Output swarms (d = 1)
    output_swarms(input_file, order, amplicons, swarms, 1)

    return

#*****************************************************************************#
#                                                                             #
#                                     Body                                    #
#                                                                             #
#*****************************************************************************#

if __name__ == '__main__':

    main()

    sys.exit(0)

Tests:

# Test the new max generation option
# 0: 1 OTU
# 1: 2 OTUs
# 2: 2 OTUs
# 3: 1 OTU
for m in 0 1 2 3 ; do
    printf ">s1;size=9\nA\n>s2;size=8\nAA\n>s3;size=7\nAAA\n>s4;size=6\nAAAA\n" > tmp.fas
    python2.7 swarm.py -i tmp.fas -m ${m}
    cat tmp_1.*
    rm tmp.*
done

# Test the new max generation option (add a second OTU)
# 0: 2 OTUs
# 1: 3 OTUs
# 2: 3 OTUs
# 3: 2 OTUs
for m in 0 1 2 3 ; do
    printf ">s1;size=9\nA\n>s2;size=8\nAA\n>s3;size=7\nAAA\n>s4;size=6\nAAAA\n>s5;size=6\nGGGG\n" > tmp.fas
    python2.7 swarm.py -i tmp.fas -m ${m}
    cat tmp_1.*
    rm tmp.*
done

I will close this issue, but feel free to ask questions or to open a new one if need be.

nikosdarzentas commented 5 years ago

@frederic-mahe - sorry, I just saw this! Many thanks. Indeed I found other indirect ways to do what I want, but this is good to know.