torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
123 stars 23 forks source link

average pairwise distances of each OTU #88

Closed ranocchia4 closed 7 years ago

ranocchia4 commented 7 years ago

Hello Frederic,

I am using swarm to cluster my sequences, but I would like to know the min, max and average pairwise distances between the different amplicons within each OTU. Does swarm produce this kind of information in one of the output files? Thank you very much for your help, Francesca

frederic-mahe commented 7 years ago

Hi @ranocchia4 ,

this is not a statistics that swarm can output directly, precisely because swarm is very different from other clustering methods. For a given OTU, other methods systematically compute the distance of all amplicons to the OTU seed. Swarm outputs a different information: a minimum spanning tree, which is a collection of uninterrupted paths of amplicons separated by very short links (/d/), all starting from the OTU seed and reaching the OTU margins.

It is possible to go back to the data and to compute some additional statistics post-clustering. But that will take some computation time, depending on what you want exactly (amplicons vs seed average distance, or all-vs.-all amplicons average distance?)

If you are interested in the max amplicon-vs.-seed distance, I wrote a script to do that a while ago (note, I am not sure the script can work with results produced with the /fastidious/ option):

INPUT="my_project"
VSEARCH="/usr/bin/vsearch"

SWARMS="${INPUT}.swarms"
STATS="${INPUT}.stats"
STATS2="${INPUT}.stats2"
SEQUENCES=$(mktemp)
VSEARCH_OUTPUT=$(mktemp)
ALL_SEQUENCES=$(mktemp)

# Reduce dataset (keep only first and last amplicon in each OTU)
grep \
    --no-group-separator \
    -A 1 \
    -F \
    -f <(grep -F -f <(awk '{if ($1 > 1) print $3}' "${STATS}") "${SWARMS}" | \
                   awk '{print $1"\n"$NF}' | \
                   cut -d ";" -f 1) "${INPUT}.fas" > "${ALL_SEQUENCES}"

# Extract sequences (for the top 1000 OTUs)
head -n 1000 "${STATS}" | \
while read uniques reads seed seed_mass low_abundant generations max_radius ; do
    if [[ "${uniques}" -gt 1 ]] ; then
        grep -A 1 -m 1 "^>${seed}" "${ALL_SEQUENCES}" > "${SEQUENCES}"
        LAST=$(grep -m 1 "^${seed}" "${SWARMS}" | awk '{print $NF}')
        grep -A 1 -m 1 "^>${LAST}" "${ALL_SEQUENCES}" >> "${SEQUENCES}"
        "${VSEARCH}" \
            --allpairs_global "${SEQUENCES}" \
            --acceptall \
            --userout "${VSEARCH_OUTPUT}" \
            --userfields id1 \
            --threads 1 \
            --quiet
        IDENTITY=$(< "${VSEARCH_OUTPUT}")
    else
        IDENTITY="NA"
    fi
    printf "%d\t%d\t%s\t%d\t%d\t%d\t%d\t%s\n" \
           ${uniques} \
           ${reads} \
           ${seed} \
           ${seed_mass} \
           ${low_abundant} \
           ${generations} \
           ${max_radius} \
           ${IDENTITY}
done >> "${STATS2}"

rm "${SEQUENCES}" "${VSEARCH_OUTPUT}" "${ALL_SEQUENCES}"

You should be able to modify that script to compute min and average. Although, the distance distribution being strongly skewed, I suggest to compute a median distance rather than an average.

Best,

ranocchia4 commented 7 years ago

Thank you so much for your answer Frederic.

I will try your script, hopefully I will get it to work.

Francesca

From: Frédéric Mahé notifications@github.com Reply-To: torognes/swarm reply@reply.github.com Date: Friday, September 16, 2016 at 2:17 AM To: torognes/swarm swarm@noreply.github.com Cc: ranocchia4 fdemarti@asu.edu, Mention mention@noreply.github.com Subject: Re: [torognes/swarm] average pairwise distances of each OTU (#88)

Hi @ranocchia4https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ranocchia4&d=CwMFaQ&c=AGbYxfJbXK67KfXyGqyv2Ejiz41FqQuZFk4A-1IxfAU&r=mQAYhQrunkmkm0FDyw1c2N4v0ABED9Ktp0ZULGhbhRY&m=KfvNHyap5B9yavccWizt7q6MkmCrIodhOI6_Hr-FUjs&s=ptJna-P7Qacq_zO1Iw6vFRc7UZOYbujpajMywAWTGe8&e= ,

this is not a statistics that swarm can output directly, precisely because swarm is very different from other clustering methods. For a given OTU, other methods systematically compute the distance of all amplicons to the OTU seed. Swarm outputs a different information: a minimum spanning tree, which is a collection of uninterrupted paths of amplicons separated by very short links (/d/), all starting from the OTU seed and reaching the OTU margins.

It is possible to go back to the data and to compute some additional statistics post-clustering. But that will take some computation time, depending on what you want exactly (amplicons vs seed average distance, or all-vs.-all amplicons average distance?)

If you are interested in the max amplicon-vs.-seed distance, I wrote a script to do that a while ago (note, I am not sure the script can work with results produced with the /fastidious/ option):

INPUT="my_project"

VSEARCH="/usr/bin/vsearch"

SWARMS="${INPUT}.swarms"

STATS="${INPUT}.stats"

STATS2="${INPUT}.stats2"

SEQUENCES=$(mktemp)

VSEARCH_OUTPUT=$(mktemp)

ALL_SEQUENCES=$(mktemp)

Reduce dataset (keep only first and last amplicon in each OTU)

grep \

--no-group-separator \

-A 1 \

-F \

-f <(grep -F -f <(awk '{if ($1 > 1) print $3}' "${STATS}") "${SWARMS}" | \

               awk '{print $1"\n"$NF}' | \

               cut -d ";" -f 1) "${INPUT}.fas" > "${ALL_SEQUENCES}"

Extract sequences (for the top 1000 OTUs)

head -n 1000 "${STATS}" | \

while read uniques reads seed seed_mass low_abundant generations max_radius ; do

if [[ "${uniques}" -gt 1 ]] ; then

    grep -A 1 -m 1 "^>${seed}" "${ALL_SEQUENCES}" > "${SEQUENCES}"

    LAST=$(grep -m 1 "^${seed}" "${SWARMS}" | awk '{print $NF}')

    grep -A 1 -m 1 "^>${LAST}" "${ALL_SEQUENCES}" >> "${SEQUENCES}"

    "${VSEARCH}" \

        --allpairs_global "${SEQUENCES}" \

        --acceptall \

        --userout "${VSEARCH_OUTPUT}" \

        --userfields id1 \

        --threads 1 \

        --quiet

    IDENTITY=$(< "${VSEARCH_OUTPUT}")

else

    IDENTITY="NA"

fi

printf "%d\t%d\t%s\t%d\t%d\t%d\t%d\t%s\n" \

       ${uniques} \

       ${reads} \

       ${seed} \

       ${seed_mass} \

       ${low_abundant} \

       ${generations} \

       ${max_radius} \

       ${IDENTITY}

done >> "${STATS2}"

rm "${SEQUENCES}" "${VSEARCH_OUTPUT}" "${ALL_SEQUENCES}"

You should be able to modify that script to compute min and average. Although, the distance distribution being strongly skewed, I suggest to compute a median distance rather than an average.

Best,

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_torognes_swarm_issues_88-23issuecomment-2D247555363&d=CwMFaQ&c=AGbYxfJbXK67KfXyGqyv2Ejiz41FqQuZFk4A-1IxfAU&r=mQAYhQrunkmkm0FDyw1c2N4v0ABED9Ktp0ZULGhbhRY&m=KfvNHyap5B9yavccWizt7q6MkmCrIodhOI6_Hr-FUjs&s=pAEAUDnr1UYNiUKGulld7K25S4tJwosk86pIV5JZUUk&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AT-2D51deErHlAIIgf6CXrhAkpSvE-2DzWLSks5qql6WgaJpZM4J-2DeoG&d=CwMFaQ&c=AGbYxfJbXK67KfXyGqyv2Ejiz41FqQuZFk4A-1IxfAU&r=mQAYhQrunkmkm0FDyw1c2N4v0ABED9Ktp0ZULGhbhRY&m=KfvNHyap5B9yavccWizt7q6MkmCrIodhOI6_Hr-FUjs&s=I_HieHh6iQZbvhbd25d88Bsa6YYxEquwl0jRvarZsms&e=.

frederic-mahe commented 7 years ago

Reporting average pairwise distances should not be swarm's job, as swarm is not a classic clustering method. Nonetheless, the script above shows how intra-OTU distances can be computed after clustering with swarm.

I am going to close that issue.