torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

table formats for making OTU table from swarm results #162

Closed SanniH closed 3 years ago

SanniH commented 3 years ago

Hi!

I have been trying to use the example here to make an OTU table after running SWARM. I am using an amplicon table (.csv) generated via a different pipeline, and I was hoping you might be able to help me figure out why it is not working for me? (code below copied)

I have tried two table layouts. In the first, col1 = amplicon id and the remaining columns are samples, with amplicons on rows. In the second (my original table), col1 = amplicon id, col2 = total abundance, cols n-m are samples, and the final col is the amplicon sequence. Neither works. I also tried using both the swarm output file and the swarm fasta of rep seqs as the "SWARMS" file as in the code, but perhaps I am misunderstanding the naming?

(Each of the examples below of my input csv is made up, not real numbers.)

e.g. 1
id,100a,101a,102a, ...
uniq1,0,3,4, ...
uniq2,2,4,5, ...
...

e.g. 2
id,size,100a,101a,102a, ... ,sequence
uniq1,100,0,3,4, ... ,ATGCGATAG
uniq2,213,2,4,5, ... ,GTAGATTGA

Code as copied from the example:

STATS="amplicons.stats"
SWARMS="amplicons.swarms"
AMPLICON_TABLE="amplicon_contingency_table.csv"
OTU_TABLE="OTU_contingency_table.csv"

echo -e "OTU\t$(head -n 1 "${AMPLICON_TABLE}")" > "${OTU_TABLE}"

awk -v SWARM="${SWARMS}" \
    -v TABLE="${AMPLICON_TABLE}" \
    'BEGIN {FS = " "
            while ((getline < SWARM) > 0) {
                swarms[$1] = $0
                }
            FS = "\t"
            while ((getline < TABLE) > 0) {
                table[$1] = $0
                }
           }

 {# Parse the stat file (OTUs sorted by decreasing abundance)
  seed = $3 "_" $4
  n = split(swarms[seed], OTU, "[ _]")
  for (i = 1; i < n; i = i + 2) {
      s = split(table[OTU[i]], abundances, "\t")
      for (j = 1; j < s; j++) {
          samples[j] += abundances[j+1]
      }
  }
  printf "%s\t%s", NR, $3
  for (j = 1; j < s; j++) {
      printf "\t%s", samples[j]
  }
 printf "\n"
 delete samples
 }' "${STATS}" >> "${OTU_TABLE}"

All I get in both cases is basically an empty csv table with amplicon ids and sample names, but no abundances. I haven't been able to figure out how to combine the information I have (i.e. the amplicon table) with the information I get from SWARM, so any help would be highly appreciated!

I've attached a gzipped folder of the files I've used in case you wish to try and replicate it (swarm2otu_issue.gz). Files included:
- Utila_ESV_table.csv (amplicon table)
- UTILA.swarms (swarm fasta file)
- UTILA.stats
- UTILA_swarm_output

Many thanks in advance, Sanni Hintikka

frederic-mahe commented 3 years ago

hi @SanniH

As far as I can tell, there are four issues with your input data. Once they are fixed, the code below works as expected and produces an occurrence table, OTU_contingency_table.csv:

STATS="UTILA.stats"
SWARMS="UTILA_swarm_output"
AMPLICON_TABLE="Utila_ESV_table.csv"
OTU_TABLE="OTU_contingency_table.csv"

##  fix issues:
# - convert from DOS to unix
# - 'SWARMS' should be swarm's output, not a fasta file with representative sequences
# - input table should be tsv, not csv
# - abundance format should be "_", not ";size="
dos2unix "${AMPLICON_TABLE}" "${SWARMS}" "${STATS}"
sed -i 's/,/\t/g' "${AMPLICON_TABLE}"
sed -i 's/;size=/_/g ; s/;//g' "${SWARMS}"

echo -e "OTU\t$(head -n 1 "${AMPLICON_TABLE}")" > "${OTU_TABLE}"

awk \
    -v SWARM="${SWARMS}" \
    -v TABLE="${AMPLICON_TABLE}" \
    'BEGIN {FS = " "
            while ((getline < SWARM) > 0) {
                swarms[$1] = $0
                }
            FS = "\t"
            while ((getline < TABLE) > 0) {
                table[$1] = $0
                }
           }

 {# Parse the stat file (OTUs sorted by decreasing abundance)
  seed = $3 "_" $4
  n = split(swarms[seed], OTU, "[ _]")
  for (i = 1; i < n; i = i + 2) {
      s = split(table[OTU[i]], abundances, "\t")
      for (j = 1; j < s; j++) {
          samples[j] += abundances[j+1]
      }
  }
  printf "%s\t%s", NR, $3
  for (j = 1; j < s; j++) {
      printf "\t%s", samples[j]
  }
 printf "\n"
 delete samples
 }' "${STATS}" >> "${OTU_TABLE}"
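
To make the expected input formats concrete, here is a minimal toy run of the same awk logic, offered as a sketch rather than a definitive version. All data are made up, the function name `build_otu_table` and the toy file names are hypothetical, and `tr` is used as a portable stand-in for `dos2unix` and `sed -i`. It assumes the stats layout the script above relies on (seed label in column 3, seed abundance in column 4).

```shell
#!/bin/sh
# Same awk logic as the script above, wrapped in a function
# (hypothetical name). Arguments: stats file, swarm output, amplicon table.
build_otu_table() {
    bot_stats="$1"; bot_swarms="$2"; bot_table="$3"
    # Header: "OTU" followed by the amplicon table's own header line
    printf 'OTU\t%s\n' "$(head -n 1 "$bot_table")"
    awk -v SWARM="$bot_swarms" -v TABLE="$bot_table" '
        BEGIN {
            FS = " "    # swarm output: one cluster per line, space-separated
            while ((getline < SWARM) > 0) { swarms[$1] = $0 }
            FS = "\t"   # amplicon table and stats file: tab-separated
            while ((getline < TABLE) > 0) { table[$1] = $0 }
        }
        {
            seed = $3 "_" $4                       # e.g. "a_4"
            n = split(swarms[seed], OTU, "[ _]")   # id, abundance, id, ...
            for (i = 1; i < n; i = i + 2) {
                s = split(table[OTU[i]], abundances, "\t")
                for (j = 1; j < s; j++) { samples[j] += abundances[j+1] }
            }
            printf "%s\t%s", NR, $3
            for (j = 1; j < s; j++) { printf "\t%s", samples[j] }
            printf "\n"
            delete samples
        }' "$bot_stats"
}

workdir="$(mktemp -d)"

# Made-up "raw" inputs showing the four problems: DOS line endings,
# comma-separated table, ";size=" abundance annotations.
printf 'id,s1,s2\r\na,3,1\r\nb,0,2\r\nc,5,0\r\n' > "$workdir/raw_table.csv"
printf 'a;size=4; b;size=2;\r\nc;size=5;\r\n'    > "$workdir/raw.swarms"
printf '2\t6\ta\t4\t0\t1\t1\r\n1\t5\tc\t5\t0\t0\t0\r\n' > "$workdir/raw.stats"

# Normalize: strip carriage returns, commas -> tabs, ";size=N;" -> "_N"
tr -d '\r' < "$workdir/raw_table.csv" | tr ',' '\t' > "$workdir/toy.tsv"
tr -d '\r' < "$workdir/raw.swarms" | sed 's/;size=/_/g; s/;//g' > "$workdir/toy.swarms"
tr -d '\r' < "$workdir/raw.stats" > "$workdir/toy.stats"

build_otu_table "$workdir/toy.stats" "$workdir/toy.swarms" "$workdir/toy.tsv"
# Prints (tab-separated):
# OTU  id  s1  s2
# 1    a   3   3
# 2    c   5   0

rm -r "$workdir"
```

After normalization, the swarm file reads `a_4 b_2` and `c_5`, so the first OTU sums the rows of amplicons `a` and `b` per sample. If an amplicon id in the swarm file has no matching row in the table, its abundances are silently skipped, which is one way to end up with the empty table described in the question.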

SanniH commented 3 years ago

Yes perfect thank you!! That works :)

Cheers, Sanni