Closed SanniH closed 3 years ago
hi @SanniH
As far as I can tell, there are four issues with your input data. Once they are fixed, the code below works as expected and produces an occurrence table, OTU_contingency_table.csv:
STATS="UTILA.stats"
SWARMS="UTILA_swarm_output"
AMPLICON_TABLE="Utila_ESV_table.csv"
OTU_TABLE="OTU_contingency_table.csv"
## fix issues:
# - convert from DOS to unix
# - 'SWARMS' should be swarm's output, not a fasta file with representative sequences
# - input table should be tsv, not csv
# - abundance format should be "_", not ";size="
dos2unix "${AMPLICON_TABLE}" "${SWARMS}" "${STATS}"
sed -i 's/,/\t/g' "${AMPLICON_TABLE}"
sed -i 's/;size=/_/g ; s/;//g' "${SWARMS}"
echo -e "OTU\t$(head -n 1 "${AMPLICON_TABLE}")" > "${OTU_TABLE}"
awk \
    -v SWARM="${SWARMS}" \
    -v TABLE="${AMPLICON_TABLE}" \
    'BEGIN {FS = " "
            while ((getline < SWARM) > 0) {
                swarms[$1] = $0
            }
            FS = "\t"
            while ((getline < TABLE) > 0) {
                table[$1] = $0
            }
           }
     {# Parse the stat file (OTUs sorted by decreasing abundance)
      seed = $3 "_" $4
      n = split(swarms[seed], OTU, "[ _]")
      for (i = 1; i < n; i = i + 2) {
          s = split(table[OTU[i]], abundances, "\t")
          for (j = 1; j < s; j++) {
              samples[j] += abundances[j+1]
          }
      }
      printf "%s\t%s", NR, $3
      for (j = 1; j < s; j++) {
          printf "\t%s", samples[j]
      }
      printf "\n"
      delete samples
     }' "${STATS}" >> "${OTU_TABLE}"
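The three in-place format fixes can be tried on toy data before touching the real files. This is only a minimal sketch, assuming GNU sed: the file names and contents are made up, and `sed 's/\r$//'` stands in for dos2unix so that the example has no extra dependency.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical toy files, not the real data
WORKDIR="$(mktemp -d)"
TOY_TABLE="${WORKDIR}/toy_table.csv"
TOY_SWARMS="${WORKDIR}/toy_swarm_output"

# Comma-separated table with DOS line endings
printf 'id,100a,101a\r\nuniq1,0,3\r\nuniq2,2,4\r\n' > "${TOY_TABLE}"
# Swarm output with ";size=" abundance annotations
printf 'uniq2;size=6; uniq1;size=3;\r\n' > "${TOY_SWARMS}"

# Strip carriage returns (stand-in for dos2unix)
sed -i 's/\r$//' "${TOY_TABLE}" "${TOY_SWARMS}"
# Commas to tabs
sed -i 's/,/\t/g' "${TOY_TABLE}"
# ";size=" to "_", then drop the remaining ";"
sed -i 's/;size=/_/g ; s/;//g' "${TOY_SWARMS}"

# Show the converted files
cat "${TOY_TABLE}" "${TOY_SWARMS}"
```

After the conversion the table is tab-separated and each amplicon in the swarm line reads `name_abundance`, which is the form the awk script splits on with `"[ _]"`.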
Yes perfect thank you!! That works :)
Cheers, Sanni
Hi!
I have been trying to use the example here to make an OTU table after running SWARM. I am using an amplicon table (.csv) generated via a different pipeline, and I was hoping you might be able to help me figure out why it is not working for me (code copied below).
I have tried two table layouts, both with amplicons on rows: one where col1 = amplicon id and the remaining cols are samples, and my original table where col1 = amplicon id, col2 = total abundance, cols n-m are samples, and the final col is the amplicon sequence. Neither works. I also tried both the swarm output file and the swarm fasta of representative sequences as the "SWARMS" file in the code, but perhaps I am misunderstanding the naming?
(Both examples of my input csv below are made up, not real numbers.)
e.g. 1
id,100a,101a,102a, ...
uniq1,0,3,4, ...
uniq2,2,4,5, ...
...
e.g. 2
id,size,100a,101a,102a, ... ,sequence
uniq1,100,0,3,4, ... ,ATGCGATAG
uniq2,213,2,4,5, ... ,GTAGATTGA
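Going from the second layout to the first can be sketched in a few lines of awk. This is an illustration only: the helper name `drop_size_and_sequence` and the toy values are made up, and it assumes the total abundance is always the second column and the sequence always the last one.

```shell
#!/usr/bin/env bash
set -eu

# Keep the id (column 1) and the per-sample counts; drop column 2
# (total abundance) and the last column (sequence).
# Hypothetical helper, assuming a comma-separated input table.
drop_size_and_sequence() {
    awk 'BEGIN {FS = OFS = ","}
         {
             line = $1
             for (i = 3; i < NF; i++) {
                 line = line OFS $i
             }
             print line
         }' "$1"
}

# Made-up toy table in the second layout
TOY="$(mktemp)"
printf 'id,size,100a,101a,102a,sequence\n' > "${TOY}"
printf 'uniq1,7,0,3,4,ATGCGATAG\n' >> "${TOY}"
printf 'uniq2,11,2,4,5,GTAGATTGA\n' >> "${TOY}"

drop_size_and_sequence "${TOY}"
```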
Code as copied from the example:
STATS="amplicons.stats"
SWARMS="amplicons.swarms"
AMPLICON_TABLE="amplicon_contingency_table.csv"
OTU_TABLE="OTU_contingency_table.csv"
echo -e "OTU\t$(head -n 1 "${AMPLICON_TABLE}")" > "${OTU_TABLE}"
awk -v SWARM="${SWARMS}" \
    -v TABLE="${AMPLICON_TABLE}" \
    'BEGIN {FS = " "
            while ((getline < SWARM) > 0) {
                swarms[$1] = $0
            }
            FS = "\t"
            while ((getline < TABLE) > 0) {
                table[$1] = $0
            }
           }
All I get in both cases is basically an empty csv table with amplicon ids and sample names, but no abundances. I haven't been able to figure out how to combine the information I have (i.e. the amplicon table) with the information I get from SWARM, so any help would be highly appreciated!
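An output table with identifiers but no abundances usually means that the lookups into the amplicon table find nothing, so a few quick checks on the input files can narrow things down. The following is a rough diagnostic sketch, not part of the original example: the function names and toy files are hypothetical, and the checks only cover line endings, the field separator, the abundance annotation style, and whether a file looks like fasta.

```shell
#!/usr/bin/env bash
set -eu

# Number of lines with a carriage return (DOS line endings)
count_crlf() { grep -c "$(printf '\r')" "$1" || true; }
# Number of tabs in the header line (0: the table is not tab-separated)
count_header_tabs() { head -n 1 "$1" | tr -cd '\t' | wc -c | tr -d ' '; }
# Number of lines carrying ";size=" abundance annotations
count_size_annotations() { grep -c ';size=' "$1" || true; }
# Number of fasta header lines (more than 0: fasta, not a list of clusters)
count_fasta_headers() { grep -c '^>' "$1" || true; }

# Made-up toy inputs that reproduce the problems
WORKDIR="$(mktemp -d)"
TABLE="${WORKDIR}/toy_table.csv"
SWARMS="${WORKDIR}/toy_swarm_output"
FASTA="${WORKDIR}/toy_representatives.fas"
printf 'id,100a,101a\r\nuniq1,0,3\r\n' > "${TABLE}"
printf 'uniq1;size=3;\r\n' > "${SWARMS}"
printf '>uniq1;size=3;\nATGCGATAG\n' > "${FASTA}"

echo "table: lines with CR         = $(count_crlf "${TABLE}")"
echo "table: tabs in header        = $(count_header_tabs "${TABLE}")"
echo "swarms: lines with ;size=    = $(count_size_annotations "${SWARMS}")"
echo "swarms: fasta header lines   = $(count_fasta_headers "${SWARMS}")"
echo "fasta: fasta header lines    = $(count_fasta_headers "${FASTA}")"
```

For the awk code to find matches, the table should report no carriage returns and at least one tab in its header, and the file passed as SWARMS should report no ";size=" annotations and no fasta headers.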
I've attached a gzipped folder of the files I used, in case you wish to try to replicate the problem. Files included: Utila_ESV_table.csv (amplicon table), UTILA.swarms (swarm fasta file), UTILA.stats, UTILA_swarm_output.
Attached: swarm2otu_issue.gz
Many thanks in advance, Sanni Hintikka