torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
121 stars 23 forks source link

OTU table without abundance values #168

Closed AshleyZhao20 closed 2 years ago

AshleyZhao20 commented 2 years ago

Hi, I was trying to make an OTU table with the pipeline here, but the ultimate OTU table turned out of lacking abundance values. I was wondering if there are format problems or not, so I used your methods in #162 issue, but unfortunately, it didn't work out. I would be grateful if you can indicate the problem of my input files. By the way, the primary data are PacBio sequencing data. My input files are attached here. The files include amplicon_contingency_table, amplicons.swarms and amplicons.stats.

test20211130.zip

Many thanks! Ashley

frederic-mahe commented 2 years ago

Hi Ashley, sorry for not answering earlier. I'll look into your data soon, and come up with a more complete answer.

AshleyZhao20 commented 2 years ago

Hi Frédéric, thank you for the notification. I’ve been looking forward to the answer.

frederic-mahe commented 2 years ago

Here is the code that builds the final table:

STATS="amplicons.stats"
SWARMS="amplicons.swarms"
AMPLICON_TABLE="amplicon_contingency_table.csv"
OTU_TABLE="OTU_contingency_table.csv"

# Header
echo -e "OTU\t$(head -n 1 "${AMPLICON_TABLE}")" > "${OTU_TABLE}"

# Compute "per sample abundance" for each OTU
awk -v SWARM="${SWARMS}" \
    -v TABLE="${AMPLICON_TABLE}" \
    'BEGIN {FS = " "
            while ((getline < SWARM) > 0) {
                swarms[$1] = $0
            }
            FS = "\t"
            while ((getline < TABLE) > 0) {
                table[$1] = $0
            }
           }

     {# Parse the stat file (OTUs sorted by decreasing abundance)
      seed = $3
      n = split(swarms[seed], OTU, " ")
      for (i = 1; i < n; i = i + 2) {
          s = split(table[OTU[i]], abundances, "\t")
          for (j = 1; j < s; j++) {
              samples[j] += abundances[j+1]
          }
      }
      printf "%s\t%s", NR, $3
      for (j = 1; j < s; j++) {
          printf "\t%s", samples[j]
      }
     printf "\n"
     delete samples
     }' "${STATS}" >> "${OTU_TABLE}"

I've changed the line seed = $3 "_" $4 for seed = $3 to adapt the code to the your amplicon name format.

However, your dataset seems to be incomplete. For example, amplicon TDA23-3508 forms its own cluster and is marked as representing 6 reads (see amplicons.stats), but a look at amplicon_contingency_table.csv shows that TDA23-3508 has a total of 2 reads, all from sample TDA23. So 4 reads are missing here, and the resulting OTU table will likely be wrong.

Did you perform a global dereplication of your fasta data before clustering?

AshleyZhao20 commented 2 years ago

Hi Frédéric, Thank you for the efforts, I really appreciated that. The code works for my data, but some OTUs are still lack of abundances. I do perform a global dereplication using USEARCH before clustering, is that the reason why the OTU table lose some information?

All the best Ashley

frederic-mahe commented 2 years ago

I do perform a global dereplication using USEARCH before clustering, is that the reason why the OTU table lose some information?

Performing a global dereplication is strongly recommended, so it is good you did it. When dereplicating, did you use the --sizein option to take into account the per-sample abundance values?

There are count differences between your stats, swarms and table files, so something is wrong. However, it is hard for me to say exactly what step of your process went wrong.

AshleyZhao20 commented 2 years ago

Sorry for the delay. I added -sizein option and redone the whole process, but unfortunately, there is still some absence data. I'm gonna try several more times and see if I can work this out. Thank you again for your effort.