Minor code issues that need fixing to get Anacapa working

gmteunisse commented 2 years ago

Hi all, I was trying to get the latest version of Anacapa to run through the Singularity container and ran into a few lines of code that need fixing to get Anacapa to run. The first is an erroneous (and double) print statement in line 327 in blca_from_bowtie.py, which results in termination of the script and therefore failure of the pipeline.

The second is a problem in local mode in the run_*_blca.sh scripts, which pass -p ${DB}/muscle as the muscle path to blca_from_bowtie.py, where ${DB} points to the Anacapa_db directory. The result is failure on line 369. This should point to the muscle path as specifiied in anacapa_config.sh (related to issue #40 ?)

As I'm not sure whether this pipeline is still being maintained, I've attached all code that is required to get Anacapa running through Singularity in local mode. NOTE: I get slightly different taxonomy annotations in the 12S example, see issue #60.

Download and modify files

#!/bin/bash
# Path to install Anacapa
BASE_PATH="/path/to/preferred/directory"
ANACAPA_PATH="${BASE_PATH}/anacapa"

# Download singularity container
mkdir ${ANACAPA_PATH}
cd ${ANACAPA_PATH}
wget https://zenodo.org/record/2602180/files/anacapa-1.5.0.img?download=1 \
    -O anacapa-1.5.0.img

# Test if container can be executed and whether muscle is callable
# singularity shell ${ANACAPA_PATH}/anacapa-1.5.0.img
# muscle
# exit

# Clone the Anacapa repository
git clone https://github.com/limey-bean/Anacapa

# Replace the configuration with one for singularity usage
CONFIG_PATH=${ANACAPA_PATH}/Anacapa/Anacapa_db/scripts/anacapa_config.sh
mv ${CONFIG_PATH} ${CONFIG_PATH/config.sh/config.sh.bak}
wget https://raw.githubusercontent.com/dat-ecosystem-archive/anacapa-container/master/config/anacapa_config.sh \
    -O ${CONFIG_PATH}

# Remove a print statement that causes an error in the BLCA procedure
BLCA_PY_PATH=${ANACAPA_PATH}/Anacapa/Anacapa_db/scripts/blca_from_bowtie.py
sed -i.bak '327d' ${BLCA_PY_PATH}

# Alter the path to MUSCLE in the run_blca.sh script - it assumes that
# MUSCLE is callabale from Anacapa_db/muscle, whereas it is called from
# the $PATH variable in the container. This is the default option,
# so the -p option can be removed
BLCA_SH_PATH=${ANACAPA_PATH}/Anacapa/Anacapa_db/scripts/run_blca.sh
BOWTIE_BLCA_SH_PATH=${ANACAPA_PATH}/Anacapa/Anacapa_db/scripts/run_bowtie2_blca.sh
sed -i.bak 's;-p ${DB}/muscle;;g' ${BLCA_SH_PATH}
sed -i.bak 's;-p ${DB}/muscle;;g' ${BOWTIE_BLCA_SH_PATH}

Run the 12S example

#!/bin/bash
# Define paths
BASE_PATH="/path/to/preferred/directory"
ANACAPA_PATH="${BASE_PATH}/anacapa/Anacapa"
CONTAINER_PATH="${BASE_PATH}/anacapa/anacapa-1.5.0.img"

# Unzip the 12S bowtie and tax databases
cd ${ANACAPA_PATH}
unzip ${ANACAPA_PATH}/Example_data/12S_Oct2019.zip 12S_Oct2019/*
mv 12S_Oct2019 ${ANACAPA_PATH}/Anacapa_db/12S

# Test the QC and ASV parsing module
singularity exec -B ${ANACAPA_PATH} ${CONTAINER_PATH} /bin/bash -c "${ANACAPA_PATH}/Anacapa_db/anacapa_QC_dada2.sh \
    -i ${ANACAPA_PATH}/Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_test_data \
    -o ${ANACAPA_PATH}/Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_time_test \
    -d ${ANACAPA_PATH}/Anacapa_db \
    -f ${ANACAPA_PATH}/Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_test_data/forward.txt \
    -r ${ANACAPA_PATH}/Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_test_data/reverse.txt \
    -e ${ANACAPA_PATH}/Anacapa_db/metabarcode_loci_min_merge_length.txt \
    -a nextera \
    -t MiSeq \
    -l"

# Compare the output with the expected output. Should not return any lines.
OUT="${ANACAPA_PATH}/Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_time_test/12S"
EXP="${ANACAPA_PATH}/Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/Anacapa_test_data_expected_output_after_QC_dada2/12S"
git diff --no-index --stat ${EXP} ${OUT}

# Test the taxonomic classification module
singularity exec -B ${ANACAPA_PATH} ${CONTAINER_PATH} /bin/bash -c \
    "${ANACAPA_PATH}/Anacapa_db/anacapa_classifier.sh \
    -o ${ANACAPA_PATH}/Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_time_test \
    -d ${ANACAPA_PATH}/Anacapa_db \
    -l"

# Compare the output with the expected output
OUT="${ANACAPA_PATH}/Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/12S_time_test/12S"
EXP="${ANACAPA_PATH}/Example_data/12S_example_anacapa_QC_dada2_and_BLCA_classifier/Anacapa_test_data_expected_output_after_classifier/12S"
git diff --no-index --stat ${EXP} ${OUT}

Edit: added reference to issue #60.

Sbu211 commented 2 years ago

Hi thanks for this. I have been struggling with the Anacapa pipeline for days now. I need to get it working so I can run it through my eDNA data for my MSc project. One quick question, the line of code you just put there above do i run it straight on my terminal or should I copy and paste it into a script and run it once as a script? I am still new to Llinux and its commands, I have a ubuntu 20.0 LTS installed on a VM on my laptop. I tried following the tutorial that the Anacapa authors put out, however when I enter the container and run the "run-anacapa-qc.sh file I get an error. So I am hoping your line of code helps

gmteunisse commented 2 years ago

No worries, glad I could help. Either option will work: line by line in the terminal, or copy to a .sh script and running it with bash <script>.sh. If you go for the latter, copy the two code blocks into two separate shell scripts and run each separately. The first will download, install and update the code, the second will run the 12S example.

Sbu211 commented 2 years ago

Thank you so much for this it really helped. Code ran smoothly...without it I would have given up and used another pipeline I am considering The pipeline is similar but everything can be run on R. Basically Dada2 however I assign the taxonomy using assignTaxonomy or IdTaxa (both R functions from the DECIPHER package). If you have tried it or heard about it, please advice on it.

But all in all thank you for this script. I really appreciate this.

I however received slightly different results to your run on https://github.com/limey-bean/Anacapa/issues/60

.../12S/12S_taxonomy_tables/12S_ASV_taxonomy_brief.txt | 32 +-- .../12S/12S_taxonomy_tables/12S_ASV_taxonomy_detailed.txt | 32 +-- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/100/12S_ASV_raw_taxonomy_100.txt | 8 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/100/12S_ASV_sum_by_taxonomy_100.txt | 6 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/40/12S_ASV_raw_taxonomy_40.txt | 4 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/40/12S_ASV_sum_by_taxonomy_40.txt | 2 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/50/12S_ASV_raw_taxonomy_50.txt | 4 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/50/12S_ASV_sum_by_taxonomy_50.txt | 2 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/60/12S_ASV_raw_taxonomy_60.txt | 4 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/60/12S_ASV_sum_by_taxonomy_60.txt | 2 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/70/12S_ASV_raw_taxonomy_70.txt | 4 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/70/12S_ASV_sum_by_taxonomy_70.txt | 2 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/80/12S_ASV_raw_taxonomy_80.txt | 4 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/80/12S_ASV_sum_by_taxonomy_80.txt | 2 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/90/12S_ASV_raw_taxonomy_90.txt | 4 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/90/12S_ASV_sum_by_taxonomy_90.txt | 2 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/95/12S_ASV_raw_taxonomy_95.txt | 8 +- .../12S/12S_taxonomy_tables/Summary_by_percent_confidence/95/12S_ASV_sum_by_taxonomy_95.txt | 7 +- .../12S/12Sbowtie2_out/12S_bowtie2_all.sam | 702 ++++++++++++++++++++++++++----------------------- .../12S/12Sbowtie2_out/12S_bowtie2_all.sam.blca.out | 32 +-- .../12S/12Sbowtie2_out/single_read_forward_12S_end_to_end.sam | 33 ++- .../12S/12Sbowtie2_out/single_read_forward_12S_local.sam | 118 ++++----- .../12S/12Sbowtie2_out/single_read_merged_12S_end_to_end.sam | 351 ++++++++++++++----------- .../12S/12Sbowtie2_out/single_read_merged_12S_local.sam | 200 +++++++------- 24 files changed, 828 insertions(+), 737 deletions(-)

If I may ask? Should I want to run this on COI data I would have to tweak the scripts your script and the scripts that come with the Anacapa pipeline? i.e the paths, metabarcode

gmteunisse commented 2 years ago

Glad it worked! I can vouch for DADA2's ease of use and denoising capacities (it's also part of Anacapa, actually), but I've never used it for COI taxonomy assignment - there is no default training set to use with RDP. You might be able to get some inspiration here if you do want to go down that path: https://github.com/benjjneb/dada2/issues/922.

If you want to run Anacapa on COI data, you would have to download the COI taxonomy database (or generate your own using CRUX) and then tweak some of the paths and scripts. I haven't used Anacapa since running the example, though, so I can't help you with that. I think the tutorial can help you figure out what needs to go where. All the best!

Sbu211 commented 2 years ago

Thank you very much.. I will surely do that.

limey-bean / Anacapa

Minor code issues that need fixing to get Anacapa working #59

Download and modify files

Run the 12S example