transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0

Where is the File 2 (R2) or reverse file in the script? #3

Closed by camilleberrocal 6 years ago

camilleberrocal commented 6 years ago

I’ve been looking at the master_script.sh of SAMSA2, and I’ve noticed that the line that identifies the reverse (R2) file is not really clear. Instead, it shows an echo for file 1 (R1) and an ‘awk’ command that outputs a file name with R2 printed in it.

Considering that two independent files exist, one for the forward and one for the reverse sequences, and that SAMSA2 can call one of the files, why does it not call file 2 (R2)?


# STEP 1: MERGING OF PAIRED-END FILES USING PEAR
# Note: paired-end files are usually named using R1 and R2 in the name.
# Example: control_1.R1.fastq
#          control_1.R2.fastq
#
# Note: if using single-end sequencing, skip this step (comment out).
# Note: if performing R analysis (step 6), be sure to name files with
# the appropriate prefix ("control_$file" and "experimental_$file")!

cd $starting_files_location
for file in $starting_files_location/*.gz
do
    gunzip $file
done

for file in $starting_files_location/*_R1*
do
    file1=$file
    file2=`echo $file1 | awk -F "R1" '{print $1 "R2" $2}'`
    out_path=`echo $file | awk -F "_R1" '{print $1 ".merged"}'`
    out_name=`echo ${out_path##*/}`

    $pear_location/pear -f $file1 -r $file2 -o $out_name
done

mkdir $starting_location/step_1_output/
mv $starting_files_location/*merged* $starting_location/step_1_output/
echo -e "\nPaired-end merging step completed.\n"

####################################################################

I would like to know what exactly the command is doing for file 2 (R2).

Thank you for your time, Camille

transcript commented 6 years ago

Hi Camille,

This script is assuming that all of the forward and reverse files are placed in the same directory, presumably fresh from the sequencer. For example, a directory might contain the files:

The lines of the SAMSA2 script that you pasted here differentiate between the forward and reverse files. Each forward file (one whose name contains "_R1") is saved as $file1. The script then echoes this file name, splits it at "R1", and prints "R2" in its place, saving the result as $file2. This way, you don't need to tell the script that "this group of files are forward files, and this other group here are the reverse files." Instead, the script assumes that, for each forward file, the reverse file will have the same name, except for having "R2" instead of "R1".
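To see that pairing assumption in action, here is a minimal sketch that runs the same name derivation over a scratch directory. The file names and the temporary directory are made up for illustration, and the existence check on the reverse file is an addition that the pasted script does not perform.

```shell
# Sketch only: all file names below are hypothetical.
orig_dir=$PWD
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Two complete pairs, plus one forward file whose reverse mate is missing.
touch control_1_R1.fastq control_1_R2.fastq \
      experimental_1_R1.fastq experimental_1_R2.fastq \
      unpaired_R1.fastq

pair_count=0
for file1 in *_R1*; do
    # Same derivation as the script: swap "R1" for "R2" in the name.
    file2=$(echo "$file1" | awk -F "R1" '{print $1 "R2" $2}')
    if [ -e "$file2" ]; then
        echo "forward: $file1   reverse: $file2"
        pair_count=$((pair_count + 1))
    else
        # Not in the original script: warn when the expected mate is absent.
        echo "no reverse file found for $file1" >&2
    fi
done
echo "pairs found: $pair_count"

# Clean up the scratch directory.
cd "$orig_dir" || exit 1
rm -rf "$demo_dir"
```

Only the forward files are ever listed by the loop. Each reverse file is reached purely through the derived name, which is why the naming convention has to hold for every pair.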

As for exactly how the $file2 command works:

1. It echoes $file1's name (echo $file1).
2. This name gets piped into awk, which splits it at the separator given by the -F flag (R1, in this case).
3. The print command prints the fields of the awk-separated name: $1 is everything before the R1 split, and $2 is everything after it. The literal "R2" is printed between them.
4. All of this gets returned as a single string, which becomes the $file2 variable.
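The same steps can be tried on a single name at the command line. This is a small sketch that assumes a hypothetical file called control_1_R1.fastq. It uses $(...) for command substitution, which behaves like the backticks in the script.

```shell
# Hypothetical file name, used only for illustration.
file1="control_1_R1.fastq"

# awk splits the name at "R1" into two fields.
before=$(echo "$file1" | awk -F "R1" '{print $1}')   # text before the split
after=$(echo "$file1" | awk -F "R1" '{print $2}')    # text after the split

# Printing $1, the literal "R2", and $2 rebuilds the name with R2 in place of R1.
file2=$(echo "$file1" | awk -F "R1" '{print $1 "R2" $2}')

echo "before: $before"
echo "after:  $after"
echo "file2:  $file2"
```

With this name, the two fields are "control_1_" and ".fastq", so $file2 ends up as control_1_R2.fastq. Note that awk splits at every occurrence of "R1", so a name containing "R1" more than once (for example, in a directory name) would not be rebuilt correctly by this command.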

Best, Sam

In reply to camilleberrocal's original message above, sent Tue, Apr 24, 2018 at 12:05 PM.

-- Sam Westreich DEB Biotechnology Fellow Integrative Genetics and Genomics Graduate Group, University of California, Davis College of Biological Sciences, University of Minnesota

Are you doing what you want to be doing?

camilleberrocal commented 6 years ago

Thank you so much, I appreciate the clarification! I get it now. It's rearranging the name before calling the variable.

Thanks!