Closed valeriafloral closed 3 years ago
for i in trimmed*R1.fastq; do bwa mem hg38.fasta $i ${i%1.fastq}2.fastq > aln${i%R1.fastq}.sam 2>bwa${i%R1.fastq}.log; done
Look at this option for Regex to differentiate R1 and R2 and we could play with it to complete the command line.
Maybe yo can use something like this
for i in ls ../data/filter/outputs/ | grep "*_/paired.fq.gz|unpaired.fq.gz" | do
I do not know if somethig like this may works
`
for i in ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_paired.fq.gz//"| sed "s/_unpaired.fq.gz//" | uniq
do
samtools view -bS ../data/filter/outputs/${i}_paired.sam > ../data/filter/outputs/${i}_paired.bam; samtools view -bS ../data/filter/outputs/${i}_unpaired.sam > ../data/filter/outputs/${i}_unpaired.bam;
done
for i in ls ../data/filter/outputs/ | grep "*_/paired.fq.gz|unpaired.fq.gz" | do bwa mem -t 4 ../data/filter/reference/GCAT_AB-RNA-1.0.16.fa $i ${i%1_001_paired.fq.gz}2_001_paired.fq.gz > ${i%R1.fastq}_paired.sam | bwa mem -aM -t 4 ../data/filter/reference/GCAT_AB-RNA-1.0.16.fa ../data/filter/outputs/${i}_L007_R?_001_unpaired.fq.gz > ../data/filter/outputs/${i}_R?_unpaired.sam; done.
I don't know if could be useful something like this. I didn't probe it yet.
Until now, with your suggestions, I have been able to do a test script.
I modified the command to extract the samples files:
for f in `ls ../data/filter/bwatest/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;
It works with echo
es tests. I am testing it in a cluster with a subset, so I am not going to close the issue until I'm sure it works.
My analysis just finished and IT WORKED!!!!! Thank you very much for your help!
This is the final script (It ended with 43 lines) :D!!!:
#!/bin/bash
#SBATCH -w nodo3
#SBATCH -n 6
# Valeria Flores
#26/01/2021
#Remove Abies balsamea reads with BWA
#Make index
bwa index -a bwtsw ../data/filter/reference/GGJG01.1.fasta
samtools faidx ../data/filter/reference/GGJG01.1.fasta
makeblastdb -in ../data/filter/reference/GGJG01.1.fasta -dbtype nucl
#Perform alignments for paired and unpaired reads with BWA
for f in `ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;
do bwa mem -t 4 ../data/filter/reference/GGJG01.1.fasta ../data/filter/outputs/${f}_L007_R1_001_paired.fq.gz ../data/filter/outputs/${f}_L007_R2_001_paired.fq.gz > ../data/filter/outputs/${f}_paired.sam;
bwa mem -aM -t 4 ../data/filter/reference/GGJG01.1.fasta ../data/filter/outputs/${f}_L007_R1_001_unpaired.fq.gz > ../data/filter/outputs/${f}_R1_unpaired.sam;
bwa mem -aM -t 4 ../data/filter/reference/GGJG01.1.fasta ../data/filter/outputs/${f}_L007_R2_001_unpaired.fq.gz > ../data/filter/outputs/${f}_R2_unpaired.sam;
done
#Convert .sam to .bam using samtools
for f in `ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;
do samtools view -bS ../data/filter/outputs/${f}_paired.sam > ../data/filter/outputs/${f}_paired.bam;
samtools view -bS ../data/filter/outputs/${f}_R1_unpaired.sam > ../data/filter/outputs/${f}_R1_unpaired.bam;
samtools view -bS ../data/filter/outputs/${f}_R2_unpaired.sam > ../data/filter/outputs/${f}_R2_unpaired.bam;
done
#Generate fastq outputs for all reads that did not map to the host reference genome (-f 4)
#This is the datatset that is going to be used
for f in `ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;
do samtools fastq -n -f 4 -0 ../data/filter/outputs/${f}_paired_bulk.fastq ../data/filter/outputs/${f}_paired.bam;
samtools fastq -n -f 4 -0 ../data/filter/outputs/${f}_R1_unpaired_bulk.fastq ../data/filter/outputs/${f}_R1_unpaired.bam;
samtools fastq -n -f 4 -0 ../data/filter/outputs/${f}_R2_unpaired_bulk.fastq ../data/filter/outputs/${f}_R2_unpaired.bam;
done
#Statistics
for f in `ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;
do samtools stats ../data/filter/outputs/${f}_paired.bam > ../data/reports/mapping/${f}_hostmap_paired.txt;
samtools stats ../data/filter/outputs/${f}_R1_unpaired.bam > ../data/reports/mapping/${f}_hostmap_R1_unpaired.txt;
samtools stats ../data/filter/outputs/${f}_R2_unpaired.bam > ../data/reports/mapping/${f}_hostmap_R2_unpaired.txt;
done
Finally I can close the issue.
Context
When working with meta"omics" data, it could be possible to separate the host reads if a reference is available.
Fig 1. Mapping reads to a reference.
For me, it is possible to remove the host reads because I can use the Abies balsamea transcriptome as a reference. However, after the quality filer and the mapping to the reference the remaining reads are presented in a very low percentage.
Fig. 2 Total number of reads per sample.
In order to keep as much intormation as posssible, I'd like to use the unpaired reads. After the quality filter I used the BWA software to map the reads to the reference transcriptome. My mapping script with paired reads has 4 steps:
For analise the unpaired data I must run each dataset separately (R1 and R2). So I need to run the previous 4 steps twice.
My final script has 75 lines, so I ask you if you could help me to improve my script in order not to run the unpair analysis twice.
My Script is here
The filtered files names are [Sample] [L007] [R1/R2] [001] [paired/unpaired].fq.gz For example: DPVR1_S179_L007_R1_001_paired.fq.gz
In my script I'm using the sample name, but maybe there's an easiest form with RegEx.
The total files are here
Here is my repo repl.it Thank you