valeriafloral / Abies_foliar_mycobiome

In this repository you can find data and scripts used in the project Fungal endophytes and their role in sacred fir's resistance to ozone
GNU General Public License v3.0
2 stars 0 forks source link

RegEx in files names #11

Closed valeriafloral closed 3 years ago

valeriafloral commented 3 years ago

Context

When working with meta"omics" data, it could be possible to separate the host reads if a reference is available.

image Fig 1. Mapping reads to a reference.

For me, it is possible to remove the host reads because I can use the Abies balsamea transcriptome as a reference. However, after the quality filer and the mapping to the reference the remaining reads are presented in a very low percentage.

image Fig. 2 Total number of reads per sample.

In order to keep as much intormation as posssible, I'd like to use the unpaired reads. After the quality filter I used the BWA software to map the reads to the reference transcriptome. My mapping script with paired reads has 4 steps:

For analise the unpaired data I must run each dataset separately (R1 and R2). So I need to run the previous 4 steps twice.

My final script has 75 lines, so I ask you if you could help me to improve my script in order not to run the unpair analysis twice.

My Script is here

The filtered files names are [Sample] [L007] [R1/R2] [001] [paired/unpaired].fq.gz For example: DPVR1_S179_L007_R1_001_paired.fq.gz

In my script I'm using the sample name, but maybe there's an easiest form with RegEx.

for f in DPVR1_S179 DPVR2_S180 DPVR3_S181 DPVR4_S182 DPVR5_S183 DPVR6_S184 DPVR7_S185 DPVR8_S186 DPVR9_S187 DPVR10_S188 DPVR11_S189 DPVR12_S190 DPVR13_S191 DPVR14_S192 DPVR15_S193 DPVR16_S194;
do bwa mem -t 4 ../data/filter/reference/GCAT_AB-RNA-1.0.16.fa ../data/filter/outputs/${f}_L007_R1_001_paired.fq.gz ../data/filter/outputs/${f}_L007_R2_001_paired.fq.gz > ../data/filter/outputs/${f}_paired.sam;
done

The total files are here

Here is my repo repl.it Thank you

bgrueda commented 3 years ago

for i in trimmed*R1.fastq; do bwa mem hg38.fasta $i ${i%1.fastq}2.fastq > aln${i%R1.fastq}.sam 2>bwa${i%R1.fastq}.log; done

Look at this option for Regex to differentiate R1 and R2 and we could play with it to complete the command line.

IsauraRReinhold commented 3 years ago

Maybe yo can use something like this

for i in ls ../data/filter/outputs/ | grep "*_/paired.fq.gz|unpaired.fq.gz" | do

IsauraRReinhold commented 3 years ago

I do not know if somethig like this may works

` for i in ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_paired.fq.gz//"| sed "s/_unpaired.fq.gz//" | uniq do

samtools view -bS ../data/filter/outputs/${i}_paired.sam > ../data/filter/outputs/${i}_paired.bam; samtools view -bS ../data/filter/outputs/${i}_unpaired.sam > ../data/filter/outputs/${i}_unpaired.bam;

done

bgrueda commented 3 years ago

for i in ls ../data/filter/outputs/ | grep "*_/paired.fq.gz|unpaired.fq.gz" | do bwa mem -t 4 ../data/filter/reference/GCAT_AB-RNA-1.0.16.fa $i ${i%1_001_paired.fq.gz}2_001_paired.fq.gz > ${i%R1.fastq}_paired.sam | bwa mem -aM -t 4 ../data/filter/reference/GCAT_AB-RNA-1.0.16.fa ../data/filter/outputs/${i}_L007_R?_001_unpaired.fq.gz > ../data/filter/outputs/${i}_R?_unpaired.sam; done.

I don't know if could be useful something like this. I didn't probe it yet.

valeriafloral commented 3 years ago

Until now, with your suggestions, I have been able to do a test script.

I modified the command to extract the samples files:

for f in `ls ../data/filter/bwatest/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;

It works with echoes tests. I am testing it in a cluster with a subset, so I am not going to close the issue until I'm sure it works.

valeriafloral commented 3 years ago

My analysis just finished and IT WORKED!!!!! Thank you very much for your help!

This is the final script (It ended with 43 lines) :D!!!:

#!/bin/bash

#SBATCH -w nodo3
#SBATCH -n 6

# Valeria Flores
#26/01/2021
#Remove Abies balsamea reads with BWA

#Make index
bwa index -a bwtsw ../data/filter/reference/GGJG01.1.fasta
samtools faidx ../data/filter/reference/GGJG01.1.fasta
makeblastdb -in ../data/filter/reference/GGJG01.1.fasta -dbtype nucl

#Perform alignments for paired and unpaired reads with BWA
for f in `ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;
do bwa mem -t 4 ../data/filter/reference/GGJG01.1.fasta ../data/filter/outputs/${f}_L007_R1_001_paired.fq.gz ../data/filter/outputs/${f}_L007_R2_001_paired.fq.gz > ../data/filter/outputs/${f}_paired.sam;
bwa mem -aM -t 4 ../data/filter/reference/GGJG01.1.fasta ../data/filter/outputs/${f}_L007_R1_001_unpaired.fq.gz > ../data/filter/outputs/${f}_R1_unpaired.sam;
bwa mem -aM -t 4 ../data/filter/reference/GGJG01.1.fasta ../data/filter/outputs/${f}_L007_R2_001_unpaired.fq.gz > ../data/filter/outputs/${f}_R2_unpaired.sam;
done

#Convert .sam to .bam using samtools
for f in `ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;
do samtools view -bS ../data/filter/outputs/${f}_paired.sam > ../data/filter/outputs/${f}_paired.bam;
samtools view -bS ../data/filter/outputs/${f}_R1_unpaired.sam > ../data/filter/outputs/${f}_R1_unpaired.bam;
samtools view -bS ../data/filter/outputs/${f}_R2_unpaired.sam > ../data/filter/outputs/${f}_R2_unpaired.bam;
done

#Generate fastq outputs for all reads that did not map to the host reference genome (-f 4)
#This is the datatset that is going to be used
for f in `ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;
do samtools fastq -n -f 4 -0 ../data/filter/outputs/${f}_paired_bulk.fastq ../data/filter/outputs/${f}_paired.bam;
samtools fastq -n -f 4 -0 ../data/filter/outputs/${f}_R1_unpaired_bulk.fastq ../data/filter/outputs/${f}_R1_unpaired.bam;
samtools fastq -n -f 4 -0 ../data/filter/outputs/${f}_R2_unpaired_bulk.fastq ../data/filter/outputs/${f}_R2_unpaired.bam;
done

#Statistics
for f in `ls ../data/filter/outputs/ | grep ".fq.gz" | sed "s/_L007_R1_001_paired.fq.gz//"| sed "s/_L007_R2_001_paired.fq.gz//" | sed "s/_L007_R1_001_unpaired.fq.gz//"| sed "s/_L007_R2_001_unpaired.fq.gz//"| uniq`;
do samtools stats ../data/filter/outputs/${f}_paired.bam > ../data/reports/mapping/${f}_hostmap_paired.txt;
samtools stats ../data/filter/outputs/${f}_R1_unpaired.bam > ../data/reports/mapping/${f}_hostmap_R1_unpaired.txt;
samtools stats ../data/filter/outputs/${f}_R2_unpaired.bam > ../data/reports/mapping/${f}_hostmap_R2_unpaired.txt;
done

Finally I can close the issue.