Open ghost opened 1 year ago
For the trim step, I corrected my script about the paired parameters.
Paired-end options: The -A/-G/-B/-U/-Q options work like their lowercase counterparts, but are applied to R2 (second read in pair)
-A ADAPTER 3' adapter to be removed from R2 -G ADAPTER 5' adapter to be removed from R2 -B ADAPTER 5'/3 adapter to be removed from R2 -U LENGTH Remove LENGTH bases from R2 -Q [5'CUTOFF,]3'CUTOFF
# trim R1 5' end adapters and R2 3' end adapters
cutadapt -e 0.1 -n 1 -O 1 -q 6 -m 0:46 -g $R1_5'_end_adapters -A $R2_3'_end_adapters \
-o $sample.R1.a5trim.fastq.gz -p $sample.R2.a3trim.fastq.gz $raw_name.R1.fastq.gz $raw_name.R2.fastq.gz > $sample.a3a5trim.umi.round1.log
# trim R1 3' end adapters and R2 5' end adapters
cutadapt -e 0.1 -n 1 -O 7 -q 6 -m 0:46 -a $R1_3'_end_adapters -G $R2_5'_end_adapters \
-o $sample.R1.a3a5trim.fastq.gz -p $sample.R2.a3a5trim.fastq.gz $sample.R1.a5trim.fastq.gz $sample.R2.a3trim.fastq.gz > $sample.a3a5trim.umi.round2.log
I figured out why my $sample.R1.a3a5trim.fastq.gz results were empty before. It's happened in the umi_tools extract step. I added the step to remove header /1 and /2 suffixes, and I got the expected results.
The /1 and /2 suffixes in the FASTQ headers are commonly used to denote which member of a paired-end read a particular sequence belongs to. The /1 suffix indicates the forward read, and the /2 suffix indicates the reverse read. However, when processing paired-end reads using tools like umi_tools, these suffixes can cause issues if the tool expects the identifiers for paired reads to be identical. In the case of umi_tools extract, the tool is attempting to match paired-end reads based on their identifiers, and the presence of differing suffixes (i.e., /1 and /2) causes a mismatch.
# delete header /1, /2
zcat $sample.R1.a3a5trim.fastq.gz | awk '{if(NR%4==1) gsub("/1",""); print}' | gzip > $sample.R1.a3a5trim.fixed.fastq.gz
zcat $sample.R2.a3a5trim.fastq.gz | awk '{if(NR%4==1) gsub("/2",""); print}' | gzip > $sample.R2.a3a5trim.fixed.fastq.gz
# extract R2 5' end barcodes (6-mer)
umi_tools extract --random-seed=123 --extract-method=regex --bc-pattern=".*" --bc-pattern2="^(?P<umi_1>.{6}).*" -I $sample.R1.a3a5trim.fixed.fastq.gz --read2-in=$sample.R2.a3a5trim.fixed.fastq.gz \
--stdout=$sample.R1.a3a5trim.umi.fastq.gz --read2-out=$sample.R2.a3a5trim.umi.fastq.gz -L $sample.a3a5trim.umi.log
Dear Shun, Hope you have a great weenkend~ I got the mapping results and am now running the ivt model to get modification final results.
We have three ftom replications and one ivt sample: ftom_Rep1 ftom_Rep2 ftom_Rep3 ivt_Rep1
so when generating each ftom replication ivt results, I am using two methods:
1. generate one count table using three ftom and one ivt, and use it to run 3.three_ftom_one_ivt_model.R to get one modification table. I modify the following part codes of 3_run_model_ivt.R to generate 3.three_ftom_one_ivt_model.R
if (rep == "Y") {
in_file_data$ftom_G_count<-in_file_data$ftom_G_1+in_file_data$ftom_G_2
in_file_data$ftom_A_count<-in_file_data$ftom_A_1+in_file_data$ftom_A_2
in_file_data$ivt_G_count<-in_file_data$ivt_G_1+in_file_data$ivt_G_2
in_file_data$ivt_A_count<-in_file_data$ivt_A_1+in_file_data$ivt_A_2
to
# process count table
if (rep == "Y") {
in_file_data$ftom_G_count <- in_file_data$ftom_G_1 + in_file_data$ftom_G_2 + in_file_data$ftom_G_3
in_file_data$ftom_A_count <- in_file_data$ftom_A_1 + in_file_data$ftom_A_2 + in_file_data$ftom_A_3
in_file_data$ivt_G_count <- in_file_data$ivt_G_1
in_file_data$ivt_A_count <- in_file_data$ivt_A_1
Which method do you think is better?
Here is the script for your reference:
#!/bin/bash
#SBATCH -c
#SBATCH --mem=
#SBATCH -t 0-time
#SBATCH --mail-user=EMAIL
#SBATCH --mail-type=ALL
#SBATCH --job-name="3.run_model_ivt"
#SBATCH --output=%x.out
#SBATCH -a 1-4
######################
# Begin work section #
######################
echo Starting Time is `date "+%Y-%m-%d %H:%M:%S"`
start=$(date +%s)
#############################
cd 2.map_count_reads/hisat-3n/count_table
# group_names.txt:
GROUP=$(sed -n ${SLURM_ARRAY_TASK_ID}p group_names.txt)
# Variable declarations
ftom=$GROUP
ftom_Rep1="${GROUP}_Rep1"
ftom_Rep2="${GROUP}_Rep2"
ftom_Rep3="${GROUP}_Rep3"
ivt_Rep1=IVT_$GROUP
# prepare pileup2var.flt.txt
cp ../hisat3n_output/$ftom_Rep1/$ftom_Rep1.GRCh38_tran.pileup2var.flt.txt ./
cp ../hisat3n_output/$ftom_Rep2/$ftom_Rep2.GRCh38_tran.pileup2var.flt.txt ./
cp ../hisat3n_output/$ftom_Rep3/$ftom_Rep3.GRCh38_tran.pileup2var.flt.txt ./
cp ../hisat3n_output/$ivt_Rep1/$ivt_Rep1.GRCh38_tran.pileup2var.flt.txt ./
# load bedtools
module load StdEnv/2020 bedtools/2.29.2
# bed file of ftom_Rep1
awk -F '\t' 'BEGIN {OFS="\t"} FNR>1 && ($6+$9)>=10 {m=sprintf("%.4f",$6/($6+$9)*100);print $1,$2-1,$2,$1"_"$2"_"$3"="$9"="$6"="m,m,$3}' $ftom_Rep1.GRCh38_tran.pileup2var.flt.txt | slopBed -b 2 -i - -g chrom.sizes | fastaFromBed -s -bedOut -fi genome.fa -bed - | awk -F '\t' 'BEGIN {OFS="\t"} length($7)==5 {a=toupper($7);gsub(/T/,"U",a);if(a~/[GAU][AG]AC[ACU]/) {m="DRACH"} else {m="nonDRACH"};print $1,$2+2,$3-2,$4"="a"="m,$5,$6}' > $ftom_Rep1.pileup2var.flt.bed
# bed file of ftom_Rep2
awk -F '\t' 'BEGIN {OFS="\t"} FNR>1 && ($6+$9)>=10 {m=sprintf("%.4f",$6/($6+$9)*100);print $1,$2-1,$2,$1"_"$2"_"$3"="$9"="$6"="m,m,$3}' $ftom_Rep2.GRCh38_tran.pileup2var.flt.txt | slopBed -b 2 -i - -g chrom.sizes | fastaFromBed -s -bedOut -fi genome.fa -bed - | awk -F '\t' 'BEGIN {OFS="\t"} length($7)==5 {a=toupper($7);gsub(/T/,"U",a);if(a~/[GAU][AG]AC[ACU]/) {m="DRACH"} else {m="nonDRACH"};print $1,$2+2,$3-2,$4"="a"="m,$5,$6}' > $ftom_Rep2.pileup2var.flt.bed
# bed file of ftom_Rep3
awk -F '\t' 'BEGIN {OFS="\t"} FNR>1 && ($6+$9)>=10 {m=sprintf("%.4f",$6/($6+$9)*100);print $1,$2-1,$2,$1"_"$2"_"$3"="$9"="$6"="m,m,$3}' $ftom_Rep3.GRCh38_tran.pileup2var.flt.txt | slopBed -b 2 -i - -g chrom.sizes | fastaFromBed -s -bedOut -fi genome.fa -bed - | awk -F '\t' 'BEGIN {OFS="\t"} length($7)==5 {a=toupper($7);gsub(/T/,"U",a);if(a~/[GAU][AG]AC[ACU]/) {m="DRACH"} else {m="nonDRACH"};print $1,$2+2,$3-2,$4"="a"="m,$5,$6}' > $ftom_Rep3.pileup2var.flt.bed
# bed file of ivt_Rep1
awk -F '\t' 'BEGIN {OFS="\t"} FNR>1 && ($6+$9)>=10 {m=sprintf("%.4f",$6/($6+$9)*100);print $1,$2-1,$2,$1"_"$2"_"$3"="$9"="$6"="m,m,$3}' $ivt_Rep1.GRCh38_tran.pileup2var.flt.txt | slopBed -b 2 -i - -g chrom.sizes | fastaFromBed -s -bedOut -fi genome.fa -bed - | awk -F '\t' 'BEGIN {OFS="\t"} length($7)==5 {a=toupper($7);gsub(/T/,"U",a);if(a~/[GAU][AG]AC[ACU]/) {m="DRACH"} else {m="nonDRACH"};print $1,$2+2,$3-2,$4"="a"="m,$5,$6}' > $ivt_Rep1.pileup2var.flt.bed
################# Method 1 ################################################
# 4 samples count table: 3 ftom and 1 ivt
intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ftom_Rep2.pileup2var.flt.bed | intersectBed -wo -s -a - -b $ftom_Rep3.pileup2var.flt.bed | intersectBed -wo -s -a - -b $ivt_Rep4.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_1","ftom_A_1","ftom_G_2","ftom_A_2","ftom_G_3","ftom_A_3","ivt_G_1","ivt_A_1"} {split($4,a,"=");split($10,b,"=");split($17,c,"=");split($24,d,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3],c[2],c[3],d[2],d[3]}' > $ftom.ivt.count.table.txt
# Load R. Adjust the version and add the gcc module used for installing packages.
module load StdEnv/2020 gcc/9.3.0 r/4.3.1
# R script: 3.three_ftom_one_ivt_model.R
# ftom.ivt.count.table.out.txt
Rscript 3.three_ftom_one_ivt_model.R $ftom.ivt.count.table.txt $ftom.ivt.count.table.out.txt 50000 T 123 Y 10 0.05 0.1
###################### Method 2 #####################################################
# generate individual count table for three ftom samples: ftom_Rep1.ivt, ftom_Rep2.ivt, ftom_Rep3.ivt
# ftom_Rep1 vs ivt_Rep1 count table
intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ivt_Rep1.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_count","ftom_A_count","ivt_G_count","ivt_A_count"} {split($4,a,"=");split($10,b,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3]}' > $ftom_Rep1.ivt.count.table.txt
# ftom_Rep2 vs ivt_Rep1 count table
intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ivt_Rep1.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_count","ftom_A_count","ivt_G_count","ivt_A_count"} {split($4,a,"=");split($10,b,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3]}' > $ftom_Rep2.ivt.count.table.txt
# ftom_Rep3 vs ivt_Rep1 count table
intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ivt_Rep1.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_count","ftom_A_count","ivt_G_count","ivt_A_count"} {split($4,a,"=");split($10,b,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3]}' > $ftom_Rep3.ivt.count.table.txt
# Adjust version and add the gcc module used for installing packages.
module load StdEnv/2020 gcc/9.3.0 r/4.3.1
# R script: 3_run_model_ivt.R
# ftom_Rep1.ivt.count.table.out.txt
Rscript 3_run_model_ivt.R $ftom_Rep1.ivt.count.table.txt $ftom_Rep1.ivt.count.table.out.txt 50000 T 123 Y 10 0.05 0.1
# ftom_Rep2.ivt.count.table.out.txt
Rscript 3_run_model_ivt.R $ftom_Rep2.ivt.count.table.txt $ftom_Rep2.ivt.count.table.out.txt 50000 T 123 Y 10 0.05 0.1
# ftom_Rep3.ivt.count.table.out.txt
Rscript 3_run_model_ivt.R $ftom_Rep3.ivt.count.table.txt $ftom_Rep3.ivt.count.table.out.txt 50000 T 123 Y 10 0.05 0.1
wait
echo Ending Time is `date "+%Y-%m-%d %H:%M:%S"`
end=$(date +%s)
time=$(( ($end - $start) / 60 ))
echo Used Time is $time mins
Comparing the two methods' count table results, I will use the Method 2 results.
For Method 1:
after intersectBed, I got empty ftom.ivt.count.table.txt.
intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ftom_Rep2.pileup2var.flt.bed | intersectBed -wo -s -a - -b $ftom_Rep3.pileup2var.flt.bed | intersectBed -wo -s -a - -b $ivt_Rep4.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_1","ftom_A_1","ftom_G_2","ftom_A_2","ftom_G_3","ftom_A_3","ivt_G_1","ivt_A_1"} {split($4,a,"=");split($10,b,"=");split($17,c,"=");split($24,d,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3],c[2],c[3],d[2],d[3]}' > $ftom.ivt.count.table.txt
$ cat ftom.ivt.count.table.txt pos motif type ftom_G_1 ftom_A_1 ftom_G_2 ftom_A_2 ftom_G_3 ftom_A_3 ivt_G_1 ivt_A_1
For Method 2:
since only one replication for ftom and ivt, I change the parameter Y to N.
# change Y to N for one replicate
Rscript 3_run_model_ivt.R $ftom_Rep1.ivt.count.table.txt $ftom_Rep1.ivt.count.table.out.txt 50000 T 123 N 10 0.05 0.1
Rscript 3_run_model_ivt.R $ftom_Rep2.ivt.count.table.txt $ftom_Rep2.ivt.count.table.out.txt 50000 T 123 N 10 0.05 0.1
Rscript 3_run_model_ivt.R $ftom_Rep3.ivt.count.table.txt $ftom_Rep3.ivt.count.table.out.txt 50000 T 123 N 10 0.05 0.1
Hi Zongmin,
I prefer Method 2 as you can check the reproducibility of samples. If it is great, you can use Method 1 to merge samples and get more depth to do the analysis.
It is weird to get empty table using Method 1. Can you check the depth and reads you get from the samples? Maybe you can get some hints from Method 2 as you did pair-wise intersections among the samples.
Dear Shun, Thank you for the eTAM workflow.
After running the trim step, I get fastq results empty.
I also want to inquire about some details about the mapping step.
Here are my workflow details.
1.trim_adapters_and_extract_barcodes.sh
2.1 Build index I downloaded GRCh38.v110 from the ensemble website (without chr behind chromosome number, eg: >1) and extracted rRNA Sequences based on Genome and gtf files. I built the standard index for the rRNA and repeat index for the Genome.
2.2 map_and_count_reads
3.IVT model In our case we have three replicates of modification samples and one ivt control sample. When we use the ivt model, do we need to change parameters except merge the samples to get the count table? Looking forward to your reply.
Best regards, Zongmin