eTAM workflow details - Githubissues

ghost commented 1 year ago

Dear Shun, Thank you for the eTAM workflow.

After running the trim step, I get fastq results empty.

I also want to inquire about some details about the mapping step.

Here are my workflow details.

1.trim_adapters_and_extract_barcodes.sh

#!/bin/bash

######################
# Begin work section #
######################

# sample name: MEP_Rep1
sample=MEP_Rep1

R1_5'_end_adapters = sequence1(not same in your paper)
R2_3'_end_adapters = sequence2(not same in your paper)
R1_3'_end_adapters = sequence3(not same in your paper)
R2_5'_end_adapters = sequence4(not same in your paper)

echo Starting Time is `date "+%Y-%m-%d %H:%M:%S"`
start=$(date +%s)

# trim R1 5' end adapters and R2 3' end adapters 
cutadapt -e 0.1 -n 1 -O 1 -q 6 -m 0:46 -g $R1_5'_end_adapters -G $R2_3'_end_adapters \
    -o $sample.R1.a5trim.fastq.gz -p $sample.R2.a3trim.fastq.gz $sample.R1.fastq.gz $sample.R2.fastq.gz > $sample.a3a5trim.umi.round1.log

# trim R1 3' end adapters and R2 5' end adapters
cutadapt -e 0.1 -n 1 -O 7 -q 6 -m 0:46 -a $R1_3'_end_adapters -A $R2_5'_end_adapters \
    -o $sample.R1.a3a5trim.fastq.gz -p $sample.R2.a3a5trim.fastq.gz $sample.R1.a5trim.fastq.gz $sample.R2.a3trim.fastq.gz > $sample.a3a5trim.umi.round2.log

# extract R2 5' end barcodes (6-mer)
umi_tools extract --random-seed=123 --extract-method=regex --bc-pattern=".*" --bc-pattern2="^(?P<umi_1>.{6}).*" -I $sample.R1.a3a5trim.fastq.gz --read2-in=$sample.R2.a3a5trim.fastq.gz \
    --stdout=$sample.R1.a3a5trim.umi.fastq.gz --read2-out=$sample.R2.a3a5trim.umi.fastq.gz -L $sample.a3a5trim.umi.log

rm $sample.R1.a5trim.fastq.gz $sample.R2.a3trim.fastq.gz
rm $sample.R1.a3a5trim.fastq.gz $sample.R2.a3a5trim.fastq.gz

echo -e "\nAll done\n"

echo Ending Time is `date "+%Y-%m-%d %H:%M:%S"`
end=$(date +%s)
time=$(( ($end - $start) / 60 ))
echo Used Time is $time mins

2.1 Build index I downloaded GRCh38.v110 from the ensemble website (without chr behind chromosome number, eg: >1) and extracted rRNA Sequences based on Genome and gtf files. I built the standard index for the rRNA and repeat index for the Genome.

# Extract rRNA Sequences from the Genome
awk '$3=="gene" && /gene_biotype "rRNA"/' genome.gtf > rRNA.gtf
# Convert the rRNA.gtf to a BED format
awk '{print $1"\t"$4-1"\t"$5"\t"$9}' rRNA.gtf > rRNA.bed
# Use bedtools to extract sequences
module load StdEnv/2020 bedtools/2.30.0 
bedtools getfasta -fi genome.fa -bed rRNA.bed -fo rRNA.fa

# Build the HISAT-3N standard index for rRNA. 
# note, when I build repeat index it shows an error, so I build the standard index.
hisat-3n-build -p 8 --base-change A,G rRNA.fa rRNA_tran

# Build the repeat HISAT-3N index (with A to G conversion, require 256 GB memory for human genome index):  
cd hisat-3n/genome_index

hisat-3n-build -p 8 --base-change A,G --repeat-index --ss genome.ss --exon genome.exon genome.fa GRCh38_tran

2.2 map_and_count_reads

#!/bin/bash

######################
# Begin work section #
######################

sample=MEP_Rep1

# test the software can be use
# samtools --help
# hisat-3n -h
# pileup2var -h

echo Starting Time is `date "+%Y-%m-%d %H:%M:%S"`
start=$(date +%s)

## main setting
# sample fastq file path
fastq_dir=2.map_count_reads/trim_umi_fastq
# hisat-3n rRNA index name: e.g., rRNA_stran
hisat3n_rep_index_name=rRNA_tran
# hisat-3n rRNA index path
hisat3n_rep_index=2.map_count_reads/hisat-3n/rRNA_index/$hisat3n_rep_index_name

# hisat-3n genome index name: e.g., GRCh38_tran
hisat3n_index_name=GRCh38_tran
# hisat-3n genome index path
hisat3n_index=2.map_count_reads/hisat-3n/genome_index/$hisat3n_index_name

# hisat-3n alignment result output path
hisat3n_out_dir=2.map_count_reads/hisat-3n/hisat3n_output

# rRNA sequneces in fasta format
rep_fa=2.map_count_reads/hisat-3n/rRNA_index/rRNA.fa
# genome sequneces in fasta format
genome_fa=2.map_count_reads/hisat-3n/genome_index/genome.fa

# threads to run the bash script
ncpus=24

# minimum percentage of converted As in a read to be considered as a valid read
A2G_percent=0.5

# minimum read coverage in a position to be considered for output
cov=1

# library type: F for forward, R for reverse
strandness="F"

# here R2 reads were used.
sample_r2_fq_file=$fastq_dir/$sample.R2.a3a5trim.umi.fastq.gz

echo -e "\n$sample\n"

if [[ -d $hisat3n_out_dir/$sample ]];then rm -rf $hisat3n_out_dir/$sample;fi
mkdir -p $hisat3n_out_dir/$sample

echo -e "\nhisat3n align -- rRNA mapping\n"
hisat-3n -p $ncpus --time --base-change A,G --no-spliced-alignment --no-softclip --norc --no-unal --rna-strandness $strandness -x $hisat3n_rep_index -U $sample_r2_fq_file --un-gz $hisat3n_out_dir/$sample/$sample.rmRep.fastq.gz | \
    samtools sort -@ $ncpus -T $hisat3n_out_dir/$sample -o $hisat3n_out_dir/$sample/$sample.$hisat3n_rep_index_name.align.sorted.bam -

samtools index $hisat3n_out_dir/$sample/$sample.$hisat3n_rep_index_name.align.sorted.bam

samtools stats -r $rep_fa $hisat3n_out_dir/$sample/$sample.$hisat3n_rep_index_name.align.sorted.bam > $sample/$sample.$hisat3n_rep_index_name.align.sorted.stats

# filter reads by the percentage of converted As
samtools view -@ $ncpus -hb -e "([Yf]+[Zf]>0) && ([Yf]/([Yf]+[Zf])>=$A2G_percent)" $hisat3n_out_dir/$sample/$sample.$hisat3n_rep_index_name.align.sorted.bam | samtools sort -T $hisat3n_out_dir/$sample -@ $ncpus -o $hisat3n_out_dir/$sample/$sample.$hisat3n_rep_index_name.align.sorted.flt.bam -
# Count converted As and unconverted As. Here hisat-3n-table is used as an example. Other similar tools can also be used to complete the task. 
#samtools view -@ $ncpus $hisat3n_out_dir/$sample/$sample.$hisat3n_rep_index_name.align.sorted.flt.bam | hisat-3n-table -u -p $ncpus --alignments - --ref $rep_fa --output-name $hisat3n_out_dir/$sample/$sample.$hisat3n_rep_index_name.conversion.flt.txt --base-change A,G
# Here, pileup2var is used.
pileup2var -f 524 -s $strandness -g $rep_fa -b $hisat3n_out_dir/$sample/$sample.$hisat3n_rep_index_name.align.sorted.flt.bam -o $hisat3n_out_dir/$sample/$sample.$hisat3n_rep_index_name.pileup2var.flt.txt

echo -e "\nhisat3n align -- genome mapping\n"
hisat-3n -p $ncpus --time --base-change A,G --repeat --repeat-limit 1000 --bowtie2-dp 0 --no-unal --rna-strandness $strandness -x $hisat3n_index -U $hisat3n_out_dir/$sample/$sample.rmRep.fastq.gz | \
    samtools view -@ $ncpus -Shb - -o $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.raw.bam

# filter out multiple-loci mapped reads
samtools view -@ $ncpus -q 60 -hb $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.raw.bam | samtools sort -T $hisat3n_out_dir/$sample -@ $ncpus -o $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.bam -
rm $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.raw.bam

samtools index $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.bam

# deduplication
umi_tools dedup --random-seed=123 --method=unique --spliced-is-unique -I $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.bam -S $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup.bam \
    --output-stats=$hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup -L $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup.log

samtools index $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup.bam

samtools stats -r $genome_fa $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup.bam > $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup.stats

# filter reads by the percentage of converted As
samtools view -@ $ncpus -hb -e "([Yf]+[Zf]>0) && ([Yf]/([Yf]+[Zf])>=$A2G_percent)" $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup.bam | samtools sort -T $hisat3n_out_dir/$sample -@ $ncpus -o $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup.flt.bam -
# Count converted As and unconverted As. Here hisat-3n-table is used as an example. Other similar tools can also be used to complete the task. 
#samtools view -@ $ncpus $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup.flt.bam | hisat-3n-table -p $ncpus --alignments - --ref $genome_fa --output-name $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.conversion.flt.txt --base-change A,G
# Here, pileup2var is used.
pileup2var -t $ncpus -f 524 -a A -c $cov -s $strandness -g $genome_fa -b $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.align.sorted.dedup.flt.bam -o $hisat3n_out_dir/$sample/$sample.$hisat3n_index_name.pileup2var.flt.txt

# rm $hisat3n_out_dir/$sample/$sample.rmRep.fastq.gz
wait

echo Ending Time is `date "+%Y-%m-%d %H:%M:%S"`
end=$(date +%s)
time=$(( ($end - $start) / 60 ))
echo Used Time is $time mins

3.IVT model In our case we have three replicates of modification samples and one ivt control sample. When we use the ivt model, do we need to change parameters except merge the samples to get the count table? Looking forward to your reply.

Best regards, Zongmin

ghost commented 1 year ago

For the trim step, I corrected my script about the paired parameters.

Paired-end options: The -A/-G/-B/-U/-Q options work like their lowercase counterparts, but are applied to R2 (second read in pair)

-A ADAPTER 3' adapter to be removed from R2 -G ADAPTER 5' adapter to be removed from R2 -B ADAPTER 5'/3 adapter to be removed from R2 -U LENGTH Remove LENGTH bases from R2 -Q [5'CUTOFF,]3'CUTOFF

# trim R1 5' end adapters and R2 3' end adapters
cutadapt -e 0.1 -n 1 -O 1 -q 6 -m 0:46 -g $R1_5'_end_adapters -A $R2_3'_end_adapters \
    -o $sample.R1.a5trim.fastq.gz -p $sample.R2.a3trim.fastq.gz $raw_name.R1.fastq.gz $raw_name.R2.fastq.gz > $sample.a3a5trim.umi.round1.log
# trim R1 3' end adapters and R2 5' end adapters
cutadapt -e 0.1 -n 1 -O 7 -q 6 -m 0:46 -a $R1_3'_end_adapters -G $R2_5'_end_adapters \
    -o $sample.R1.a3a5trim.fastq.gz -p $sample.R2.a3a5trim.fastq.gz $sample.R1.a5trim.fastq.gz $sample.R2.a3trim.fastq.gz > $sample.a3a5trim.umi.round2.log

ghost commented 1 year ago

I figured out why my $sample.R1.a3a5trim.fastq.gz results were empty before. It's happened in the umi_tools extract step. I added the step to remove header /1 and /2 suffixes, and I got the expected results.

The /1 and /2 suffixes in the FASTQ headers are commonly used to denote which member of a paired-end read a particular sequence belongs to. The /1 suffix indicates the forward read, and the /2 suffix indicates the reverse read. However, when processing paired-end reads using tools like umi_tools, these suffixes can cause issues if the tool expects the identifiers for paired reads to be identical. In the case of umi_tools extract, the tool is attempting to match paired-end reads based on their identifiers, and the presence of differing suffixes (i.e., /1 and /2) causes a mismatch.

# delete header /1, /2
zcat $sample.R1.a3a5trim.fastq.gz | awk '{if(NR%4==1) gsub("/1",""); print}' | gzip > $sample.R1.a3a5trim.fixed.fastq.gz
zcat $sample.R2.a3a5trim.fastq.gz | awk '{if(NR%4==1) gsub("/2",""); print}' | gzip > $sample.R2.a3a5trim.fixed.fastq.gz

# extract R2 5' end barcodes (6-mer)
umi_tools extract --random-seed=123 --extract-method=regex --bc-pattern=".*" --bc-pattern2="^(?P<umi_1>.{6}).*" -I $sample.R1.a3a5trim.fixed.fastq.gz --read2-in=$sample.R2.a3a5trim.fixed.fastq.gz \
        --stdout=$sample.R1.a3a5trim.umi.fastq.gz --read2-out=$sample.R2.a3a5trim.umi.fastq.gz -L $sample.a3a5trim.umi.log

ghost commented 1 year ago

Dear Shun, Hope you have a great weenkend~ I got the mapping results and am now running the ivt model to get modification final results.

We have three ftom replications and one ivt sample: ftom_Rep1 ftom_Rep2 ftom_Rep3 ivt_Rep1

so when generating each ftom replication ivt results, I am using two methods:

1. generate one count table using three ftom and one ivt, and use it to run 3.three_ftom_one_ivt_model.R to get one modification table. I modify the following part codes of 3_run_model_ivt.R to generate 3.three_ftom_one_ivt_model.R

if (rep == "Y") {
    in_file_data$ftom_G_count<-in_file_data$ftom_G_1+in_file_data$ftom_G_2
    in_file_data$ftom_A_count<-in_file_data$ftom_A_1+in_file_data$ftom_A_2
    in_file_data$ivt_G_count<-in_file_data$ivt_G_1+in_file_data$ivt_G_2
    in_file_data$ivt_A_count<-in_file_data$ivt_A_1+in_file_data$ivt_A_2

to

# process count table
if (rep == "Y") {
  in_file_data$ftom_G_count <- in_file_data$ftom_G_1 + in_file_data$ftom_G_2 + in_file_data$ftom_G_3
  in_file_data$ftom_A_count <- in_file_data$ftom_A_1 + in_file_data$ftom_A_2 + in_file_data$ftom_A_3
  in_file_data$ivt_G_count <- in_file_data$ivt_G_1
  in_file_data$ivt_A_count <- in_file_data$ivt_A_1

generate each ftom and ivt count table, and use three count tables to run 3_run_model_ivt.R to get three modification tables.

Which method do you think is better?

Here is the script for your reference:

#!/bin/bash
#SBATCH -c  
#SBATCH --mem= 
#SBATCH -t 0-time
#SBATCH --mail-user=EMAIL
#SBATCH --mail-type=ALL
#SBATCH --job-name="3.run_model_ivt"
#SBATCH --output=%x.out
#SBATCH -a 1-4

######################
# Begin work section #
######################

echo Starting Time is `date "+%Y-%m-%d %H:%M:%S"`
start=$(date +%s)

#############################

cd 2.map_count_reads/hisat-3n/count_table

# group_names.txt: 

GROUP=$(sed -n ${SLURM_ARRAY_TASK_ID}p group_names.txt)

# Variable declarations
ftom=$GROUP
ftom_Rep1="${GROUP}_Rep1"
ftom_Rep2="${GROUP}_Rep2"
ftom_Rep3="${GROUP}_Rep3"
ivt_Rep1=IVT_$GROUP

# prepare pileup2var.flt.txt
cp ../hisat3n_output/$ftom_Rep1/$ftom_Rep1.GRCh38_tran.pileup2var.flt.txt ./
cp ../hisat3n_output/$ftom_Rep2/$ftom_Rep2.GRCh38_tran.pileup2var.flt.txt ./
cp ../hisat3n_output/$ftom_Rep3/$ftom_Rep3.GRCh38_tran.pileup2var.flt.txt ./
cp ../hisat3n_output/$ivt_Rep1/$ivt_Rep1.GRCh38_tran.pileup2var.flt.txt ./

# load bedtools
module load StdEnv/2020 bedtools/2.29.2 

# bed file of ftom_Rep1
awk -F '\t' 'BEGIN {OFS="\t"} FNR>1 && ($6+$9)>=10 {m=sprintf("%.4f",$6/($6+$9)*100);print $1,$2-1,$2,$1"_"$2"_"$3"="$9"="$6"="m,m,$3}' $ftom_Rep1.GRCh38_tran.pileup2var.flt.txt | slopBed -b 2 -i - -g chrom.sizes | fastaFromBed -s -bedOut -fi genome.fa -bed - | awk -F '\t' 'BEGIN {OFS="\t"} length($7)==5 {a=toupper($7);gsub(/T/,"U",a);if(a~/[GAU][AG]AC[ACU]/) {m="DRACH"} else {m="nonDRACH"};print $1,$2+2,$3-2,$4"="a"="m,$5,$6}' > $ftom_Rep1.pileup2var.flt.bed

# bed file of ftom_Rep2
awk -F '\t' 'BEGIN {OFS="\t"} FNR>1 && ($6+$9)>=10 {m=sprintf("%.4f",$6/($6+$9)*100);print $1,$2-1,$2,$1"_"$2"_"$3"="$9"="$6"="m,m,$3}' $ftom_Rep2.GRCh38_tran.pileup2var.flt.txt | slopBed -b 2 -i - -g chrom.sizes | fastaFromBed -s -bedOut -fi genome.fa -bed - | awk -F '\t' 'BEGIN {OFS="\t"} length($7)==5 {a=toupper($7);gsub(/T/,"U",a);if(a~/[GAU][AG]AC[ACU]/) {m="DRACH"} else {m="nonDRACH"};print $1,$2+2,$3-2,$4"="a"="m,$5,$6}' > $ftom_Rep2.pileup2var.flt.bed

# bed file of ftom_Rep3
awk -F '\t' 'BEGIN {OFS="\t"} FNR>1 && ($6+$9)>=10 {m=sprintf("%.4f",$6/($6+$9)*100);print $1,$2-1,$2,$1"_"$2"_"$3"="$9"="$6"="m,m,$3}' $ftom_Rep3.GRCh38_tran.pileup2var.flt.txt | slopBed -b 2 -i - -g chrom.sizes | fastaFromBed -s -bedOut -fi genome.fa -bed - | awk -F '\t' 'BEGIN {OFS="\t"} length($7)==5 {a=toupper($7);gsub(/T/,"U",a);if(a~/[GAU][AG]AC[ACU]/) {m="DRACH"} else {m="nonDRACH"};print $1,$2+2,$3-2,$4"="a"="m,$5,$6}' > $ftom_Rep3.pileup2var.flt.bed

# bed file of ivt_Rep1
awk -F '\t' 'BEGIN {OFS="\t"} FNR>1 && ($6+$9)>=10 {m=sprintf("%.4f",$6/($6+$9)*100);print $1,$2-1,$2,$1"_"$2"_"$3"="$9"="$6"="m,m,$3}' $ivt_Rep1.GRCh38_tran.pileup2var.flt.txt | slopBed -b 2 -i - -g chrom.sizes | fastaFromBed -s -bedOut -fi genome.fa -bed - | awk -F '\t' 'BEGIN {OFS="\t"} length($7)==5 {a=toupper($7);gsub(/T/,"U",a);if(a~/[GAU][AG]AC[ACU]/) {m="DRACH"} else {m="nonDRACH"};print $1,$2+2,$3-2,$4"="a"="m,$5,$6}' > $ivt_Rep1.pileup2var.flt.bed

#################  Method 1  ################################################
# 4 samples count table: 3 ftom and 1 ivt
intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ftom_Rep2.pileup2var.flt.bed | intersectBed -wo -s -a - -b $ftom_Rep3.pileup2var.flt.bed | intersectBed -wo -s -a - -b $ivt_Rep4.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_1","ftom_A_1","ftom_G_2","ftom_A_2","ftom_G_3","ftom_A_3","ivt_G_1","ivt_A_1"} {split($4,a,"=");split($10,b,"=");split($17,c,"=");split($24,d,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3],c[2],c[3],d[2],d[3]}' > $ftom.ivt.count.table.txt

# Load R. Adjust the version and add the gcc module used for installing packages.
module load StdEnv/2020 gcc/9.3.0 r/4.3.1

# R script: 3.three_ftom_one_ivt_model.R
# ftom.ivt.count.table.out.txt 
Rscript 3.three_ftom_one_ivt_model.R $ftom.ivt.count.table.txt $ftom.ivt.count.table.out.txt 50000 T 123 Y 10 0.05 0.1

###################### Method 2 #####################################################
# generate individual count table for three ftom samples: ftom_Rep1.ivt, ftom_Rep2.ivt, ftom_Rep3.ivt

# ftom_Rep1 vs ivt_Rep1 count table
intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ivt_Rep1.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_count","ftom_A_count","ivt_G_count","ivt_A_count"} {split($4,a,"=");split($10,b,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3]}' > $ftom_Rep1.ivt.count.table.txt

# ftom_Rep2 vs ivt_Rep1 count table
intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ivt_Rep1.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_count","ftom_A_count","ivt_G_count","ivt_A_count"} {split($4,a,"=");split($10,b,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3]}' > $ftom_Rep2.ivt.count.table.txt

# ftom_Rep3 vs ivt_Rep1 count table
intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ivt_Rep1.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_count","ftom_A_count","ivt_G_count","ivt_A_count"} {split($4,a,"=");split($10,b,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3]}' > $ftom_Rep3.ivt.count.table.txt

# Adjust version and add the gcc module used for installing packages.
module load StdEnv/2020 gcc/9.3.0 r/4.3.1

# R script: 3_run_model_ivt.R
# ftom_Rep1.ivt.count.table.out.txt
Rscript 3_run_model_ivt.R $ftom_Rep1.ivt.count.table.txt $ftom_Rep1.ivt.count.table.out.txt 50000 T 123 Y 10 0.05 0.1

# ftom_Rep2.ivt.count.table.out.txt
Rscript 3_run_model_ivt.R $ftom_Rep2.ivt.count.table.txt $ftom_Rep2.ivt.count.table.out.txt 50000 T 123 Y 10 0.05 0.1

# ftom_Rep3.ivt.count.table.out.txt
Rscript 3_run_model_ivt.R $ftom_Rep3.ivt.count.table.txt $ftom_Rep3.ivt.count.table.out.txt 50000 T 123 Y 10 0.05 0.1

wait

echo Ending Time is `date "+%Y-%m-%d %H:%M:%S"`
end=$(date +%s)
time=$(( ($end - $start) / 60 ))
echo Used Time is $time mins

ghost commented 1 year ago

Comparing the two methods' count table results, I will use the Method 2 results.

For Method 1:

after intersectBed, I got empty ftom.ivt.count.table.txt.

intersectBed -wo -s -a $ftom_Rep1.pileup2var.flt.bed -b $ftom_Rep2.pileup2var.flt.bed | intersectBed -wo -s -a - -b $ftom_Rep3.pileup2var.flt.bed | intersectBed -wo -s -a - -b $ivt_Rep4.pileup2var.flt.bed | awk -F '\t' 'BEGIN {OFS="\t";print "pos","motif","type","ftom_G_1","ftom_A_1","ftom_G_2","ftom_A_2","ftom_G_3","ftom_A_3","ivt_G_1","ivt_A_1"} {split($4,a,"=");split($10,b,"=");split($17,c,"=");split($24,d,"=");print a[1],a[5],a[6],a[2],a[3],b[2],b[3],c[2],c[3],d[2],d[3]}' > $ftom.ivt.count.table.txt

$ cat ftom.ivt.count.table.txt pos motif type ftom_G_1 ftom_A_1 ftom_G_2 ftom_A_2 ftom_G_3 ftom_A_3 ivt_G_1 ivt_A_1

For Method 2:

since only one replication for ftom and ivt, I change the parameter Y to N.

# change Y to N for one replicate
Rscript 3_run_model_ivt.R $ftom_Rep1.ivt.count.table.txt $ftom_Rep1.ivt.count.table.out.txt 50000 T 123 N 10 0.05 0.1

Rscript 3_run_model_ivt.R $ftom_Rep2.ivt.count.table.txt $ftom_Rep2.ivt.count.table.out.txt 50000 T 123 N 10 0.05 0.1

Rscript 3_run_model_ivt.R $ftom_Rep3.ivt.count.table.txt $ftom_Rep3.ivt.count.table.out.txt 50000 T 123 N 10 0.05 0.1

shunliubio commented 1 year ago

Hi Zongmin,

I prefer Method 2 as you can check the reproducibility of samples. If it is great, you can use Method 1 to merge samples and get more depth to do the analysis.

It is weird to get empty table using Method 1. Can you check the depth and reads you get from the samples? Maybe you can get some hints from Method 2 as you did pair-wise intersections among the samples.

shunliubio / eTAM-seq_workflow

eTAM workflow details #5