lauringlab / variant_pipeline

Work on the variant_pipeline and initial R analysis used in calling variants from NGS data
Apache License 2.0

Can't reproduce results of tutorial, too many variants with p.val=NA #11

Closed grendon closed 6 years ago

grendon commented 6 years ago

Hi. First of all, I could not get the pipeline to run "as is", so I ran each step manually. See the script log at the end of this post.

My plots look similar to those of the tutorial. However, when I got to the final part, where the filtering analysis is demonstrated, I noticed that the metrics of my variants do not match those of the tutorial. In particular, my p.val values are all NA for variants with Read_pos between 50 and 200.

I am puzzled by these discrepancies. Perhaps you, @jtmccr1, can help me here. Your input would be greatly appreciated. Thanks.

===============================================================

tool provenance

module load MUSCLE/3.8.31-IGB-gcc-4.9.4
module load Bowtie2/2.3.2-IGB-gcc-4.9.4
module load SAMtools/1.7-IGB-gcc-4.9.4
module load Python/3.6.1-IGB-gcc-4.9.4
module load FastQC/0.11.5-IGB-gcc-4.9.4-Java-1.8.0_152
module load picard/2.10.1-Java-1.8.0_152
module load R/3.5.0-IGB-gcc-4.9.4

alignment

bowtie2 --seed 42 --sensitive \
  -x ../../data/reference/wsn33_wt_plasmid \
  -1 ../../data/fastq/HA-22_S7_L001.1.1.fastq \
  -2 ../../data/fastq/HA-22_S7_L001.2.1.fastq \
  -S HA-22_S7_L001.sam 2> HA-22_S7_L001.align.log

bowtie2 --seed 42 --sensitive \
  -x ../../data/reference/wsn33_wt_plasmid \
  -1 ../../data/fastq/Plasmid-control_S49_L001.1.1.fastq \
  -2 ../../data/fastq/Plasmid-control_S49_L001.2.1.fastq \
  -S Plasmid-control_S49_L001.sam 2> Plasmid-control_S49_L001.align.log

filtering

samtools view -Shq 30 HA-22_S7_L001.sam > HA-22_S7_L001_filtered.sam
samtools view -Shq 30 Plasmid-control_S49_L001.sam > Plasmid-control_S49_L001_filtered.sam

mark and remove PCR duplicates

java -Xmx4g -Djava.io.tmpdir=./ -jar ${EBROOTPICARD}/picard.jar MarkDuplicates \
  INPUT=HA-22_S7_L001_filtered_sorted.bam \
  OUTPUT=HA-22_S7_L001_filtered_sorted_nodup.bam \
  REMOVE_DUPLICATES=true \
  CREATE_INDEX=true \
  METRICS_FILE=HA-22_S7_L001-picard.out.metrics \
  VALIDATION_STRINGENCY=LENIENT

java -Xmx4g -Djava.io.tmpdir=./ -jar ${EBROOTPICARD}/picard.jar MarkDuplicates \
  INPUT=Plasmid-control_S49_L001_filtered_sorted.bam \
  OUTPUT=Plasmid-control_S49_L001_filtered_sorted_nodup.bam \
  REMOVE_DUPLICATES=true \
  CREATE_INDEX=true \
  METRICS_FILE=Plasmid-control_S49_L001-picard.out.metrics \
  VALIDATION_STRINGENCY=LENIENT

call variants with deepSNV.R

This R script was not able to convert reference.fasta to regions.bed

Error message:

could not find function "fasta.info"

So I performed this step offline and then tweaked the script to read this input file instead.
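For reference, the offline conversion was roughly the following (a sketch using Biostrings::fasta.seqlengths, which replaced the removed fasta.info; the FASTA path is assumed from the bowtie2 index name above):

# Sketch: build a (chr, start, stop) regions file from the reference FASTA.
# fasta.seqlengths() is the current Biostrings replacement for fasta.info().
library(Biostrings)
seq_len <- fasta.seqlengths("../../data/reference/wsn33_wt_plasmid.fa")  # assumed path
regions <- data.frame(chr   = sub(" .*", "", names(seq_len)),  # keep only the sequence name
                      start = 1,
                      stop  = unname(seq_len))
write.table(regions, "reference.bed", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

The inputs I then passed to the tweaked script were: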

reference.bed <- "reference.bed"

test.bam <- "HA-22_S7.bam"

control.bam <- "Plasmid-control_S49.bam"

method <- "bonferroni"

p.cut <- "0.1"

p.com.meth <- "fisher"

disp <- "two.sided"

stringent_freq <- "0.5"

csv <- "results.csv"

fa <- "results.fa"
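For clarity, here is roughly how these inputs map onto a deepSNV call (a sketch assuming the stock deepSNV API, not the pipeline's actual deepSNV.R):

# Sketch only: call deepSNV with the inputs above and write out the variant table.
library(deepSNV)
regions <- read.table(reference.bed, col.names = c("chr", "start", "stop"))
res <- deepSNV(test = test.bam, control = control.bam, regions = regions)
# The model-adjustment step mentioned below is presumably a dispersion re-fit, e.g.:
# res <- estimateDispersion(res, alternative = disp)
variants <- summary(res, sig.level = as.numeric(p.cut), adjust.method = method)
write.csv(variants, csv, row.names = FALSE)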

add metrics with mapq.py

This Python script was not parsing the file correctly.

Error message:

Traceback (most recent call last):
  File "../../scripts/mapq.py", line 52, in <module>
    pos=int(line[1]) #The position relative to the chromosome
ValueError: invalid literal for int() with base 10: 'HA'

I reran the deepSNV.R step and noticed that when I skipped the step that adjusts the model, the output file does have the format that mapq.py expects. So I skipped it.

However, the results in my final output file do not match those of the tutorial: I see too many variants with p.val=NA.
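A quick way to quantify this (a sketch; the file name here is a placeholder for the final output, which I assume keeps the p.val and Read_pos columns referenced above):

# Count variants with missing p-values, broken down by read position.
out <- read.csv("final_output.csv", stringsAsFactors = FALSE)  # placeholder file name
sum(is.na(out$p.val))                                   # total variants with p.val = NA
table(cut(out$Read_pos, c(0, 50, 200, Inf)),
      is.na(out$p.val))                                 # NA counts by Read_pos range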

jtmccr1 commented 6 years ago

Thanks for reaching out, and I'm sorry about the confusion.

I am going to merge this issue with issue #8, which also notes problems with the tutorial.

Over time, we have had to update the pipeline to better handle edge cases where the plasmid control and the sample differ at the consensus level. I did not keep the tutorial up to date with these changes.

I have been working to make the pipeline more flexible and robust. These changes are on my own personal fork of the project on the robust_stages branch. I plan on merging this recent work back into the main repository soon.

The issue with fasta.info has to do with an update to Biostrings. I will be more explicit about dependencies in the updated tutorial.

grendon commented 6 years ago

Thanks for the prompt reply. I look forward to the merged version. Do you have a timeline in mind?

jtmccr1 commented 6 years ago

I'm juggling a few projects at the moment, but I plan on having it finalized in 2-3 weeks.

jtmccr1 commented 6 years ago

I merged a pretty lengthy overhaul of the pipeline this morning. There is still some work to do to make it more versatile and user-friendly, but I think the issues with the dependencies and the tutorial are fixed. I've also updated the default parameters in the tutorial to only call variants using deepSNV.

I'm going to close the issue, but if you have any problems or questions, please don't hesitate to comment here or reopen it.