Closed k-florek closed 2 years ago
@multikengineer I just pushed a bunch of additional changes to the scrub.sh
script. These changes could allow the tool to be applied to a greater number of use cases.
Specifically related to Issue #8 I included the ability to specify an input and output file using -i
and -o
argument flags but also added the ability to take fastq files from stdin and output to stdout. This allows users to wrap the tool with any compression tool they see necessary. i.e. gunzip sample.fastq.gz | scrub.sh | gzip > sample.clean.fastq.gz
without adding dependencies to the tool and docker image. This also works when using docker gunzip sample.fastq.gz | docker run ncbi/sra-human-scrubber bash -c "scrub.sh" | gzip > sample.clean.fastq.gz
.
With the additional changes to the argument flags I also added the option to specify a path to the database using -d
flag to address issue #10.
I also updated the testing function to use the -t
flag for testing the script.
These changes drastically alter the usage of the scrub.sh but I personally feel these changes would be helpful for the community of users. However, I am not aware of all the use cases and design decisions so feel free to take them or leave them!
Really helpful PR! Just want to second @k-florek's sentiment that these changes would broaden the applications of an already extremely useful resource.
Hi all, I played around with @k-florek PR a bit over the weekend. I ran a few ~15 FASTQs through the current scrubber amd the changes in this PR.
I'm happy to report there were do differences in the outputs between the two. There was a slight speed up in this PR when dealing with larger compressed FASTQs. This was on the scale of minutes, whish can really add up when processing 1000s of samples.
Times are reported in seconds
fastq input_size current_runtime gunzip_runtime scrub_runtime gzip_runtime pr_runtime time_difference current_md5 pr_md5 md5s_equal
fastqs/nextseq-21060550_S8_L001_R1_001.fastq.gz 1196060293 2825 62 2441 321 1897 928 2cd124c3ca83afc9b525ab98c22b74f1 2cd124c3ca83afc9b525ab98c22b74f1 TRUE
fastqs/nextseq-21060550_S8_L001_R2_001.fastq.gz 1244972907 1149 35 857 256 1103 46 0a3855c7ad1e7b81c572c5e59a350fd9 0a3855c7ad1e7b81c572c5e59a350fd9 TRUE
fastqs/nextseq-21061395_S9_L001_R2_001.fastq.gz 1150902242 1124 31 851 242 932 192 fa800bcad6e58c4180903d3d9f40ccfe fa800bcad6e58c4180903d3d9f40ccfe TRUE
fastqs/nextseq-21067837_S45_L001_R1_001.fastq.gz 1374874988 1175 38 814 322 1019 156 d4b9786ac49673144b0cf66e978f8d71 d4b9786ac49673144b0cf66e978f8d71 TRUE
fastqs/nextseq-21067903_S47_L001_R2_001.fastq.gz 1288848103 926 35 601 290 860 66 4f28a878f9854a7f6cec48ecbb9c3e12 4f28a878f9854a7f6cec48ecbb9c3e12 TRUE
fastqs/nextseq-21069148_S96_L001_R2_001.fastq.gz 1158793737 1051 32 793 225 871 180 528656393d0febc8a8c4c5feb2301bc9 528656393d0febc8a8c4c5feb2301bc9 TRUE
fastqs/nextseq-21070200_S17_L001_R1_001.fastq.gz 1354315407 1591 37 1231 323 1494 97 b46183d1ed89b694acd6de616af32917 b46183d1ed89b694acd6de616af32917 TRUE
fastqs/nextseq-21071460_S11_L001_R2_001.fastq.gz 1190840921 1368 31 1044 293 1221 147 276b2ee12d5b8a84434b7f4441106a74 276b2ee12d5b8a84434b7f4441106a74 TRUE
fastqs/nextseq-21071905_S15_L001_R1_001.fastq.gz 1100936394 1039 30 760 249 985 54 4ec324bdb862514bce48ae15d8e806a1 4ec324bdb862514bce48ae15d8e806a1 TRUE
fastqs/nextseq-21072369_S4_L001_R1_001.fastq.gz 1377452873 1565 38 1198 328 1334 231 30c58af27c0d60dae6913b6ddcf73e14 30c58af27c0d60dae6913b6ddcf73e14 TRUE
fastqs/ont-D01_20120905.fastq.gz 101906832 45 2 33 10 45 0 6a6b5eb9ecd1dd0a223519f5c0764051 6a6b5eb9ecd1dd0a223519f5c0764051 TRUE
fastqs/ont-D03_20150057.fastq.gz 184218996 85 4 63 18 84 1 5753e0411e442f07bd014e8ecb75ba31 5753e0411e442f07bd014e8ecb75ba31 TRUE
fastqs/ont-E01_21052359.fastq.gz 177265365 85 4 63 18 84 1 5b43ffa15dbce4cda2233816039e2b82 5b43ffa15dbce4cda2233816039e2b82 TRUE
fastqs/ont-E03_21049208.fastq.gz 123230954 56 3 41 12 55 1 fab8a46e99d6667ee22fbf67250e1bdd fab8a46e99d6667ee22fbf67250e1bdd TRUE
fastqs/ont-H02_21048589.fastq.gz 194287953 98 4 73 21 97 1 a9773b5d710ea66c446dc42b7928172d a9773b5d710ea66c446dc42b7928172d TRUE
heres the script I used for testing, let me know if there is any glaring issues
#! /bin/bash
SCRUBBER_DB=${SCRUBBER_SHARE}/data/human_filter.db
# Current scrubber
echo "Run test on current Scrubber"
which scrub.sh
scrub.sh test
echo "Run test on PR Scrubber"
sra-human-scrubber/scripts/scrub.sh -t -d ${SCRUBBER_SHARE}/data/human_filter.db
echo "Compare scrubbers"
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "fastq" "input_size" "current_runtime" "gunzip_runtime" "scrub_runtime" "gzip_runtime" "pr_runtime" "current_md5" "pr_md5" "md5s_equal"
rm -rf current_scrubber pr_scrubber
mkdir current_scrubber pr_scrubber
for f in fastqs/*.fastq.gz; do
file_size=$(du ${f} -b | cut -f 1)
file_name="$(basename ${f%%.*})"
# Test urrent scrubber
start=`date +%s`
start_gunzip=`date +%s`
gunzip -c ${f} > current_scrubber/${file_name}.fastq
end_gunzip=`date +%s`
start_scrub=`date +%s`
scrub.sh current_scrubber/${file_name}.fastq
end_scrub=`date +%s`
start_gzip=`date +%s`
gzip current_scrubber/${file_name}.fastq.clean
end_gzip=`date +%s`
rm current_scrubber/${file_name}.fastq
end=`date +%s`
current_runtime=$((end-start))
current_gunzip=$((end_gunzip-start_gunzip))
current_scrub=$((end_scrub-start_scrub))
current_gzip=$((end_gzip-start_gzip))
current_fastq="current_scrubber/${file_name}.fastq.clean.gz"
# test PR scrubber
pr_fastq="pr_scrubber/${file_name}.clean.fastq.gz"
start=`date +%s`
zcat ${f} | sra-human-scrubber/scripts/scrub.sh -d ${SCRUBBER_SHARE}/data/human_filter.db | gzip > ${pr_fastq}
end=`date +%s`
pr_runtime=$((end-start))
# test md5s of fastq contents
current_md5=$(zcat ${current_fastq} | md5sum | cut -d " " -f 1)
pr_md5=$(zcat ${pr_fastq} | md5sum | cut -d " " -f 1)
md5s_equal="false"
if [ "${current_md5}" == "${pr_md5}" ]; then
md5s_equal="true"
fi
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${f}" "${file_size}" "${current_runtime}" "${current_gunzip}" "${current_scrub}" "${current_gzip}" "${pr_runtime}" "${current_md5}" "${pr_md5}" "${md5s_equal}"
done
@kevinlibuit @k-florek @rpetit3 Please accept my apologies: I have not been ignoring your comments, changes, and tests - I appreciate it all. Other priorities had overtaken me, but I am hopeful that this week I can focus on these changes and act accordingly. Thank you for your patience.
@kevinlibuit @k-florek @rpetit3 I must once more ask your forbearance: late Sunday night I got reviews back for a submitted manuscript and the revisions are due Oct 14. Apologies to our good colleagues.
No worries! Haha hope the reviewers didn't ask for too much
@k-florek @kevinlibuit @rpetit3 Again, please accept my apologies for how long this took (I finally got a chance last night). I am going to close this PR because I have created another PR that includes all of Kelsey's changes (thank you @k-florek ), with a tweak here and there mostly for maintaining backwards script compatibility
which I think is need before the next set of changes. In other words, everything should be there. Please check the PR and the branch (SRAB-44-k-florek-changes) and let me know.
Closing in lieu of PR #16
I noticed the PATH had a
;
in it preventing the user from executing thescrub.sh
script without specifying the path. I fixed this here.