ncbi / sra-human-scrubber

An SRA tool that takes as input local fastq file from a clinical infection sample, identifies and removes any significant human read, and outputs the edited (cleaned) fastq file that can safely be used for SRA submission.
Other
42 stars 5 forks source link

fixed path #15

Closed k-florek closed 2 years ago

k-florek commented 2 years ago

I noticed the PATH had a ; in it preventing the user from executing the scrub.sh script without specifying the path. I fixed this here.

k-florek commented 2 years ago

@multikengineer I just pushed a bunch of additional changes to the scrub.sh script. These changes could allow the tool to be applied to a greater number of use cases.

Specifically related to Issue #8 I included the ability to specify an input and output file using -i and -o argument flags but also added the ability to take fastq files from stdin and output to stdout. This allows users to wrap the tool with any compression tool they see necessary. i.e. gunzip sample.fastq.gz | scrub.sh | gzip > sample.clean.fastq.gz without adding dependencies to the tool and docker image. This also works when using docker gunzip sample.fastq.gz | docker run ncbi/sra-human-scrubber bash -c "scrub.sh" | gzip > sample.clean.fastq.gz.

With the additional changes to the argument flags I also added the option to specify a path to the database using -d flag to address issue #10.

I also updated the testing function to use the -t flag for testing the script.

These changes drastically alter the usage of the scrub.sh but I personally feel these changes would be helpful for the community of users. However, I am not aware of all the use cases and design decisions so feel free to take them or leave them!

kevinlibuit commented 2 years ago

Really helpful PR! Just want to second @k-florek's sentiment that these changes would broaden the applications of an already extremely useful resource.

rpetit3 commented 2 years ago

Hi all, I played around with @k-florek PR a bit over the weekend. I ran a few ~15 FASTQs through the current scrubber amd the changes in this PR.

I'm happy to report there were do differences in the outputs between the two. There was a slight speed up in this PR when dealing with larger compressed FASTQs. This was on the scale of minutes, whish can really add up when processing 1000s of samples.

Test results

Times are reported in seconds

fastq   input_size  current_runtime gunzip_runtime  scrub_runtime   gzip_runtime    pr_runtime  time_difference current_md5 pr_md5  md5s_equal
fastqs/nextseq-21060550_S8_L001_R1_001.fastq.gz 1196060293  2825    62  2441    321 1897    928 2cd124c3ca83afc9b525ab98c22b74f1    2cd124c3ca83afc9b525ab98c22b74f1    TRUE
fastqs/nextseq-21060550_S8_L001_R2_001.fastq.gz 1244972907  1149    35  857 256 1103    46  0a3855c7ad1e7b81c572c5e59a350fd9    0a3855c7ad1e7b81c572c5e59a350fd9    TRUE
fastqs/nextseq-21061395_S9_L001_R2_001.fastq.gz 1150902242  1124    31  851 242 932 192 fa800bcad6e58c4180903d3d9f40ccfe    fa800bcad6e58c4180903d3d9f40ccfe    TRUE
fastqs/nextseq-21067837_S45_L001_R1_001.fastq.gz    1374874988  1175    38  814 322 1019    156 d4b9786ac49673144b0cf66e978f8d71    d4b9786ac49673144b0cf66e978f8d71    TRUE
fastqs/nextseq-21067903_S47_L001_R2_001.fastq.gz    1288848103  926 35  601 290 860 66  4f28a878f9854a7f6cec48ecbb9c3e12    4f28a878f9854a7f6cec48ecbb9c3e12    TRUE
fastqs/nextseq-21069148_S96_L001_R2_001.fastq.gz    1158793737  1051    32  793 225 871 180 528656393d0febc8a8c4c5feb2301bc9    528656393d0febc8a8c4c5feb2301bc9    TRUE
fastqs/nextseq-21070200_S17_L001_R1_001.fastq.gz    1354315407  1591    37  1231    323 1494    97  b46183d1ed89b694acd6de616af32917    b46183d1ed89b694acd6de616af32917    TRUE
fastqs/nextseq-21071460_S11_L001_R2_001.fastq.gz    1190840921  1368    31  1044    293 1221    147 276b2ee12d5b8a84434b7f4441106a74    276b2ee12d5b8a84434b7f4441106a74    TRUE
fastqs/nextseq-21071905_S15_L001_R1_001.fastq.gz    1100936394  1039    30  760 249 985 54  4ec324bdb862514bce48ae15d8e806a1    4ec324bdb862514bce48ae15d8e806a1    TRUE
fastqs/nextseq-21072369_S4_L001_R1_001.fastq.gz 1377452873  1565    38  1198    328 1334    231 30c58af27c0d60dae6913b6ddcf73e14    30c58af27c0d60dae6913b6ddcf73e14    TRUE
fastqs/ont-D01_20120905.fastq.gz    101906832   45  2   33  10  45  0   6a6b5eb9ecd1dd0a223519f5c0764051    6a6b5eb9ecd1dd0a223519f5c0764051    TRUE
fastqs/ont-D03_20150057.fastq.gz    184218996   85  4   63  18  84  1   5753e0411e442f07bd014e8ecb75ba31    5753e0411e442f07bd014e8ecb75ba31    TRUE
fastqs/ont-E01_21052359.fastq.gz    177265365   85  4   63  18  84  1   5b43ffa15dbce4cda2233816039e2b82    5b43ffa15dbce4cda2233816039e2b82    TRUE
fastqs/ont-E03_21049208.fastq.gz    123230954   56  3   41  12  55  1   fab8a46e99d6667ee22fbf67250e1bdd    fab8a46e99d6667ee22fbf67250e1bdd    TRUE
fastqs/ont-H02_21048589.fastq.gz    194287953   98  4   73  21  97  1   a9773b5d710ea66c446dc42b7928172d    a9773b5d710ea66c446dc42b7928172d    TRUE

Script used for testing

heres the script I used for testing, let me know if there is any glaring issues

#! /bin/bash

SCRUBBER_DB=${SCRUBBER_SHARE}/data/human_filter.db

# Current scrubber
echo "Run test on current Scrubber"
which scrub.sh
scrub.sh test

echo "Run test on PR Scrubber"
sra-human-scrubber/scripts/scrub.sh -t -d ${SCRUBBER_SHARE}/data/human_filter.db

echo "Compare scrubbers"
printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "fastq" "input_size" "current_runtime" "gunzip_runtime" "scrub_runtime" "gzip_runtime" "pr_runtime" "current_md5" "pr_md5" "md5s_equal"
rm -rf current_scrubber pr_scrubber
mkdir current_scrubber pr_scrubber
for f in fastqs/*.fastq.gz; do 
    file_size=$(du ${f} -b | cut -f 1)
    file_name="$(basename ${f%%.*})"
    # Test urrent scrubber
    start=`date +%s`
    start_gunzip=`date +%s`
    gunzip -c ${f} > current_scrubber/${file_name}.fastq
    end_gunzip=`date +%s`
    start_scrub=`date +%s`
    scrub.sh current_scrubber/${file_name}.fastq
    end_scrub=`date +%s`
    start_gzip=`date +%s`
    gzip current_scrubber/${file_name}.fastq.clean
    end_gzip=`date +%s`
    rm current_scrubber/${file_name}.fastq
    end=`date +%s`
    current_runtime=$((end-start))
    current_gunzip=$((end_gunzip-start_gunzip))
    current_scrub=$((end_scrub-start_scrub))
    current_gzip=$((end_gzip-start_gzip))
    current_fastq="current_scrubber/${file_name}.fastq.clean.gz"

    # test PR scrubber
    pr_fastq="pr_scrubber/${file_name}.clean.fastq.gz"
    start=`date +%s`
    zcat ${f} | sra-human-scrubber/scripts/scrub.sh -d ${SCRUBBER_SHARE}/data/human_filter.db | gzip > ${pr_fastq}
    end=`date +%s`
    pr_runtime=$((end-start))

    # test md5s of fastq contents
    current_md5=$(zcat ${current_fastq} | md5sum | cut -d " " -f 1)
    pr_md5=$(zcat ${pr_fastq} | md5sum | cut -d " " -f 1)
    md5s_equal="false"
    if [ "${current_md5}" == "${pr_md5}" ]; then
        md5s_equal="true"
    fi

    printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" "${f}" "${file_size}" "${current_runtime}" "${current_gunzip}" "${current_scrub}" "${current_gzip}" "${pr_runtime}" "${current_md5}" "${pr_md5}" "${md5s_equal}"
done
multikengineer commented 2 years ago

@kevinlibuit @k-florek @rpetit3 Please accept my apologies: I have not been ignoring your comments, changes, and tests - I appreciate it all. Other priorities had overtaken me, but I am hopeful that this week I can focus on these changes and act accordingly. Thank you for your patience.

multikengineer commented 2 years ago

@kevinlibuit @k-florek @rpetit3 I must once more ask your forbearance: late Sunday night I got reviews back for a submitted manuscript and the revisions are due Oct 14. Apologies to our good colleagues.

rpetit3 commented 2 years ago

No worries! Haha hope the reviewers didn't ask for too much

multikengineer commented 2 years ago

@k-florek @kevinlibuit @rpetit3 Again, please accept my apologies for how long this took (I finally got a chance last night). I am going to close this PR because I have created another PR that includes all of Kelsey's changes (thank you @k-florek ), with a tweak here and there mostly for maintaining backwards script compatibility which I think is need before the next set of changes. In other words, everything should be there. Please check the PR and the branch (SRAB-44-k-florek-changes) and let me know.

multikengineer commented 2 years ago

Closing in lieu of PR #16