Issue opened by kevinlibuit, closed 3 years ago:
I assume this is the repo for us.gcr.io/ncbi-research-sra-dataload/test-scrub:latest /opt/scrubber/scripts/scrub.sh?
If so, I'm curious to know whether the scrub.sh command has an option for compressed read files (e.g. .fastq.gz). I've been running it as demonstrated in the SPHERES video, but I am working with .gz files rather than plain FASTQ files and am running into an error.
I am also working with .gz files and I would really appreciate an option for using the scrub.sh tool on compressed files.
@kevinlibuit & @IsabelFE Sorry for the tardy reply. There is currently no option for reading or writing zipped files, and given how easily any pipeline can add an unzip/zip step (or any other desired flavor of compression), I'd rather not add one. However, I will pass along the request.
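For readers wondering what that pipeline-side wrapping looks like, here is a minimal sketch. The file names are placeholders, and it assumes (as the fork invocation and the scrub.sh excerpt later in this thread suggest) that scrub.sh takes a FASTQ path as an argument and writes a .clean file alongside it:
# Hypothetical wrapper: decompress, scrub, recompress
gunzip -c reads.fastq.gz > reads.fastq            # decompress a working copy
bash scripts/scrub.sh reads.fastq                 # assumed to write reads.fastq.clean
gzip -c reads.fastq.clean > reads.clean.fastq.gz  # recompress the scrubbed output
rm reads.fastq reads.fastq.clean                  # remove the intermediates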
Hi @kevinlibuit and @IsabelFE,
I have a fork with this feature; would you be interested in testing it out?
# Clone the fork and check out the gzip-support branch
git clone git@github.com:rpetit3/sra-human-scrubber.git
cd sra-human-scrubber
git checkout -b rp3-add-gzip-support
git branch --set-upstream-to=origin/rp3-add-gzip-support rp3-add-gzip-support
git pull
# Get human data via init script
bash ./init_db.sh
# OR through Docker container (if already available)
docker pull ncbi/sra-human-scrubber:latest
docker run -it -v $PWD:$PWD:rw -w $PWD ncbi/sra-human-scrubber:latest cp /opt/scrubber/data/human_filter.db ./data/
# run test
bash scripts/scrub.sh test
bash scripts/scrub.sh test_gz
# Try with your own data
bash scripts/scrub.sh -n -z YOUR.FASTQ.GZ
Curious to see your results!
Thank you, Robert
Robert - The issue was not difficulty in adding compression/decompression calls. Most people use this tool in a pipeline or workflow where the users control and set the environment, and there it is relatively simple to add compression/decompression support. This repo is also used to build a Docker image: there is no guarantee that the base image will not change, and if a different image were chosen that lacked compression/decompression, or shipped a different implementation, it would create a difficult situation. In this context, I still feel it is wiser to keep scrubber with as few dependencies as possible, to avoid problems in the future.
Hi @multikengineer!
I think everyone here would agree that adding gzip support was fairly simple. It seemed appropriate to add, since gzip is already available in your Docker container and the gzip module is available in the Python standard library. Given that you control the Dockerfile, I think supporting gzip is straightforward and not difficult at all.
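As a quick sanity check of those two claims, both can be verified against the published image; a sketch, assuming the container accepts arbitrary commands the way the cp invocation above does:
# Confirm the gzip binary and the Python gzip module exist inside the container
docker run --rm ncbi/sra-human-scrubber:latest gzip --version
docker run --rm ncbi/sra-human-scrubber:latest python -c "import gzip; print('gzip module OK')"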
Having looked at the process for scrubbing reads, though:
# From scrub.sh: first convert the FASTQ to FASTA for the aligner...
python "$ROOT/scripts/fastq_to_fasta.py" < "$fastq" > "$fastq.fasta"
# ...then align against the human database and cut the identified spots from the original FASTQ
${ROOT}/bin/aligns_to -db "$ROOT/data/human_filter.db" "$fastq.fasta" | "$ROOT/scripts/cut_spots_fastq.py" "$fastq" ${OPTS} > "$fastq.clean"
I wonder if you could elaborate on the methods. I'm trying to figure out why a tool designed for SRA doesn't take FASTQ inputs directly, but instead converts the FASTQ to FASTA and then back again.
What's the rationale for these conversions?
It's pretty easy to see that aligns_to is very efficient, but that efficiency is lost once you account for the format-conversion overhead. I played around with a large metagenomic sample (~8 GB compressed): it took ~30 minutes to complete, but about half of that time was spent on the conversions (6 minutes to convert FASTQ to FASTA, 10 minutes to write the cleaned FASTQ). These times do not include the time needed to uncompress and recompress the FASTQ.
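For anyone wanting to reproduce that breakdown, each stage can be timed on its own; a rough sketch from the repo root, with my_big.fastq as a placeholder input (saving the aligns_to output to a file and redirecting it back in should be equivalent to the pipe in scrub.sh):
# Time the two conversions and the alignment separately (placeholder file names)
time python scripts/fastq_to_fasta.py < my_big.fastq > my_big.fasta
time bin/aligns_to -db data/human_filter.db my_big.fasta > my_big.spots
time python scripts/cut_spots_fastq.py my_big.fastq < my_big.spots > my_big.clean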
What are your thoughts on aligns_to taking FASTQ data from STDIN and printing cleaned reads to STDOUT?
zcat my_seqs.fastq.gz | aligns_to -db human_filter.db | gzip -c > my_seqs.clean.fastq.gz
# OR
zcat my_super_big_metagenomic.fastq.gz | aligns_to -db human_filter.db | pigz -p 8 -c > my_super_big_metagenomic.clean.fastq.gz
I'd be happy to give it a go, but it looks like the source code for aligns_to is not available (?). Please let me know if that's not the case.
Either way, I think this would remove the need for a gzip (or bzip2) dependency while still letting users read a single file and write a single file.
Looking forward to your reply, and thank you very much for your time! Robert
"What's the rationale for these conversions?"
aligns_to was not designed to read FASTQ.
"source code for aligns_to"
Ah, I see. It's unfortunate that aligns_to doesn't support FASTQs, but thank you for pointing me to the source.
Do you think it would be worthwhile to pursue adding FASTQ support to aligns_to? If you don't think it'll be well received, then I won't pursue it.
@rpetit3 Thanks for your understanding. The issue is not really adding FASTQ support. We know from other experience that keeping the needed information in memory to match the identified spots for removal becomes a significant problem.
Closing, as I have explained my reasons. Should factors change, it can be re-opened.