ncbi / sra-human-scrubber

An SRA tool that takes as input a local fastq file from a clinical infection sample, identifies and removes any significant human reads, and outputs the edited (cleaned) fastq file that can safely be used for SRA submission.

Cannot process .gz files #4

Closed: kevinlibuit closed this issue 2 years ago

kevinlibuit commented 3 years ago

I assume this is the repo for us.gcr.io/ncbi-research-sra-dataload/test-scrub:latest /opt/scrubber/scripts/scrub.sh?

If so, I'm curious to know if the scrub.sh command has some sort of option for compressed read files (e.g. .fastq.gz). I've been running it as demonstrated in the SPHERES video, but am working with .gz files rather than straight fastq files and am running into this error:

Traceback (most recent call last):
  File "/opt/scrubber/scripts/fastq_to_fasta.py", line 26, in <module>
    main()
  File "/opt/scrubber/scripts/fastq_to_fasta.py", line 19, in main
    raise line
TypeError: exceptions must be old-style classes or derived from BaseException, not str
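
For context, fastq_to_fasta.py appears to read the input as plain text, so the gzip-compressed bytes don't parse as FASTQ, and the script's error path (raise line, where line is a plain string) produces the TypeError above. The invocation was roughly along these lines (a sketch only: the image and script path are as above, and passing the fastq path as a positional argument is an assumption):

docker run -it -v $PWD:$PWD:rw -w $PWD us.gcr.io/ncbi-research-sra-dataload/test-scrub:latest /opt/scrubber/scripts/scrub.sh sample.fastq.gz    # placeholder .gz input
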
IsabelFE commented 3 years ago

I am also working with .gz files and I would really appreciate an option for using the scrub.sh tool on compressed files.

multikengineer commented 3 years ago

@kevinlibuit & @IsabelFE Sorry for the tardy reply. There is currently no option for reading or writing zipped files, and given how easily any pipeline can add unzip/zip (or any other desired flavor of compression), I'd rather not add that. However, I will pass the request along.
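
For anyone landing on this issue, a minimal sketch of that pipeline-level approach (filenames are placeholders, passing the fastq path straight to scrub.sh is assumed, and the .clean suffix matches how scrub.sh names its output):

# Decompress outside the scrubber, scrub, then recompress the cleaned result.
zcat sample.fastq.gz > sample.fastq
bash /opt/scrubber/scripts/scrub.sh sample.fastq       # writes sample.fastq.clean
gzip -c sample.fastq.clean > sample.clean.fastq.gz
rm sample.fastq sample.fastq.clean                     # remove the uncompressed intermediates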

rpetit3 commented 2 years ago

Hi @kevinlibuit and @IsabelFE,

I have a fork with this feature; would you be interested in testing it out?

git clone git@github.com:rpetit3/sra-human-scrubber.git
cd sra-human-scrubber
git checkout -b rp3-add-gzip-support
git branch --set-upstream-to=origin/rp3-add-gzip-support rp3-add-gzip-support
git pull

# Get human data via init script
bash ./init_db.sh 

# OR through Docker container (if already available)
docker pull ncbi/sra-human-scrubber:latest
docker run -it -v $PWD:$PWD:rw -w $PWD ncbi/sra-human-scrubber:latest cp /opt/scrubber/data/human_filter.db ./data/

# run test
bash scripts/scrub.sh test
bash scripts/scrub.sh test_gz

# Try with your own data
bash scripts/scrub.sh -n -z YOUR.FASTQ.GZ

Curious to see your results!

Thank you, Robert

multikengineer commented 2 years ago

Robert - The issue was not the difficulty of adding compression / decompression calls. Most people use this tool in a pipeline / workflow where the user(s) control and set the environment, and it is relatively simple to add compression / decompression support there. This repo is (also) used to build a docker image: there is no guarantee that the base image will not change, and if a different image were chosen that did not have compression / decompression available, or had a different implementation, it would create a difficult situation. In this context, I still feel it is wiser to keep the scrubber with as few dependencies as possible to avoid problems in the future.

rpetit3 commented 2 years ago

Hi @multikengineer!

I think everyone here would agree that adding gzip support was fairly simple. It seemed appropriate to add it, since gzip is already available in your docker container and the gzip module is available in the Python standard library. Given that you have control of the Dockerfile, I think supporting gzip is pretty straightforward and not difficult at all.
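
(A quick way to confirm the first point, assuming the published image runs arbitrary commands the way the cp example above does:)

# gzip should already be present in the published scrubber image
docker run --rm ncbi/sra-human-scrubber:latest gzip --version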

That said, having looked at the process for scrubbing reads:

python "$ROOT/scripts/fastq_to_fasta.py" < "$fastq" > "$fastq.fasta"
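# aligns_to flags human spots in the FASTA; cut_spots_fastq.py then removes them from the original FASTQ to produce $fastq.clean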
${ROOT}/bin/aligns_to -db "$ROOT/data/human_filter.db" "$fastq.fasta" | "$ROOT/scripts/cut_spots_fastq.py" "$fastq" ${OPTS} > "$fastq.clean"

I wonder if you could elaborate on the methods. I'm trying to figure out why a tool designed for SRA doesn't take FASTQ input directly, but instead:

  1. converts the FASTQ to FASTA
  2. scrubs human reads from the FASTA
  3. reads both the FASTA and original FASTQ to create the final cleaned FASTQ

What's the rationale for these conversions?

It's pretty easy to see that aligns_to is very efficient, but that efficiency is lost once the format-conversion overhead is accounted for. I played around with a large metagenomic sample (~8 GB compressed): it took ~30 minutes to complete, but about half of that time was spent on the conversions (6 minutes to convert the FASTQ to FASTA, 10 minutes to write the cleaned FASTQ). These times do not include the time needed to uncompress and recompress the FASTQ.

What are your thoughts on aligns_to taking FASTQ data from STDIN and printing cleaned reads to STDOUT?

zcat my_seqs.fastq.gz | aligns_to -db human_filter.db | gzip -c >  my_seqs.clean.fastq.gz

# OR 

zcat my_super_big_metagenomic.fastq.gz | aligns_to -db human_filter.db | pigz -p 8 -c >  my_super_big_metagenomic.clean.fastq.gz

I'd be happy to give it a go, but it looks like the source code for aligns_to is not available (?). Please let me know if that's not the case.

Either way, I think this would alleviate the need for a gzip (or bzip) dependency while still allowing users to read a single file and write a single file.

Looking forward to your reply, and thank you very much for your time! Robert

multikengineer commented 2 years ago

What's the rationale for these conversions?

aligns_to was not designed to read fastq.

source code for aligns_to

https://github.com/ncbi/ngs-tools/tree/tax/tools/tax/src

rpetit3 commented 2 years ago

Ah, I see. It's unfortunate that aligns_to doesn't support FASTQ, but thank you for pointing me to the source.

Do you think it would be worthwhile to pursue adding FASTQ support to aligns_to? If you don't think it'll be well received, then I won't pursue it.

multikengineer commented 2 years ago

@rpetit3 Thanks for your understanding. The issue is not really adding FASTQ support. We know from other experience that keeping the needed information in memory, in order to match the identified spots for removal, becomes a significant problem.

multikengineer commented 2 years ago

Closing, as I have explained my reasons above; should factors change, the issue can be re-opened.