Closed. kfuku52 closed this issue 2 years ago.
Yeah, seqkit takes much more time the bigger the samples get: close to half an hour for the untrimmed Drosophyllum sample vs 5 minutes for the trimmed one.
I'll see if I can limit seqkit to the first few sequences.
So, this absolutely works and is instant:
zcat < DLF1_1.fq.gz | head -n 4000 | seqkit stats
As for the read number, we can skip the grep, use just wc -l,
and divide by 4, which shaves off about a minute for this sample (68,149,702 reads).
56 seconds without grep:
echo $(zcat < SRR14322310_1.amalgkit.fastq.gz | wc -l)/4 | bc
1 minute 53 seconds with grep:
zcat < SRR14322310_1.amalgkit.fastq.gz | grep "^@" | wc -l
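The line-count trick could be wrapped in Python along these lines (a sketch; `count_reads` is a hypothetical name, not amalgkit's actual code, and `gzip -cd` is substituted for zcat because it behaves the same on Linux and macOS):

```python
import subprocess

def count_reads(fastq_gz):
    """Count reads in a gzipped FASTQ: total lines divided by 4.

    Every FASTQ record is exactly 4 lines, so no grep is needed;
    '@' can also appear in quality strings, so grep "^@" may even
    over-count. `gzip -cd` is used instead of zcat because macOS
    zcat expects a .Z suffix.
    """
    gz = subprocess.Popen(['gzip', '-cd', fastq_gz], stdout=subprocess.PIPE)
    wc = subprocess.run(['wc', '-l'], stdin=gz.stdout,
                        capture_output=True, text=True)
    gz.stdout.close()
    gz.wait()
    return int(wc.stdout.strip()) // 4
```

Using `gzip -cd` would also sidestep the macOS zcat quirk discussed below, since the `gzip` binary takes filenames directly on both systems.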
What's very annoying about this approach is that on macOS, zcat has to be called like this:
zcat < SRR14322310_1.amalgkit.fastq.gz
instead of just:
zcat SRR14322310_1.amalgkit.fastq.gz
Hmm... that complicates the situation. We shouldn't sacrifice portability. Maybe we should only start tweaking integrate
once we find a solution that works on both macOS and Linux.
We can work around this by checking the OS and adjusting the command:
import platform
platform.system()
> 'Darwin'
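Put together, the OS check could look something like this (a minimal sketch under the assumption above; `zcat_command` is a hypothetical helper name):

```python
import platform

def zcat_command(path):
    """Return a decompression command string for the current OS.

    Sketch only: on macOS (platform.system() == 'Darwin'), `zcat FILE`
    appends a .Z suffix to the filename, so the file is fed via stdin
    redirection instead; on Linux, zcat takes the filename directly.
    """
    if platform.system() == 'Darwin':
        return "zcat < '{}'".format(path)
    return "zcat '{}'".format(path)
```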
Okay, I implemented the changes to the seqkit input and here is a quick comparison between the old implementation (full seqkit) and the new one (head seqkit). The top 2 entries are the old implementation for single and paired samples and the bottom 2 are the same samples calculated with the new implementation.
scientific_name | curate_group | run | read1_path | read2_path | is_sampled | is_qualified | exclusion | lib_layout | spot_length | total_spots | total_bases | size | private_file | runtime |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
full seqkit, single | flower | trimmed-PAuF | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | no path | yes | yes | no | single | 148.4 | 31190694 | 4627300802 | 2319701462 | yes | 46s |
full seqkit, paired | flower | trimmed-PAuF | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_2.fq.gz | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | yes | yes | no | paired | 148.3 | 31190694 | 9253736528 | 2445176519 | yes | 46s |
head seqkit, single | flower | trimmed-PAuF | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | no path | yes | yes | no | single | 148.4 | 31190694 | 4616222712 | 2319701462 | yes | 27s |
head seqkit, paired | flower | trimmed-PAuF | /Users/s229181/Desktop/projects/sanity_test_wd/trimmed-PAuF_2.fq.gz | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | yes | yes | no | paired | 148.4 | 31190694 | 9232445424 | 2445176519 | yes | 27s |
As expected, the total bases are not completely accurate. For the paired samples, they differ by about 21 million bases. That sounds like a lot, but it's only a 0.2% difference. The loss of accuracy may well be worth the gain in speed. This particular sample is already trimmed and fairly small, so the old runs didn't take too long. I'll test this with larger files, where I know the old version took much longer per sample.
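The estimate behind these numbers (mean length of the first reads times the total read count) could be sketched like this in Python (`estimate_total_bases` is a hypothetical name, not amalgkit's API):

```python
import gzip
from itertools import islice

def estimate_total_bases(fastq_gz, total_reads, n_sample=1000):
    """Estimate total bases without reading the whole file.

    Reads only the first n_sample FASTQ records, takes the mean
    sequence length, and multiplies by the known total read count.
    Slightly off if read lengths vary across the file (e.g. after
    trimming), which is where the ~0.2% difference comes from.
    """
    lengths = []
    with gzip.open(fastq_gz, 'rt') as fh:
        for i, line in enumerate(islice(fh, n_sample * 4)):
            if i % 4 == 1:  # the sequence line of each 4-line record
                lengths.append(len(line.rstrip('\n')))
    mean_len = sum(lengths) / len(lengths)
    return int(mean_len * total_reads)
```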
As a side note, this process could be made even faster by using GNU parallel with zcat. However, that runs into the macOS problem again: parallel has to be installed via brew first.
I will test the potential speed gain, and if it's significant, I'll implement a branch that uses parallel
if it's installed and plain zcat if not.
UPDATE ON THIS:
It looks like .gz decompression cannot be parallelized, so zcat is as fast as it gets for now.
So, the new implementation processed 21 samples in 1,254 seconds (~21 minutes).
Those same 21 samples took 6,637 seconds (~1 hour and 51 minutes) in the old version.
Update pushed in 37fc28761c4e1872e131c4fce7b099efb0ca2152
Thank you. It seems that you removed the old implementation, but the change should have been introduced as an option, maybe with the faster mode as the default, e.g., --accurate_size defaulting to 'no'. Could you restore the original, slow but completely accurate mode as an option?
When we need to introduce a change that is not a pure improvement and sacrifices some aspect (here, the accuracy), we should be able to choose it upon use.
Of course, that makes a lot of sense! On it!
Reminder for myself to reintroduce the slow method.
--accurate_size yes|no
is now a parameter for running the slow but more accurate seqkit.
https://github.com/kfuku52/amalgkit/commit/acff42d3c652bee8efc95ccaf8993afa303588d9
integrate
requires several minutes per fq.gz. Perhaps seqkit stats
doesn't have to read all fastq reads; the first 100 reads or so should be enough to obtain the read length. Total size is read length x read number. However, getting the read number wouldn't be straightforward: zcat C1D_1.fq.gz | grep "^@" | wc -l
took a minute or so. An alternative approach would be necessary for a substantial speed-up.