kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

amalgkit integrate is too slow #69

Closed kfuku52 closed 1 year ago

kfuku52 commented 2 years ago

integrate requires several minutes per fq.gz. Perhaps `seqkit stats` doesn't have to read all fastq reads; the first 100 reads or so should be enough to obtain the read length, and the total size is read length x read number. However, getting the read number wouldn't be straightforward: `zcat C1D_1.fq.gz | grep "^@" | wc -l` took a minute or so. An alternative approach would be necessary for a substantial speed-up.
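That idea could be sketched in Python roughly as follows (hypothetical helper names, not amalgkit's actual code): decompress only the head of the gzipped FASTQ to get the mean read length, then extrapolate the total bases from a separately obtained read count.

```python
import gzip
import itertools

def estimate_read_length(fastq_gz_path, n_reads=100):
    """Mean read length over the first n_reads records of a gzipped FASTQ.
    Only the head of the file is decompressed, so this is fast regardless
    of file size."""
    lengths = []
    with gzip.open(fastq_gz_path, "rt") as fh:
        # A FASTQ record is 4 lines; the sequence is the 2nd line of each record.
        for record in itertools.islice(zip(*[fh] * 4), n_reads):
            lengths.append(len(record[1].rstrip("\n")))
    return sum(lengths) / len(lengths)

def estimate_total_bases(fastq_gz_path, num_reads, n_sample=100):
    """Total bases ~= mean read length x read count."""
    return estimate_read_length(fastq_gz_path, n_sample) * num_reads
```

The estimate is exact when all reads have the same length and approximate after trimming, which is exactly the accuracy trade-off discussed below in this thread.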

Hego-CCTB commented 2 years ago

Yeah, seqkit takes disproportionately longer the bigger the samples get. It's close to half an hour vs. 5 minutes for untrimmed vs. trimmed Drosophyllum samples.

I'll see if I can limit seqkit to the first few sequences.

Hego-CCTB commented 2 years ago

So, this absolutely works and is instant: `zcat < DLF1_1.fq.gz | head -n 4000 | seqkit stats`

As for the read number, we can skip the grep, use just `wc -l`, and divide by 4, which shaves off about a minute for this sample (68149702 reads).

56 seconds without grep:

```
echo $(zcat < SRR14322310_1.amalgkit.fastq.gz | wc -l)/4 | bc
```

1 minute 53 with grep:

```
zcat < SRR14322310_1.amalgkit.fastq.gz | grep "^@" | wc -l
```
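The line-count approach could also be done directly in Python (a sketch; `count_reads` is a hypothetical name). As a bonus, dividing the line count by 4 is not just faster but also safer than `grep "^@"`: a quality line may itself start with `@` (it encodes Phred score 31), which would inflate the grep count.

```python
import gzip

def count_reads(fastq_gz_path):
    """Read count = line count / 4. Avoids per-line pattern matching and the
    overcount that grep '^@' produces when quality strings start with '@'."""
    with gzip.open(fastq_gz_path, "rb") as fh:
        n_lines = sum(1 for _ in fh)
    return n_lines // 4
```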

What's very annoying with this approach is that on mac, zcat has to be called like this:

```
zcat < SRR14322310_1.amalgkit.fastq.gz
```

instead of just:

```
zcat SRR14322310_1.amalgkit.fastq.gz
```

kfuku52 commented 2 years ago

Hmm... that makes the situation complicated. We shouldn't sacrifice portability. Maybe we should hold off on tweaking integrate until we find a solution that is universally usable on mac and linux.

Hego-CCTB commented 2 years ago

We can work around this by checking the OS and adjusting the command:

```python
>>> import platform
>>> platform.system()
'Darwin'
```

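A hypothetical helper built on that check could pick the right invocation per OS (a sketch only, not amalgkit's actual code):

```python
import platform

def zcat_command(path):
    """Return a zcat command line that works on the current OS.

    On macOS (Darwin), `zcat FILE` expects compress-style .Z input and fails
    on .gz files, so the file has to be fed via stdin redirection instead.
    Hypothetical helper name, for illustration only."""
    if platform.system() == "Darwin":
        return "zcat < " + path
    return "zcat " + path
```

Alternatively, `gunzip -c FILE` accepts a filename argument on both macOS and Linux and would avoid the branch entirely.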
Hego-CCTB commented 2 years ago

Okay, I implemented the changes to the seqkit input, and here is a quick comparison between the old implementation (full seqkit) and the new one (head seqkit). The top two entries are the old implementation for single and paired samples, and the bottom two are the same samples calculated with the new implementation.

| implementation | curate_group | run | read1_path | read2_path | is_sampled | is_qualified | exclusion | lib_layout | spot_length | total_spots | total_bases | size | private_file | runtime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| full seqkit, single | flower | trimmed-PAuF | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | no path | yes | yes | no | single | 148.4 | 31190694 | 4627300802 | 2319701462 | yes | 46s |
| full seqkit, paired | flower | trimmed-PAuF | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_2.fq.gz | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | yes | yes | no | paired | 148.3 | 31190694 | 9253736528 | 2445176519 | yes | 46s |
| head seqkit, single | flower | trimmed-PAuF | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | no path | yes | yes | no | single | 148.4 | 31190694 | 4616222712 | 2319701462 | yes | 27s |
| head seqkit, paired | flower | trimmed-PAuF | /Users/s229181/Desktop/projects/sanity_test_wd/trimmed-PAuF_2.fq.gz | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | yes | yes | no | paired | 148.4 | 31190694 | 9232445424 | 2445176519 | yes | 27s |

As expected, the total bases are not completely accurate. For the paired samples, they differ by about 21 million bases. This sounds like a lot, but it's just a 0.2% difference. The loss of accuracy may be well worth the gain in speed. This particular sample is already trimmed and fairly small, so the old runs didn't take too long. I'll test this with larger files, where I know that the old version took a lot longer per sample.
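The 0.2% figure can be verified from the paired rows of the comparison above:

```python
# Paired-sample totals from the comparison table above.
old_total = 9253736528  # full seqkit (exact)
new_total = 9232445424  # head seqkit (estimated)

diff = old_total - new_total  # ~21.3 million bases
rel_err = diff / old_total    # ~0.0023, i.e. about 0.23%
```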

Hego-CCTB commented 2 years ago

As a side note, this process could be made even faster by using GNU parallel with zcat. However, we run into the macOS problem again: parallel needs to be installed via brew first.

I will test the potential gain in speed, and if it's significant, I'll implement a branch that uses parallel if it's installed and plain zcat if not.

UPDATE ON THIS:

Looks like .gz decompression cannot be parallelized (a gzip stream has to be decompressed sequentially; even pigz only parallelizes compression). So zcat is as fast as it gets currently.

Hego-CCTB commented 2 years ago

So, the new implementation processed 21 samples in 1254 seconds (~21 minutes).

Those same 21 samples took 6637 seconds (~1 hour and 51 minutes) with the old version.

Update pushed in 37fc28761c4e1872e131c4fce7b099efb0ca2152

kfuku52 commented 2 years ago

Thank you. It seems that you removed the old implementation, but the change should have been introduced as an option, with the faster mode as the default, e.g. --accurate_size defaulting to 'no'. Could you restore the original, slow but completely accurate mode as an option?

kfuku52 commented 2 years ago

When we introduce a change that is not a pure improvement and sacrifices some aspect (here, the accuracy), we should be able to choose it upon use.

Hego-CCTB commented 2 years ago

Of course, that makes a lot of sense! On it!

Hego-CCTB commented 2 years ago

Reminder for myself to reintroduce the slow method.

Hego-CCTB commented 1 year ago

`--accurate_size yes|no` is now a parameter to run the slow, but more accurate, full seqkit.

https://github.com/kfuku52/amalgkit/commit/acff42d3c652bee8efc95ccaf8993afa303588d9
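For illustration, the flag could be wired up with argparse roughly like this (a sketch with the fast mode as default; the flag name matches the thread, but the surrounding code is not amalgkit's actual parser):

```python
import argparse

# Illustrative only: fast estimation is the default, the exact full-seqkit
# pass must be requested explicitly with --accurate_size yes.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--accurate_size", choices=["yes", "no"], default="no",
    help="yes: run seqkit over the whole file (slow, exact); "
         "no: estimate totals from the first reads (fast, approximate)",
)

args = parser.parse_args([])  # no flag given -> fast mode
```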