kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

amalgkit integrate is too slow #69

Closed kfuku52 closed 1 year ago

kfuku52 commented 2 years ago

integrate requires several minutes per fq.gz. Perhaps `seqkit stats` doesn't have to read all fastq reads; the first 100 reads or so should be enough to obtain the read length, and the total size is read length x read number. However, getting the read number wouldn't be straightforward: `zcat C1D_1.fq.gz | grep "^@" | wc -l` took a minute or so. An alternative approach would be necessary for a substantial speed-up.
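That idea could be sketched in Python roughly as follows (hypothetical helper names, not amalgkit's actual code): decompress only the head of the gzipped FASTQ to get the mean read length, then extrapolate the total bases from a separately obtained read count.

```python
import gzip
import itertools

def estimate_read_length(fastq_gz_path, n_reads=100):
    """Mean read length over the first n_reads records of a gzipped FASTQ.
    Only the head of the file is decompressed, so this is fast regardless
    of file size."""
    lengths = []
    with gzip.open(fastq_gz_path, "rt") as fh:
        # A FASTQ record is 4 lines; the sequence is the 2nd line of each record.
        for record in itertools.islice(zip(*[fh] * 4), n_reads):
            lengths.append(len(record[1].rstrip("\n")))
    return sum(lengths) / len(lengths)

def estimate_total_bases(fastq_gz_path, num_reads, n_sample=100):
    """Total bases ~= mean read length x read count."""
    return estimate_read_length(fastq_gz_path, n_sample) * num_reads
```

The estimate is exact when all reads have the same length and approximate after trimming, which is exactly the accuracy trade-off discussed below in this thread.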

Hego-CCTB commented 2 years ago

Yeah, seqkit takes disproportionately longer the bigger the samples get. It's close to half an hour vs. 5 minutes for untrimmed vs. trimmed Drosophyllum samples.

I'll see if I can limit seqkit to the first few sequences.

Hego-CCTB commented 2 years ago

So, this absolutely works and is instant: `zcat < DLF1_1.fq.gz | head -n 4000 | seqkit stats`

As for the read number, we can skip the grep, use just `wc -l`, and divide by 4, which shaves off about a minute for this sample (68149702 reads).

56 seconds without grep:

```
echo $(zcat < SRR14322310_1.amalgkit.fastq.gz | wc -l)/4 | bc
```

1 minute 53 with grep:

```
zcat < SRR14322310_1.amalgkit.fastq.gz | grep "^@" | wc -l
```
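The line-count approach could also be done directly in Python (a sketch; `count_reads` is a hypothetical name). As a bonus, dividing the line count by 4 is not just faster but also safer than `grep "^@"`: a quality line may itself start with `@` (it encodes Phred score 31), which would inflate the grep count.

```python
import gzip

def count_reads(fastq_gz_path):
    """Read count = line count / 4. Avoids per-line pattern matching and the
    overcount that grep '^@' produces when quality strings start with '@'."""
    with gzip.open(fastq_gz_path, "rb") as fh:
        n_lines = sum(1 for _ in fh)
    return n_lines // 4
```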

What's very annoying with this approach is that on mac, zcat has to be called like this:

```
zcat < SRR14322310_1.amalgkit.fastq.gz
```

instead of just:

```
zcat SRR14322310_1.amalgkit.fastq.gz
```

kfuku52 commented 2 years ago

Hmm... that makes the situation complicated. We shouldn't sacrifice portability. Maybe we should hold off on tweaking integrate until we find a solution that is universally usable on mac and linux.

Hego-CCTB commented 2 years ago

We can work around this by checking the OS and adjusting the command:

```python
>>> import platform
>>> platform.system()
'Darwin'
```

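A hypothetical helper built on that check could pick the right invocation per OS (a sketch only, not amalgkit's actual code):

```python
import platform

def zcat_command(path):
    """Return a zcat command line that works on the current OS.

    On macOS (Darwin), `zcat FILE` expects compress-style .Z input and fails
    on .gz files, so the file has to be fed via stdin redirection instead.
    Hypothetical helper name, for illustration only."""
    if platform.system() == "Darwin":
        return "zcat < " + path
    return "zcat " + path
```

Alternatively, `gunzip -c FILE` accepts a filename argument on both macOS and Linux and would avoid the branch entirely.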
Hego-CCTB commented 2 years ago

Okay, I implemented the changes to the seqkit input, and here is a quick comparison between the old implementation (full seqkit) and the new one (head seqkit). The top two entries are the old implementation for single and paired samples, and the bottom two are the same samples calculated with the new implementation.

| implementation | curate_group | run | read1_path | read2_path | is_sampled | is_qualified | exclusion | lib_layout | spot_length | total_spots | total_bases | size | private_file | runtime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| full seqkit, single | flower | trimmed-PAuF | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | no path | yes | yes | no | single | 148.4 | 31190694 | 4627300802 | 2319701462 | yes | 46s |
| full seqkit, paired | flower | trimmed-PAuF | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_2.fq.gz | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | yes | yes | no | paired | 148.3 | 31190694 | 9253736528 | 2445176519 | yes | 46s |
| head seqkit, single | flower | trimmed-PAuF | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | no path | yes | yes | no | single | 148.4 | 31190694 | 4616222712 | 2319701462 | yes | 27s |
| head seqkit, paired | flower | trimmed-PAuF | /Users/s229181/Desktop/projects/sanity_test_wd/trimmed-PAuF_2.fq.gz | /Users/s229181/Desktop/seq_data/Data/SD/getfastq/trimmed-PAuF_1.fq.gz | yes | yes | no | paired | 148.4 | 31190694 | 9232445424 | 2445176519 | yes | 27s |

As expected, the total bases are not completely accurate. For the paired samples, they differ by about 21 million bases. This sounds like a lot, but it's just a 0.2% difference. The loss of accuracy may be well worth the gain in speed. This particular sample is already trimmed and fairly small, so the old runs didn't take too long. I'll test this with larger files, where I know that the old version took a lot longer per sample.
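The 0.2% figure can be verified from the paired rows of the comparison above:

```python
# Paired-sample totals from the comparison table above.
old_total = 9253736528  # full seqkit (exact)
new_total = 9232445424  # head seqkit (estimated)

diff = old_total - new_total  # ~21.3 million bases
rel_err = diff / old_total    # ~0.0023, i.e. about 0.23%
```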

Hego-CCTB commented 2 years ago

As a side note, this process could be made even faster by using GNU parallel with zcat. However, we run into the macOS problem again: parallel needs to be installed via brew first.

I will test the potential gain in speed, and if it's significant, I'll implement a branch that uses parallel if it's installed and plain zcat if not.

UPDATE ON THIS:

Looks like .gz decompression cannot be parallelized (a gzip stream has to be decompressed sequentially; even pigz only parallelizes compression). So zcat is as fast as it gets currently.

Hego-CCTB commented 2 years ago

So, the new implementation processed 21 samples in 1254 seconds (~21 minutes).

Those same 21 samples took 6637 seconds (~1 hour and 51 minutes) with the old version.

Update pushed in 37fc28761c4e1872e131c4fce7b099efb0ca2152

kfuku52 commented 2 years ago

Thank you. It seems that you removed the old implementation, but the change should have been introduced as an option, with the faster mode as the default, e.g. --accurate_size defaulting to 'no'. Could you restore the original, slow but completely accurate mode as an option?

kfuku52 commented 2 years ago

When we introduce a change that is not a pure improvement and sacrifices some aspect (here, the accuracy), we should be able to choose it upon use.

Hego-CCTB commented 2 years ago

Of course, that makes a lot of sense! On it!

Hego-CCTB commented 2 years ago

Reminder for myself to reintroduce the slow method.

Hego-CCTB commented 1 year ago

`--accurate_size yes|no` is now a parameter to run the slow, but more accurate, full seqkit.

https://github.com/kfuku52/amalgkit/commit/acff42d3c652bee8efc95ccaf8993afa303588d9
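For illustration, the flag could be wired up with argparse roughly like this (a sketch with the fast mode as default; the flag name matches the thread, but the surrounding code is not amalgkit's actual parser):

```python
import argparse

# Illustrative only: fast estimation is the default, the exact full-seqkit
# pass must be requested explicitly with --accurate_size yes.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--accurate_size", choices=["yes", "no"], default="no",
    help="yes: run seqkit over the whole file (slow, exact); "
         "no: estimate totals from the first reads (fast, approximate)",
)

args = parser.parse_args([])  # no flag given -> fast mode
```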