Closed: karel-brinda closed this issue 2 years ago.
I see two possible solutions:
This is the consequence:
I.e., xz becomes the bottleneck (since the COBS output is so large): it takes 100% of a CPU and blocks the rest of the pipeline.
And these are the resulting compression ratios, i.e., the output is really huge:
As we know the batch and the read file from the output, we could replace read/ref names by their order numbers (e.g., read 5 is the 5th read of the FASTA/Q file, ref 2 is the 2nd reference in the batch). Then we would transform this:
```
*ERR9030361.6 0e97d92c-08c1-46f1-b5e4-3ec910b3a6c8 438
nrdosklposhzgngrjfjmugywullybscunmvnndqn_SAMEA1119362 521
meaiikupcxcxkarweghrbclaftqgrsroxghfbuvo_SAMEA1119775 516
nrlkfgukkjjraeegxjfuhqchsfldmwthfzvwazed_SAMEA1118631 516
lmqbqabdmzzwgapoixouvhmrokunxxjgdcaylhtm_SAMEA1102577 515
```
into something like this:
```
*5 438
3 521
20 516
82 516
100 515
```
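The renaming step above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual implementation: the function name `compress_match_lines`, the dict-based lookup, and the assumed line layout (a `*<read_name> <extra_id> <score>` header followed by `<ref_name> <score>` match lines, as in the examples above) are all assumptions.

```python
def compress_match_lines(lines, read_order, ref_order):
    """Replace read/ref names in match lines by their 1-based order numbers.

    read_order: dict mapping read name -> its position in the FASTA/Q file
    ref_order:  dict mapping ref name  -> its position in the batch
    """
    out = []
    for line in lines:
        if line.startswith("*"):
            # Header line: "*<read_name> <extra_id> <score>"
            # Keep only the read's order number and the score.
            fields = line[1:].split()
            out.append(f"*{read_order[fields[0]]} {fields[-1]}")
        else:
            # Match line: "<ref_name> <score>"
            name, score = line.split()
            out.append(f"{ref_order[name]} {score}")
    return out
```

Since the order numbers are recoverable from the batch and the read file, the translation is lossless, and the lines shrink from ~50-character identifiers to a few digits each.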
COBS sample identifiers are very long, which makes all the auxiliary files very large; see an example from my current experiment:
I.e., overall, 47 GB to iterate through.
Individual files look like:
xz manages to compress the identifiers quite well, but it wastes a lot of CPU resources doing so.