COBS identifiers too long

karel-brinda commented 2 years ago

COBS sample identifiers are very long, which makes all auxiliary files very large. For example, from my current experiment:

$ xz -l intermediate/01_match/*.xz | tail
    1       1  3,076.6 KiB  8,880.1 KiB  0.346  CRC64   intermediate/01_match/vibrio_parahaemolyticus__01____ERR9030361.xz
    1       1  3,068.7 KiB  8,873.7 KiB  0.346  CRC64   intermediate/01_match/vibrio_shilonii__01____ERR9030361.xz
    1       1  3,068.7 KiB  8,873.7 KiB  0.346  CRC64   intermediate/01_match/vibrio_vulnificus__01____ERR9030361.xz
    1       1  3,068.7 KiB  8,873.7 KiB  0.346  CRC64   intermediate/01_match/wolbachia_endosymbiont_of_drosophila_melanogaster__01____ERR9030361.xz
    1       1  3,068.7 KiB  8,873.7 KiB  0.346  CRC64   intermediate/01_match/xanthomonas_oryzae__01____ERR9030361.xz
    1       1  3,068.7 KiB  8,873.7 KiB  0.346  CRC64   intermediate/01_match/yersinia_enterocolitica__01____ERR9030361.xz
    1       1  3,068.7 KiB  8,873.7 KiB  0.346  CRC64   intermediate/01_match/yersinia_pestis__01____ERR9030361.xz
    1       1  3,068.7 KiB  8,873.7 KiB  0.346  CRC64   intermediate/01_match/yersinia_pseudotuberculosis__01____ERR9030361.xz
-------------------------------------------------------------------------------
  305     305  2,249.8 MiB     47.0 GiB  0.047  CRC64   305 files

Overall, that is 47 GiB of uncompressed data to iterate through.

Individual files look like:

*ERR9030361.6 0e97d92c-08c1-46f1-b5e4-3ec910b3a6c8      438
nrdosklposhzgngrjfjmugywullybscunmvnndqn_SAMEA1119362   521
meaiikupcxcxkarweghrbclaftqgrsroxghfbuvo_SAMEA1119775   516
nrlkfgukkjjraeegxjfuhqchsfldmwthfzvwazed_SAMEA1118631   516
lmqbqabdmzzwgapoixouvhmrokunxxjgdcaylhtm_SAMEA1102577   515
lpuytxenlgiejljhwmtxvrjqdcyjczusifchlpfu_SAMEA1118919   515
nkuhbazbvdagxeyviwnhxqvjouphlqsknyejkzye_SAMEA1101746   515
ntledfmsuqgclmqrektsofzlligfgndvpmlsnmqv_SAMEA1406081   515
lifumdqaapvzscefeokiybgxczkeslpzampzlgnh_SAMEA1117987   514
lifvszdfuaqexdytrwrfcnakbadfrddjzjgbqvgd_SAMEA1117873   514
lnvttadimftxfkwlyrgjtnszlelkrgzxakxowpfx_SAMEA1118416   514
lqpgobhsizcxznjywmfkrubyohokmkixiwmlfxjn_SAMEA1406127   514
lvlzjavfsfsuhuzqegyinpjifkbuakhlywxdrkmx_SAMEA1117944   514
mbiutijytifolmbdioabikxysfocwgaieyarbngp_SAMEA1118251   514
mbovuoywybhblqolqbewdpdsetgpemnykrrjuzgo_SAMEA1119288   514
mbuoggisyqgktytieenoucaylydewhrzqpiayujn_SAMEA1397475   514
mclxwbvoqnjjqyeknavenhreafzusugokarexuzw_SAMEA1118252   514
miirdimvzlaqwtjruxmzbvrmbdkrtpuiozdacdbb_SAMEA1117778   514
mjgpwkpufposaqkqcmkofgfazitaoffahyqakuoa_SAMEA1119914   514
mjrrqdwgwsghjsixhoengpwupdfcmaurwubmfikp_SAMEA1119866   514
mrssljnodortibrmxuussdhdgfeeveewtifqohmu_SAMEA1118384   514
mvmeksixrytmkgkzngnrlkjsgjbwrjpytwhxawmz_SAMEA1397254   514
mwqdqqtphliyduoireqbaeckkauzgissmlqiutfq_SAMEA1117907   514
mxrbsjfpjiabxososfbkmifadyyuyfoijpxvtngx_SAMEA1102217   514
myxdodgfjqhfvapygokmyiqfxblhntglshhabfuf_SAMEA1119513   514

xz manages to compress the identifiers quite well, but compressing and decompressing them wastes a lot of resources.

karel-brinda commented 2 years ago

I see two possible solutions:

Putting minimum-size identifiers into the COBS indexes (better in the long term)
Having a post-filtering script on the COBS output (a hot-fix, which could be sufficient for now)
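
A minimal sketch of what such a post-filtering hot-fix could look like in Python. It assumes that lines starting with `*` are query headers, that every other line is a reference identifier followed by a count, and that the part after the last underscore (e.g. `SAMEA1119362`) is enough to identify the sample. The function names are hypothetical and not part of COBS or Phylign.

```python
def shorten_ref(name: str) -> str:
    """Keep only the text after the last underscore.

    Assumption: this suffix (e.g. a SAMEA accession) identifies the sample.
    """
    return name.rsplit("_", 1)[-1]


def shorten_cobs_output(lines):
    """Yield COBS-style output lines with shortened reference identifiers."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("*"):
            # Query header line: passed through unchanged.
            yield line
        else:
            # Match line: "<reference identifier><whitespace><count>"
            name, count = line.rsplit(None, 1)
            yield f"{shorten_ref(name)}\t{count}"
```

Such a generator could sit between the COBS call and xz, e.g. by iterating over `sys.stdin` and printing each yielded line, so that only the shortened identifiers get compressed.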

karel-brinda commented 2 years ago

This is the consequence:

xz becomes the bottleneck because the COBS output is too long: it takes 100% of CPU and blocks the rest.

karel-brinda commented 2 years ago

And these are the resulting ratios, i.e., the output is really huge.

leoisl commented 2 years ago

Since we know the batch and the read file from the output, we could replace read and reference names by their order numbers (e.g. read 5 is the 5th read of the FASTA/FASTQ file, ref 2 is the 2nd reference in the batch). Then we would transform this:

*ERR9030361.6 0e97d92c-08c1-46f1-b5e4-3ec910b3a6c8      438
nrdosklposhzgngrjfjmugywullybscunmvnndqn_SAMEA1119362   521
meaiikupcxcxkarweghrbclaftqgrsroxghfbuvo_SAMEA1119775   516
nrlkfgukkjjraeegxjfuhqchsfldmwthfzvwazed_SAMEA1118631   516
lmqbqabdmzzwgapoixouvhmrokunxxjgdcaylhtm_SAMEA1102577   515

into something like this:
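
As a rough illustration only, the renumbering idea could be sketched in Python as follows. The output layout (`*<read number>` for headers, `<ref number>` for matches) is an assumption made for this sketch, as is the premise that every read produces exactly one header line in input order. `renumber` and `ref_order` are hypothetical names.

```python
def renumber(lines, ref_order):
    """Replace read and reference names by 1-based order numbers.

    ref_order: reference names in the order they appear in the batch
               (assumed to be known from the index).
    """
    ref_index = {name: i for i, name in enumerate(ref_order, start=1)}
    read_no = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split off the trailing count; the name may itself contain spaces.
        name, count = line.rsplit(None, 1)
        if name.startswith("*"):
            # Assumption: headers appear once per read, in read-file order.
            read_no += 1
            yield f"*{read_no}\t{count}"
        else:
            yield f"{ref_index[name]}\t{count}"
```

The original names remain recoverable from the read file and the batch's reference list, so no information is lost as long as both orderings are stable.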

karel-brinda / Phylign

COBS identifiers too long #147