shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

TODO: save the search result into a serializing binary file for fast downstream parsing #40

Open shenwei356 opened 10 months ago

shenwei356 commented 10 months ago

The current tab-delimited search result format is redundant and inefficient for parsing in kmcp profile. So we can use a compact binary format to save the temporary result.

  1. kmcp search: a flag -b/--binary-output would be added to choose the output format optionally.
  2. A new command kmcp view should be added to convert the binary to plain text format.
  3. kmcp merge needs to be compatible with both plain and binary formats.
  4. kmcp profile needs to be compatible with both plain and binary formats.

#query qLen qKmers FPR hits target chunkIdx chunks tLen kSize mKmers qCov tCov jacc queryIdx
read_1 150 130 7.4626e-15 1 GCF_000007805.1 2 10 6397126 21 130 1.0000 0.0002 0.0002 0
read_2 150 130 7.4626e-15 1 GCF_000007805.1 8 10 6397126 21 130 1.0000 0.0002 0.0002 1
read_3 150 130 7.4626e-15 1 GCF_000003835.1 8 10 12115052 21 130 1.0000 0.0001 0.0001 2
read_4 150 130 7.4626e-15 1 GCF_000003835.1 3 10 12115052 21 130 1.0000 0.0001 0.0001 3

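To make the proposal concrete, here is a minimal sketch of how one row of the text format could be packed into a compact binary record and converted back, which is roughly what a kmcp view command would have to do. The byte layout (length-prefixed names followed by a fixed-width numeric block) and the function names `encode_row` and `decode_row` are assumptions made up for illustration; the issue does not define kmcp's actual binary format.

```python
import struct

# Hypothetical layout (not kmcp's real format), little-endian, no padding:
# qLen, qKmers: uint32 | FPR: float64 | hits, chunkIdx, chunks, tLen, kSize,
# mKmers: uint32 | qCov, tCov, jacc: float32 | queryIdx: uint32
FIXED = struct.Struct("<IId6I3fI")


def encode_row(line: str) -> bytes:
    """Pack one whitespace-delimited search-result row into bytes."""
    f = line.split()
    query, target = f[0].encode(), f[5].encode()
    fixed = FIXED.pack(
        int(f[1]), int(f[2]), float(f[3]),
        int(f[4]), int(f[6]), int(f[7]), int(f[8]), int(f[9]), int(f[10]),
        float(f[11]), float(f[12]), float(f[13]),
        int(f[14]),
    )
    # Variable-length names are stored with a uint16 length prefix.
    return (struct.pack("<H", len(query)) + query
            + struct.pack("<H", len(target)) + target + fixed)


def decode_row(buf: bytes) -> str:
    """Unpack a record produced by encode_row into a tab-delimited row."""
    pos, names = 0, []
    for _ in range(2):
        (n,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        names.append(buf[pos:pos + n].decode())
        pos += n
    (qlen, qkmers, fpr, hits, chunk_idx, chunks, tlen, ksize, mkmers,
     qcov, tcov, jacc, qidx) = FIXED.unpack_from(buf, pos)
    return "\t".join([
        names[0], str(qlen), str(qkmers), f"{fpr:.4e}", str(hits), names[1],
        str(chunk_idx), str(chunks), str(tlen), str(ksize), str(mkmers),
        f"{qcov:.4f}", f"{tcov:.4f}", f"{jacc:.4f}", str(qidx),
    ])


row = "read_1\t150\t130\t7.4626e-15\t1\tGCF_000007805.1\t2\t10\t6397126\t21\t130\t1.0000\t0.0002\t0.0002\t0"
record = encode_row(row)
print(len(record), decode_row(record) == row)
```

A real format would likely go further, for example by storing each target name once in a lookup table instead of repeating it in every record, since the same reference appears in many rows.
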
ericvdtoorn commented 10 months ago

> The current tab-delimited search result format is redundant and inefficient for parsing in kmcp profile. So we can use a compact binary format to save the temporary result.
>
>   1. kmcp search: a flag -b/--binary-output would be added to choose the output format optionally.

Would it not be better to infer from the output extension which is usually specified? Make it a .kmcp file or something similar.
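
As a rough sketch of this suggestion, the output format could be chosen from the file name alone. The `.kmcp` extension comes from the comment above; the function name `output_format`, the handling of a trailing `.gz`, and the fallback to text are assumptions for illustration, not existing kmcp behavior.

```python
from pathlib import Path

BINARY_EXT = ".kmcp"  # hypothetical extension, as floated in the comment above


def output_format(path: str) -> str:
    """Return "binary" if the output path ends in .kmcp (optionally .gz)."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Assumption: a trailing .gz only signals compression, so look past it.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if suffixes and suffixes[-1] == BINARY_EXT:
        return "binary"
    return "text"


for name in ["result.kmcp", "result.kmcp.gz", "result.tsv.gz", "-"]:
    print(name, output_format(name))
```

One design consequence: output written to stdout (`-`) has no extension, so an explicit flag or a default would still be needed for that case.
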

ericvdtoorn commented 10 months ago

|#query|qLen|qKmers|FPR       |hits|target         |chunkIdx|chunks|tLen    |kSize|mKmers|qCov  |tCov  |jacc  |queryIdx|
|:-----|:---|:-----|:---------|:---|:--------------|:-------|:-----|:-------|:----|:-----|:-----|:-----|:-----|:-------|
|read_1|150 |130   |7.4626e-15|1   |GCF_000007805.1|2       |10    |6397126 |21   |130   |1.0000|0.0002|0.0002|0       |
|read_2|150 |130   |7.4626e-15|1   |GCF_000007805.1|8       |10    |6397126 |21   |130   |1.0000|0.0002|0.0002|1       |
|read_3|150 |130   |7.4626e-15|1   |GCF_000003835.1|8       |10    |12115052|21   |130   |1.0000|0.0001|0.0001|2       |
|read_4|150 |130   |7.4626e-15|1   |GCF_000003835.1|3       |10    |12115052|21   |130   |1.0000|0.0001|0.0001|3       |

Empirically, few of these fields would require an int64 (at least none came close to the int32 limit in a practical file), so using narrower integers could also be a potential space saving.

Edit: meant that int32 would probably be enough rather than int64
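
The arithmetic behind this point can be checked with a short sketch. It takes the integer columns of the read_3 example row, confirms that they fit in a signed 32-bit integer, and compares the bytes needed per record at each width. Treating all nine integer columns as one fixed-width block is an assumption for illustration only.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647

# Integer columns of the read_3 example row.
row = {"qLen": 150, "qKmers": 130, "hits": 1, "chunkIdx": 8, "chunks": 10,
       "tLen": 12115052, "kSize": 21, "mKmers": 130, "queryIdx": 2}

fits_int32 = all(0 <= v <= INT32_MAX for v in row.values())
size_int64 = struct.calcsize(f"<{len(row)}q")  # nine 8-byte integers
size_int32 = struct.calcsize(f"<{len(row)}i")  # nine 4-byte integers
print(fits_int32, size_int64, size_int32)  # True 72 36
```

Values above roughly 2.1 billion would not fit in a signed int32, so fields such as tLen or queryIdx would still need a range check, or a wider type, for very large targets or very large read sets.
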

shenwei356 commented 10 months ago

> Would it not be better to infer from the output extension which is usually specified? Make it a .kmcp file or something similar.

Yes, we can make the binary format the default output, and make the plain text format optional.

> Empirically, few of these fields would require an int32 (at least none were close in a practical file) so that could also be potential space saving

Right. I'll carefully consider it later. Thank you.