lh3 / bre

4 stars 0 forks source link

BWT definition + descriptive header information #1

Open jltsiren opened 2 days ago

jltsiren commented 2 days ago

Thank you Heng for starting this work! I have two suggestions:

The format should make the definition of the BWT it uses explicit, because that is also one fundamental source of incompatibility for BWTs built with different tools. Some tools build a multi-string BWT / BCR BWT for conceptually separate strings, while others concatenate the strings (possibly with special separator symbols distinct from the sentinel) and build a single-string BWT for the concatenation.

I'm personally in favor of having some mandatory descriptive information in the header. Information such as alphabet size, the number of strings, the total length of the strings, and whether this is a unidirectional / bidirectional / FMD BWT. That information is meant less for the software and more for the user. Or for the developer trying to debug an issue the user has. I've seen far too many issues caused by the user having the right data structures for the wrong data. Having some basic information easily accessible can speed up debugging considerably.

lh3 commented 2 days ago

We may add an integer field bwt_type in the header to indicate how multi-string BWT is represented:

  1. Undefined
  2. Single string. One sentinel stored. This case includes multi-string concatenation without sentinels, or concatenation with a special symbol.
  3. Multi-string BWT, simple concatenation (sentinels are not distinguished from each other)
  4. Multi-string BWT, input order (sentinels ordered by input). Input order is special in that we can retrieve the $i$-th input sequence.
  5. Multi-string BWT, other order. We do not need to enumerate other orders because downstream tools can't easily use the information anyway. See also Cenzato and Lipták (2024).
  6. eBWT by Mantaci et al (2007). I am not sure if eBWT requires additional information.

For a single-string BWT, we sometimes don't store the sentinel in BWT but instead use one additional integer for its position. Not sure how common this is. If common, we may consider a sentinel_pos field in the header.

Bidirectional BWT uses two separate BWTs. FMD BWT is DNA-specific and can be validated based on A/C/G/T counts. It may be too DNA focused. One option is to define highly specialized bwt_type that mandates alphabet mapping, sentinel order etc. For example, we may use type "11" for FMD multi-string BWT in the input order. Downstream tools will know exactly they can expect. We may also define a new subtype field and/or merge atype into bwt_type/subtype. More thoughts needed...

The number of strings and the total length of the strings may not be easy to obtain upon the creation of a BRE file. For example, grlBWT don't output such information. We cannot convert a grlBWT BWT without reading all data in a second pass. This becomes more complicated when BWT is read from stdin.

jltsiren commented 2 days ago

FMD BWT can be generalized to any alphabet. It just needs a complement function for the symbols. For example, the GBWT is an FMD BWT, where complementation flips node orientation. And by using the identity function, you get bidirectional BWT as a single BWT.