Open jltsiren opened 2 days ago
We may add an integer field bwt_type
in the header to indicate how multi-string BWT is represented:
For a single-string BWT, we sometimes don't store the sentinel in BWT but instead use one additional integer for its position. Not sure how common this is. If common, we may consider a sentinel_pos
field in the header.
Bidirectional BWT uses two separate BWTs. FMD BWT is DNA-specific and can be validated based on A/C/G/T counts. It may be too DNA focused. One option is to define highly specialized bwt_type
that mandates alphabet mapping, sentinel order etc. For example, we may use type "11" for FMD multi-string BWT in the input order. Downstream tools will know exactly they can expect. We may also define a new subtype
field and/or merge atype
into bwt_type
/subtype
. More thoughts needed...
The number of strings and the total length of the strings may not be easy to obtain upon the creation of a BRE file. For example, grlBWT don't output such information. We cannot convert a grlBWT BWT without reading all data in a second pass. This becomes more complicated when BWT is read from stdin.
FMD BWT can be generalized to any alphabet. It just needs a complement function for the symbols. For example, the GBWT is an FMD BWT, where complementation flips node orientation. And by using the identity function, you get bidirectional BWT as a single BWT.
Thank you Heng for starting this work! I have two suggestions:
The format should make the definition of the BWT it uses explicit, because that is also one fundamental source of incompatibility for BWTs built with different tools. Some tools build a multi-string BWT / BCR BWT for conceptually separate strings, while others concatenate the strings (possibly with special separator symbols distinct from the sentinel) and build a single-string BWT for the concatenation.
I'm personally in favor of having some mandatory descriptive information in the header. Information such as alphabet size, the number of strings, the total length of the strings, and whether this is a unidirectional / bidirectional / FMD BWT. That information is meant less for the software and more for the user. Or for the developer trying to debug an issue the user has. I've seen far too many issues caused by the user having the right data structures for the wrong data. Having some basic information easily accessible can speed up debugging considerably.