Open ekg opened 7 years ago
BWT-merge assumes alphabet size 6. Byte alphabet would require changes to the BWT encoding, the rank structure, and the optimizations in trie traversal.
Building a BWT for 100 terabytes of data would probably take around 10 CPU years. If you want a single FM-index, you also need enough memory for it on a single system, and the construction will take months. If you can live with p indexes, you can distribute the construction to p systems.
Does bwt-merge work with BWTs from non-DNA alphabets?
Given that it takes SGA's BWT format as input this would presumably require SGA to support non-DNA as well or another BWT format?
I'd like to build a full-text index of all wikipedia revisions. This is something like 100TB uncompressed and it's not clear to me if there is a better way to do this than using an approach based on BWT merging. But if there is I'd be curious to know.