non-DNA alphabet - Githubissues

jltsiren / bwt-merge

A tool for merging large BWTs

MIT License

26 stars 3 forks source link

non-DNA alphabet #3

Open ekg opened 7 years ago

ekg commented 7 years ago

Does bwt-merge work with BWTs from non-DNA alphabets?

Given that it takes SGA's BWT format as input this would presumably require SGA to support non-DNA as well or another BWT format?

I'd like to build a full-text index of all wikipedia revisions. This is something like 100TB uncompressed and it's not clear to me if there is a better way to do this than using an approach based on BWT merging. But if there is I'd be curious to know.

jltsiren commented 7 years ago

BWT-merge assumes alphabet size 6. Byte alphabet would require changes to the BWT encoding, the rank structure, and the optimizations in trie traversal.

Building a BWT for 100 terabytes of data would probably take around 10 CPU years. If you want a single FM-index, you also need enough memory for it on a single system, and the construction will take months. If you can live with p indexes, you can distribute the construction to p systems.