Ideally, the corpus would be the actual body of messages in the guild. That is expensive to generate, because regenerating it means fetching every message from every channel: fetch the most recent message, then the 100 messages before it, and repeat until no messages are left, per channel.
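The backwards pagination described above could be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: `fetch_page` is a hypothetical callable standing in for whatever the chat API client provides, `make_fake_fetch` is an in-memory stand-in used only to make the sketch runnable, and messages are assumed to be dicts with numeric, increasing `id` values. For simplicity the first call fetches the newest page directly rather than a single most recent message.

```python
from typing import Callable, Dict, Iterator, List, Optional

PAGE_SIZE = 100  # page size mentioned in the notes above

# Hypothetical signature: fetch_page(channel_id, before, limit) returns up to
# `limit` messages older than `before` (or the newest ones when `before` is
# None), ordered newest first.
FetchPage = Callable[[str, Optional[int], int], List[Dict]]


def iter_channel_history(channel_id: str, fetch_page: FetchPage) -> Iterator[Dict]:
    """Yield every message in a channel, newest first, one page at a time."""
    before: Optional[int] = None
    while True:
        page = fetch_page(channel_id, before, PAGE_SIZE)
        if not page:
            return  # no messages left in this channel
        yield from page
        before = page[-1]["id"]  # oldest message seen so far


def make_fake_fetch(messages_by_channel: Dict[str, List[Dict]]) -> FetchPage:
    """In-memory stand-in for the real API, for illustration only."""
    def fetch_page(channel_id: str, before: Optional[int], limit: int) -> List[Dict]:
        msgs = sorted(messages_by_channel.get(channel_id, []),
                      key=lambda m: m["id"], reverse=True)
        if before is not None:
            msgs = [m for m in msgs if m["id"] < before]
        return msgs[:limit]
    return fetch_page
```

The cost is one request per 100 messages per channel, plus one final empty request to detect the end, which is why a full rebuild gets slow for a busy guild.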
We could store the TF-IDF index as a .json file on the VM and only regenerate it when necessary. The bot might miss messages whenever it is offline, so to be safe we'd regenerate the index after any downtime. Another option would be a Firestore-style DB that tracks, for each channel, the most recent imported message; when the bot boots, it can see which channels appear to have new messages and fetch only those. We'd still need to regenerate the whole index occasionally, for example if we change how we stem words or need some other metadata, but that should be reasonably rare.
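A rough sketch of how the two ideas could combine is shown below. It rests on several assumptions: the per-channel checkpoint lives in the same .json file rather than in a separate DB, `fetch_after` is a hypothetical callable that returns messages newer than a given id (oldest first), and the "index" only tracks document frequencies as a stand-in for a full TF-IDF structure. `INDEX_VERSION` is an illustrative way to force a rebuild when stemming or metadata changes.

```python
import json
import os
from typing import Callable, Dict, List, Optional

INDEX_VERSION = 1  # bump when stemming or stored metadata changes


def load_state(path: str) -> dict:
    """Load the saved index, or start fresh if it is missing or outdated."""
    fresh = {"version": INDEX_VERSION, "last_seen": {}, "doc_freq": {}, "doc_count": 0}
    if not os.path.exists(path):
        return fresh
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    # A version mismatch means the stored index cannot be trusted: rebuild.
    return state if state.get("version") == INDEX_VERSION else fresh


def save_state(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def catch_up(state: dict, channel_id: str,
             fetch_after: Callable[[str, Optional[int]], List[Dict]]) -> None:
    """Import only the messages newer than the channel's checkpoint."""
    last = state["last_seen"].get(channel_id)
    for msg in fetch_after(channel_id, last):  # assumed oldest first
        # Naive whitespace tokenizer; real code would stem and normalize.
        for term in set(msg["content"].lower().split()):
            state["doc_freq"][term] = state["doc_freq"].get(term, 0) + 1
        state["doc_count"] += 1
        state["last_seen"][channel_id] = msg["id"]
```

On boot, the bot would call `load_state`, run `catch_up` for each channel, and then `save_state`. A fresh state has no checkpoints, so the same code path also performs the full rebuild.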