Open arainey2022 opened 1 year ago
Right now, it splits text into chunks of < 100 "tokens", where a token is approximately a character. It's really basic and can definitely be improved.
Relevant code: https://github.com/transitive-bullshit/yt-semantic-search/blob/main/src/server/openai.ts#L40
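For anyone skimming, here is a rough sketch of what that kind of chunking can look like. It is not the code at the link above: `chunkText` and `maxTokens` are hypothetical names, and the choice to break on whitespace (and to hard-split words that are too long) is an assumption made for illustration. A "token" is approximated as one character, as described above.

```typescript
// Hypothetical helper for illustration only; not the implementation in openai.ts.
// Splits text into chunks shorter than `maxTokens`, where one "token" is
// approximated as one character. Breaks on whitespace where possible.
function chunkText(text: string, maxTokens: number = 100): string[] {
  if (maxTokens < 2) {
    throw new Error("maxTokens must be at least 2");
  }

  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  let current = "";

  for (const word of words) {
    const candidate = current.length === 0 ? word : `${current} ${word}`;

    // Keep growing the current chunk while it stays under the limit.
    if (candidate.length < maxTokens) {
      current = candidate;
      continue;
    }

    // The next word does not fit: close the current chunk.
    if (current.length > 0) {
      chunks.push(current);
    }

    // A single word at or over the limit is hard-split into pieces.
    let rest = word;
    while (rest.length >= maxTokens) {
      chunks.push(rest.slice(0, maxTokens - 1));
      rest = rest.slice(maxTokens - 1);
    }
    current = rest;
  }

  if (current.length > 0) {
    chunks.push(current);
  }
  return chunks;
}
```

Calling `chunkText(transcript)` with the default limit would give chunks of fewer than 100 characters each. Counting characters is only a stand-in for real tokenization, so chunk sizes measured in model tokens will differ.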
Appreciate you sharing that! I've taken a similar approach for now as well. Paragraph splitting just didn't work out very well when I tried it...
Hi there, love what you've built. Very cool use case for a great podcast :)
I was wondering, how did you split up the transcripts text? Did you experiment with sentences, paragraphs or just text blocks?
I'm starting to play with this, but I keep finding different best practices for text splitting.