Question (not an issue): How are you splitting transcript text?

transitive-bullshit / yt-semantic-search

OpenAI-powered semantic search for any YouTube playlist – featuring the All-In Podcast. 💪

https://all-in-on-ai.vercel.app

MIT License

519 stars, 44 forks

Question (not an issue): How are you splitting transcript text? #3

Open: arainey2022 opened 1 year ago

arainey2022 commented 1 year ago

Hi there, love what you've built. Very cool use case for a great podcast :)

I was wondering, how did you split up the transcript text? Did you experiment with sentences, paragraphs, or just text blocks?

I'm starting to play with this, but I keep finding different best practices for text splitting.

transitive-bullshit commented 1 year ago

Right now, it splits text into chunks of fewer than 100 "tokens", where a token is approximately a character. It's really basic and can definitely be improved.

Relevant code: https://github.com/transitive-bullshit/yt-semantic-search/blob/main/src/server/openai.ts#L40
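
For anyone reading along, here is a rough sketch of what that kind of character-based chunking could look like. This is not the code from the linked file; `splitIntoChunks` and its `maxChars` parameter are hypothetical names made up for illustration, and the sketch assumes one character counts as one "token", as described above.

```typescript
// Hypothetical helper, NOT the repository's actual implementation.
// Greedily packs whitespace-separated words into chunks that are
// strictly shorter than `maxChars` characters (1 char ~ 1 "token").
function splitIntoChunks(text: string, maxChars: number = 100): string[] {
  if (maxChars < 2) {
    throw new RangeError("maxChars must be at least 2");
  }

  const words: string[] = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  let current = "";

  for (const word of words) {
    const candidate = current ? `${current} ${word}` : word;
    if (candidate.length < maxChars) {
      // Still under the limit: keep growing the current chunk.
      current = candidate;
      continue;
    }

    // Adding this word would reach the limit: flush the current chunk.
    if (current) chunks.push(current);

    // A single word that is itself too long gets hard-split.
    let rest = word;
    while (rest.length >= maxChars) {
      chunks.push(rest.slice(0, maxChars - 1));
      rest = rest.slice(maxChars - 1);
    }
    current = rest;
  }

  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on word boundaries keeps each chunk readable, but it ignores sentence and speaker boundaries, which is one reason an approach like this is only a starting point.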

arainey2022 commented 1 year ago

Appreciate you sharing that! I've taken a similar approach for now as well. Paragraph splitting just didn't work out very well when I tried it...