Open VeritasJoker opened 1 year ago
`sentence` is identified using the `Speaker` column. Sadly, in podcast, there is only one speaker. Is there a way to identify the utterance?
I think an utterance is identified when the speaker switches, but we just store it inside the `sentence` column. I am asking if we already have a way in the code to identify individual sentences. For podcast, there is only one utterance but a couple hundred sentences.
Okay, I repeat my question with utterance and sentence swapped. :)
lol I'm guessing we don't have it in the codebase then. I have this hack from another project that recognizes words as the end of a sentence. I basically check if the last character in `word` is one of `!?.`, or alternatively, we can check if a token is one of `!?.`.
This results in about 400 sentences for podcast, which seems reasonable. Do you think this would work?
df.word[i][-1] in "!?." or i == len(df.index) - 1
I understand the first condition but not the second one. Could you please clarify? Is it for the "last" sentence in the podcast?
Oh yeah, this is in case the last few words got cut off. I don't think we need it here.
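A minimal sketch of the punctuation check discussed above, written as a plain-Python helper so it runs without the datum. The function name `assign_sentence_idx` and the 1-based numbering are assumptions for illustration, not something that exists in the codebase. It treats a word as closing a sentence when its last character is one of `!?.`, which also covers the case where the punctuation mark is its own token. Trailing words that never hit a terminator simply keep the last open index, so the second condition is not needed here.

```python
SENTENCE_END = "!?."  # characters treated as sentence terminators


def assign_sentence_idx(words):
    """Return a 1-based sentence index for each word (hypothetical helper)."""
    indices = []
    current = 1
    for word in words:
        indices.append(current)
        # A word closes the current sentence when its last character
        # is one of ! ? . (empty strings are skipped safely).
        if word and word[-1] in SENTENCE_END:
            current += 1
    return indices


words = ["So", "there", "I", "was.", "Really?", "Yes", "!", "and", "then"]
print(assign_sentence_idx(words))  # [1, 1, 1, 1, 2, 3, 3, 4, 4]
```

With a datum this could presumably be applied as `df["sentence_idx"] = assign_sentence_idx(df.word)`, assuming every entry in `word` is a string. Note that this heuristic will also split on abbreviations such as "Dr." and will miss sentences that end in a closing quote after the period.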
@hvgazula
We are trying to use BERT to generate podcast embeddings and want to feed in per-sentence input.
Do we have actual `sentence` and `sentence_idx` in the datum? Right now, the `sentence` and `sentence_idx` columns we have in the datums are for utterances. For podcast, we have `sentence = None` and `sentence_idx = 1` for all tokens.
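For the per-sentence input, here is a rough sketch of how tokens could be grouped into sentence strings once a real per-sentence index exists. The helper name `sentences_from_tokens` is hypothetical, and it assumes tokens that share an index are contiguous. With the current podcast datum, where `sentence_idx = 1` for every token, it would return a single string containing the whole podcast.

```python
from itertools import groupby


def sentences_from_tokens(tokens, sentence_idx):
    """Join consecutive tokens that share a sentence index (hypothetical helper)."""
    pairs = zip(sentence_idx, tokens)
    return [
        " ".join(token for _, token in group)
        for _, group in groupby(pairs, key=lambda pair: pair[0])
    ]


tokens = ["Hello", "world.", "How", "are", "you?"]
idx = [1, 1, 2, 2, 2]
print(sentences_from_tokens(tokens, idx))  # ['Hello world.', 'How are you?']
```

Each resulting string could then be passed to the tokenizer as one input, which keeps the sentence boundaries out of the model code.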