hassonlab / 247-pickling

Contains code to create pickles from raw/processed data

Actual sentence and sentence_idx #158

Open VeritasJoker opened 1 year ago

VeritasJoker commented 1 year ago

@hvgazula

We are trying to use BERT to generate podcast embeddings and want to feed in per-sentence input.

Do we have actual sentence and sentence_idx in the datum? Right now, the sentence and sentence_idx columns we have in the datums are for utterances. For podcast, we have sentence = None and sentence_idx = 1 for all tokens.

hvgazula commented 1 year ago

sentence is identified using the Speaker column. Sadly, in podcast, there is only one speaker. Is there a way to identify the utterance?

VeritasJoker commented 1 year ago

I think an utterance is identified when the speaker switches, but we just store it in the sentence column. I am asking whether we already have a way in the code to identify individual sentences. For podcast, there is only one utterance but a couple hundred sentences.
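For reference, the speaker-switch rule described above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper name `add_utterance_idx`, the column name `speaker`, and the output column `utterance_idx` are all assumptions.

```python
import pandas as pd


def add_utterance_idx(df: pd.DataFrame, speaker_col: str = "speaker") -> pd.DataFrame:
    """Number utterances by counting speaker changes (hypothetical helper)."""
    out = df.copy()
    # True on every row whose speaker differs from the previous row's speaker.
    switched = out[speaker_col] != out[speaker_col].shift()
    # The first row always counts as a switch, so numbering starts at 1.
    out["utterance_idx"] = switched.cumsum()
    return out
```

With a single speaker, as in podcast, every token ends up with the same index, which is why this rule alone cannot recover sentences.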

hvgazula commented 1 year ago

Okay, I repeat my question with utterance and sentence swapped. :)

VeritasJoker commented 1 year ago

lol I'm guessing we don't have it in the codebase then. I have this hack from another project that recognizes words that end a sentence. I basically check whether the last character of a word is one of "!", "?", or "."; alternatively, we can check whether a token is itself one of those characters.

This results in about 400 sentences for podcast, which seems reasonable. Do you think this would work?

hvgazula commented 1 year ago

df.word[i][-1] in "!?." or i == len(df.index) - 1

I understand the first condition but not the second one. Could you please clarify? Is it for the "last" sentence in the podcast?

VeritasJoker commented 1 year ago

Oh yeah, this is in case the last few words got cut off. I don't think we need it here.
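The punctuation check discussed in this thread could be turned into per-token sentence columns along these lines. This is only a sketch under stated assumptions: the helper name `add_sentence_columns` is hypothetical, the datum is assumed to be a pandas DataFrame with a `word` column, and the vectorized form here stands in for the row-by-row check quoted above rather than reproducing the original hack.

```python
import pandas as pd

SENTENCE_END = ("!", "?", ".")


def add_sentence_columns(df: pd.DataFrame, word_col: str = "word") -> pd.DataFrame:
    """Assign sentence_idx and sentence per token (hypothetical helper)."""
    out = df.reset_index(drop=True)
    words = out[word_col].astype(str)
    # A token closes a sentence when its last character is !, ? or .
    # (this also covers tokens that are themselves one of those characters).
    ends = pd.Series(
        [w.endswith(SENTENCE_END) for w in words], index=out.index, dtype=bool
    )
    # Each token's index is the number of sentence ends seen before it, plus 1.
    out["sentence_idx"] = ends.cumsum().shift(fill_value=0) + 1
    # Join the words of each sentence and broadcast the text to every token.
    out["sentence"] = out.groupby("sentence_idx")[word_col].transform(
        lambda s: " ".join(s.astype(str))
    )
    return out
```

In this form, trailing words with no closing punctuation fall into their own final sentence automatically, so no separate last-row condition is required. Abbreviations ending in a period would still be split incorrectly, so the resulting sentence count is worth a sanity check.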