How to convert transcript offsets to time?

jmatthewpryor commented 2 years ago

Hi & thanks for the library

I am wondering if you can tell me how to convert the offsets into actual time e.g.

  transcripts: [
    {
      start_offset: 52800,
      end_offset: 493440,

Do you know what measurement system those values are in? How would you convert them into seconds or minutes from the start of the recording? Thanks for any help you can provide

omerdn1 commented 2 years ago

Hey @jmatthewpryor, and thank you very much for your interest in the library :)

These values are in milliseconds. If you want to display the minutes and seconds of each offset you could do something like this:

const date = new Date(start_offset);
// UTC getters keep the local time zone from shifting the minutes
console.log(`${date.getUTCMinutes()}:${date.getUTCSeconds()}`);

Let me know if that helps.

jmatthewpryor commented 2 years ago

Thanks @omerdn1 - BTW I also bought you a coffee for the effort !!

I did assume milliseconds, but somehow that doesn't align with what Otter shows on its site. You can see the first three timecodes are at 0:03, 0:31 & 0:47

[Screenshot: Otter.ai transcript of "Bonus Ep Later-Stage Agtech Startup Wrap, feat Anastasia Volkova, Regrow", taken 2022-01-02 at 8:36:38 pm]

If I use the following code

const offsetToTimestamp = (offset?: number) => {
  if (!offset) {
    return "00:00";
  }
  const date = new Date(offset);
  // UTC getters avoid shifts from local time zones with non-hour offsets
  const seconds = date.getUTCSeconds();
  const minutes = date.getUTCMinutes();
  return `${minutes}:${`${seconds}`.padStart(2, "0")}`;
};

I get these results

Successfuly logged in to Otter.ai
{
  text: "Hello and welcome ....",
  start: 52800,
  end: 493440,
  speaker: 'Sarah Nolet'
}
Speaker: Sarah Nolet Start: 0:52 End: 8:13
{
  text: "They invest in ....",
  start: 510720,
  end: 760800,
  speaker: 'Anastasia Volkova'
}
Speaker: Anastasia Volkova Start: 8:30 End: 12:40
{
  text: "That's Anastasia, ....",
  start: 760800,
  end: 1944960,
  speaker: 'Sarah Nolet'
}
Speaker: Sarah Nolet Start: 12:40 End: 32:24

To get times that actually match, I need to do the following:

const offsetToTimestamp = (offset?: number) => {
  if (!offset) {
    return "00:00";
  }
  const totalSeconds = Math.floor(offset / 16000); //@TODO: no idea why this is 16000, lifted code from DVargas Otter Roam extension and it was / 1000
  const seconds = totalSeconds % 60;
  const minutes = Math.floor(totalSeconds / 60);
  return `${minutes}:${`${seconds}`.padStart(2, "0")}`;
};

Which then yields

Successfuly logged in to Otter.ai
{
  text: "Hello and welcome ....",
  start: 52800,
  end: 493440,
  speaker: 'Sarah Nolet'
}
Speaker: Sarah Nolet Start: 0:03 End: 0:30
{
  text: "They invest in ....",
  start: 510720,
  end: 760800,
  speaker: 'Anastasia Volkova'
}
Speaker: Anastasia Volkova Start: 0:31 End: 0:47
{
  text: "That's Anastasia ...",
  start: 760800,
  end: 1944960,
  speaker: 'Sarah Nolet'
}
Speaker: Sarah Nolet Start: 0:47 End: 2:01

The full code for the test pull of a speech is this

async function otterSpeech(options: any) {
  const otterApi = new OtterApi({
    email: `${options.email}`, // Your otter.ai email
    password: `${options.password}`, // Your otter.ai password
  });

  otterApi.init().then(() => {
    otterApi.getSpeech(`${options.speechId}`).then((speech: any) => {
      const the_transcripts = speech.transcripts.map(
        (t: {
          transcript: string;
          start_offset: number;
          end_offset: number;
          speaker_id: string;
        }) => ({
          text: t.transcript,
          start: t.start_offset,
          end: t.end_offset,
          speaker:
            speech.speakers.find((s: { id: any }) => s.id === t.speaker_id)
              ?.speaker_name || "Unknown",
        })
      );
      the_transcripts.forEach((transcript: any) => {
        console.log(transcript);
        console.log(
          `Speaker: ${transcript.speaker} Start: ${offsetToTimestamp(
            transcript.start
          )} End: ${offsetToTimestamp(transcript.end)}`
        );
      });
    });
  });
}
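
The mapping step in the function above can also be pulled out into a pure function, which makes it testable without logging in to Otter. This is only a sketch: the interfaces and the `toRows` helper are assumptions based on the fields used in the snippet above, not types exported by the library, and speaker ids are assumed to be strings.

```typescript
// Field names follow the snippet above; the interfaces are assumptions.
interface RawTranscript {
  transcript: string;
  start_offset: number;
  end_offset: number;
  speaker_id: string;
}

interface Speaker {
  id: string;
  speaker_name: string;
}

interface TranscriptRow {
  text: string;
  start: number;
  end: number;
  speaker: string;
}

// Hypothetical helper: build the id -> name lookup once instead of
// calling find() for every transcript.
function toRows(transcripts: RawTranscript[], speakers: Speaker[]): TranscriptRow[] {
  const speakerNames = new Map<string, string>();
  for (const s of speakers) {
    speakerNames.set(s.id, s.speaker_name);
  }
  return transcripts.map((t) => ({
    text: t.transcript,
    start: t.start_offset,
    end: t.end_offset,
    speaker: speakerNames.get(t.speaker_id) || "Unknown",
  }));
}

// Made-up sample data, shaped like the fields used above.
const rows = toRows(
  [
    { transcript: "first", start_offset: 0, end_offset: 16000, speaker_id: "a" },
    { transcript: "second", start_offset: 16000, end_offset: 32000, speaker_id: "missing" },
  ],
  [{ id: "a", speaker_name: "Speaker A" }]
);
console.log(rows.map((r) => r.speaker)); // [ 'Speaker A', 'Unknown' ]
```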

omerdn1 commented 2 years ago

Thank you for your support! It is very encouraging 🙏

Dividing by 16000 seems arbitrary and wrong, since 1000 ms = 1 s. Your code looks correct, though, which leads me to believe it might be a bug on Otter's side.

Can you share the full response from the getSpeech request?

jmatthewpryor commented 2 years ago

Hey @omerdn1

output of these lines:

  otterApi.init().then(() => {
    otterApi.getSpeech(`${options.speechId}`).then((speech: any) => {
      console.dir(speech);

Attached as ZIP file

speech-22CY3ZGBBUH445LB.json.zip

Cheers Matthew

coolaj86 commented 2 years ago

The 16000 is samples per second: the offsets appear to count audio samples, so dividing by the sample rate gives seconds.

Typical sample rates:

Movies (DVD+): 48 kHz (48000)
Music (CDs): 44.1 kHz (44100)
Audiobooks (M4B): 32 kHz (32000)
Voice Memos (Otter): 16 kHz (16000)

When you're working heuristically on data (e.g. speech-to-text), you tend to want to downsample to the lowest quality that still preserves the information you need. Less is more.
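
Putting that together, here is a minimal sketch of a converter that treats the offsets as sample counts. The 16000 default is an assumption taken from this thread, not something documented by Otter, and `samplesToTimestamp` is a hypothetical helper name. Unlike the earlier snippets, it also emits an hours field for recordings longer than an hour.

```typescript
// Assumption based on this thread: offsets count audio samples at 16 kHz.
const ASSUMED_SAMPLE_RATE_HZ = 16000;

// Hypothetical helper: convert a sample offset to m:ss,
// or h:mm:ss once the recording passes one hour.
function samplesToTimestamp(
  offset: number,
  sampleRateHz: number = ASSUMED_SAMPLE_RATE_HZ
): string {
  if (!Number.isFinite(offset) || offset < 0 || !(sampleRateHz > 0)) {
    throw new RangeError("offset must be >= 0 and sampleRateHz must be > 0");
  }
  const totalSeconds = Math.floor(offset / sampleRateHz);
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  const ss = String(seconds).padStart(2, "0");
  if (hours === 0) {
    return `${minutes}:${ss}`;
  }
  return `${hours}:${String(minutes).padStart(2, "0")}:${ss}`;
}

console.log(samplesToTimestamp(52800)); // 0:03
console.log(samplesToTimestamp(1944960)); // 2:01
console.log(samplesToTimestamp(16000 * 3725)); // 1:02:05
```

Keeping the sample rate as a parameter means the same function still works if a different recording turns out to use another rate.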

omerdn1 commented 1 year ago

@coolaj86 Good catch!

omerdn1 / otter.ai-api, issue #15