pietrop / slate-transcript-editor

A React component to make correcting automated transcriptions of audio and video easier and faster, using the SlateJs editor.
https://pietrop.github.io/slate-transcript-editor

Unbounded word duplication under certain conditions #52

Closed jshearer closed 3 years ago

jshearer commented 3 years ago

Describe the bug

Under circumstances I haven't managed to pin down yet, a word can be created whose start time is >= its end time. When this happens, the word gets duplicated during alignment, so the number of duplicates grows exponentially with each alignment pass.
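
A quick way to check whether a transcript already contains such words is to filter on the timing condition described above. This is only a sketch: `findDegenerateWords` is a hypothetical helper, not part of slate-transcript-editor, and it assumes words are plain objects with numeric `start` and `end` fields as in the DPE fixture in this issue.

```javascript
// Hypothetical helper (not part of slate-transcript-editor): returns the
// words whose start time is not strictly before their end time.
// Using !(start < end) also flags words with missing or NaN timings.
function findDegenerateWords(words) {
  return words.filter((word) => !(word.start < word.end));
}

const words = [
  { start: 0, end: 0.5, text: "hello" },
  { start: 1, end: 1, text: "the" }, // zero duration
  { start: 2, end: 1.5, text: "world" }, // ends before it starts
];

const degenerate = findDegenerateWords(words);
// Logs the texts of the two flagged words: "the" and "world".
console.log(degenerate.map((word) => word.text));
```

Running something like this on the DPE before alignment may help narrow down where the bad timings come from.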

I wrote a failing test for this in the theirstory repo, but since tests don't run in this repo, I'll include it here:

const simpleDpe = {
  paragraphs: [
    {
      speaker: "A",
      start: 0,
      end: 1,
    },
    {
      speaker: "B",
      start: 1,
      end: 2,
    },
  ],
  words: [
    {
      // Degenerate word: start is not before end (zero duration).
      end: 1,
      start: 1,
      text: "the",
    },
  ],
};

import { expect } from "chai";
import convertSlateToDpe from "slate-transcript-editor/util/export-adapters/slate-to-dpe";
import convertDpeToSlate from "slate-transcript-editor/util/dpe-to-slate";
import updateBloocksTimestamps from "slate-transcript-editor/util/export-adapters/slate-to-dpe/update-timestamps/update-bloocks-timestamps";

describe("Alignment", () => {
  it("Should not duplicate words", () => {
    const slate = convertDpeToSlate(simpleDpe);
    const aligned = updateBloocksTimestamps(slate);
    const newDpe = convertSlateToDpe(aligned);
    expect(newDpe?.words).to.eql(simpleDpe?.words); // this fails -- the word is duplicated
  });
});
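
To see how the word count changes over repeated alignments, a small loop like the one below could be used. This is a sketch under assumptions: `wordCountsAfterRounds` is a hypothetical helper, and `fakeBuggyRoundTrip` is a stand-in that only imitates the symptom reported here (one extra copy of every word with start >= end); it is not the real alignment code.

```javascript
// Hypothetical helper: applies `roundTrip` (DPE in, DPE out) `rounds` times
// and records the number of words after each pass.
function wordCountsAfterRounds(dpe, roundTrip, rounds) {
  const counts = [dpe.words.length];
  let current = dpe;
  for (let i = 0; i < rounds; i += 1) {
    current = roundTrip(current);
    counts.push(current.words.length);
  }
  return counts;
}

// Stand-in for the reported symptom only, NOT the real alignment:
// every word whose start is >= its end comes back twice.
function fakeBuggyRoundTrip(dpe) {
  const words = dpe.words.flatMap((word) =>
    word.start >= word.end ? [word, { ...word }] : [word]
  );
  return { ...dpe, words };
}

const sample = { paragraphs: [], words: [{ start: 1, end: 1, text: "the" }] };
const counts = wordCountsAfterRounds(sample, fakeBuggyRoundTrip, 3);
console.log(counts); // word counts per pass: 1, 2, 4, 8
```

In a real check, `roundTrip` would instead compose the three functions imported in the test above (`convertDpeToSlate`, `updateBloocksTimestamps`, `convertSlateToDpe`), and the expectation would be that every count stays equal to the first.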

pietrop commented 3 years ago

This has been addressed, and possibly fixed, in the underlying alignment module: https://github.com/pietrop/stt-align-node/pull/8

More info: https://github.com/pietrop/digital-paper-edit-electron/issues/74#issuecomment-844404845

So I'm closing this issue, but feel free to open a new one if this or something similar shows up again.