Closed shreevatsa closed 6 months ago
Let's look at what we'd done with the Google OCR response, in https://github.com/shreevatsa/ambuda/blob/line-by-line/ambuda/static/js/pm-editor/pm-editor.ts — writing up in https://docs.google.com/document/d/1isTEALjTOUZvUluE6WBe5y6oP4W2evoaAfOI8NSOdRY/edit
I think 4f1efc3e236cd205848505b9e9b6682a81cab92c should address this. Will try later.
Edit: Seems to be working fine! So we'll call MVP done, and leave it for later to make individual lines.
I think a schema like the following will work:
The `p`s can be distinct, e.g. a chunk can have lines from multiple pages. To display a chunk, the `toDOM` or `NodeView` involves showing the image(s) formed by the union of all the (p, [y1 … y2]) intervals on the left, and the union of the individual lines on the right. (The latter could be continuous text without line breaks, i.e. a single `<p>`, in case of paragraphs?)
How these chunks are created: initially, from the OCR response, we identify lines on the page and put each individual line into its own chunk. Later, the user can select a contiguous sequence of lines, and decide to mark those lines as a chunk (paragraph).
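A minimal sketch of that flow, not the actual implementation: every name here (`OcrLine`, `Chunk`, `Region`, `chunksFromLines`, `joinChunks`, `imageRegions`) is hypothetical, and it assumes each line carries its page number and vertical extent. Each line starts in its own chunk, a contiguous run of chunks can be joined, and the image regions of a chunk are the union of its (p, [y1 … y2]) intervals.

```typescript
// Hypothetical types: an OCR line is identified by its page and vertical extent.
interface OcrLine {
  page: number; // page number p
  y1: number;   // top of the line on the page image
  y2: number;   // bottom of the line on the page image
  text: string;
}

interface Chunk {
  lines: OcrLine[];
}

interface Region {
  page: number;
  y1: number;
  y2: number;
}

// Initially, every line goes into its own chunk.
function chunksFromLines(lines: OcrLine[]): Chunk[] {
  return lines.map((line) => ({ lines: [line] }));
}

// Join the contiguous chunks at indices first..last (inclusive) into one chunk.
// Returns a new array; the input is not modified.
function joinChunks(chunks: Chunk[], first: number, last: number): Chunk[] {
  if (first < 0 || last >= chunks.length || first > last) {
    throw new RangeError(`invalid chunk range ${first}..${last}`);
  }
  const merged: Chunk = { lines: [] };
  for (const chunk of chunks.slice(first, last + 1)) {
    merged.lines.push(...chunk.lines);
  }
  return [...chunks.slice(0, first), merged, ...chunks.slice(last + 1)];
}

// Union of the (p, [y1 … y2]) intervals of a chunk: intervals on the same
// page that overlap or touch are merged; others stay separate regions.
function imageRegions(chunk: Chunk): Region[] {
  const sorted: Region[] = chunk.lines
    .map(({ page, y1, y2 }) => ({ page, y1, y2 }))
    .sort((a, b) => a.page - b.page || a.y1 - b.y1);
  const regions: Region[] = [];
  for (const r of sorted) {
    const prev = regions[regions.length - 1];
    if (prev && prev.page === r.page && r.y1 <= prev.y2) {
      prev.y2 = Math.max(prev.y2, r.y2);
    } else {
      regions.push(r);
    }
  }
  return regions;
}
```

Keeping the lines inside the chunk (rather than only the merged text) means a chunk can later be split back into individual lines without consulting the OCR response again.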
for (let word of response.data.words) { console.log(word.text, word.bbox); }
where `response` is the `Tesseract.RecognizeResult`.
Can request with:
response = await worker.recognize(url, undefined, {
  text: false,
  blocks: true,
  layoutBlocks: false,
  hocr: false,
  tsv: false,
  box: false,
  unlv: false,
  osd: false,
  pdf: false,
  imageColor: false,
  imageGrey: false,
  imageBinary: false,
  debug: false,
});
response.data
I think the next thing is a way to join lines into non-trivial chunks (paragraphs), but for that I'll first start incorporating some of the prosemirror-example-setup.
Adding stuff from prosemirror-example-setup automatically gave a "Join with above block" button, which seems to be useful for paragraphs (not yet tried).
I think the current schema:
https://github.com/shreevatsa/chaya/blob/829a688474ffe9db0f2083df30e4ff63eba92295/main.ts#L107
which is roughly:
doc = chunk*
chunk = (line|heading)*
line = text*
heading = text*
is not very sensible. For example, it allows for chunks that contain some regular lines and some heading lines.
I think other options are:
doc = (paragraph|heading)*
paragraph = line*
heading = line*
line = text*
or
doc = chunk*
chunk = line*
line = text*
where `chunk` has the type as an attr (regular paragraph, heading, footnote, verse, etc).
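To make the difference concrete, here is a rough sketch of the last option as plain TypeScript data types, plus a conversion from the current mixed shape. All names (`ChunkKind`, `MixedLine`, `MixedChunk`, `TypedChunk`, `toTypedChunks`) are hypothetical, and this models only the document shape, not the ProseMirror schema itself. A mixed chunk is split into runs of same-type lines, so a chunk can no longer contain both regular and heading lines.

```typescript
// Hypothetical names throughout; the kinds come from the list in the text above.
type ChunkKind = "paragraph" | "heading" | "footnote" | "verse";

// Current shape: chunk = (line|heading)*, so one chunk may mix both.
interface MixedLine {
  type: "line" | "heading";
  text: string;
}
type MixedChunk = MixedLine[];

// Proposed shape: doc = chunk*, chunk = line*, with the kind as an attr on the chunk.
interface TypedChunk {
  kind: ChunkKind;
  lines: string[];
}

// Split every mixed chunk into runs of same-type lines, one typed chunk per run.
// Existing chunk boundaries are preserved; empty input chunks produce no output.
function toTypedChunks(doc: MixedChunk[]): TypedChunk[] {
  const out: TypedChunk[] = [];
  for (const chunk of doc) {
    let current: TypedChunk | undefined = undefined;
    for (const line of chunk) {
      const kind: ChunkKind = line.type === "heading" ? "heading" : "paragraph";
      if (!current || current.kind !== kind) {
        current = { kind, lines: [] };
        out.push(current);
      }
      current.lines.push(line.text);
    }
  }
  return out;
}
```

With the kind stored as an attr, adding a new kind (say, footnote) only extends `ChunkKind`; the content expression `chunk = line*` stays the same.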
Grouping is there, so I'll call this working.
Right now, all the text from each page of OCR is processed into a single blob of text. We should instead use the line-level information and make smaller regions.
For now / MVP, it would be a good start to just get paragraphs into the output, rather than a single paragraph with all the text: https://github.com/shreevatsa/chaya/blob/d60d4d244b8530334504a3739bf91bf97138e5fb/main.ts#L308-L309 https://github.com/shreevatsa/chaya/blob/d60d4d244b8530334504a3739bf91bf97138e5fb/main.ts#L344-L346
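One possible way to get there, sketched under the assumption that the OCR result exposes a blocks → paragraphs → lines hierarchy (as Tesseract.js does when recognition is requested with `blocks: true`). The interfaces below are minimal local stand-ins for just the fields used, and `paragraphsFromBlocks` and `ParagraphRegion` are hypothetical names, not code from the repository.

```typescript
// Minimal structural types, assumed to mirror the parts of the OCR block
// hierarchy used here (blocks -> paragraphs -> lines).
interface Bbox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}
interface OcrLineLike {
  text: string;
  bbox: Bbox;
}
interface OcrParagraphLike {
  lines: OcrLineLike[];
}
interface OcrBlockLike {
  paragraphs: OcrParagraphLike[];
}

// One output region per OCR paragraph: its vertical extent and its lines.
interface ParagraphRegion {
  y0: number;
  y1: number;
  lines: string[];
}

function paragraphsFromBlocks(blocks: OcrBlockLike[] | null): ParagraphRegion[] {
  const result: ParagraphRegion[] = [];
  for (const block of blocks ?? []) {
    for (const para of block.paragraphs) {
      // Skip whitespace-only lines so they do not create empty regions.
      const lines = para.lines.filter((l) => l.text.trim() !== "");
      if (lines.length === 0) continue;
      result.push({
        y0: Math.min(...lines.map((l) => l.bbox.y0)),
        y1: Math.max(...lines.map((l) => l.bbox.y1)),
        lines: lines.map((l) => l.text.trim()),
      });
    }
  }
  return result;
}
```

Each returned region could then become one paragraph node in the editor instead of the whole page becoming a single paragraph.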