Joining lines into regions (chunks)

shreevatsa commented 7 months ago

Right now, all the text from each page of OCR is processed into a single blob of text. We should instead use the line-level information and make smaller regions.

For now / MVP, it would be a good start to just get paragraphs into the output, rather than a single paragraph with all the text: https://github.com/shreevatsa/chaya/blob/d60d4d244b8530334504a3739bf91bf97138e5fb/main.ts#L308-L309 https://github.com/shreevatsa/chaya/blob/d60d4d244b8530334504a3739bf91bf97138e5fb/main.ts#L344-L346

shreevatsa commented 7 months ago

Let's look at what we'd done with the Google OCR response, in https://github.com/shreevatsa/ambuda/blob/line-by-line/ambuda/static/js/pm-editor/pm-editor.ts — writing up in https://docs.google.com/document/d/1isTEALjTOUZvUluE6WBe5y6oP4W2evoaAfOI8NSOdRY/edit

shreevatsa commented 7 months ago

I think 4f1efc3e236cd205848505b9e9b6682a81cab92c should address this. Will try later.

Edit: Seems to be working fine! So we'll call MVP done, and leave it for later to make individual lines.

shreevatsa commented 7 months ago

I think a schema like the following will work:

The OCR response has words (each of them a bounding box)
These are grouped into lines. A line is a minimal (pageNum, [y1 … y2]) tuple such that every word w on page pageNum either lies entirely within the interval (y1 ≤ w.ymin ≤ w.ymax ≤ y2, i.e. the word is on the line), or overlaps the interval by at most some fraction f of the word's height (f = 0.5?).
A chunk is a group of lines (for example a paragraph or a verse, depending on "obeylines", i.e. whether the line breaks are significant), plus an optional label. (We can use this label for footnotes made of multiple chunks, for example.) Note that the page numbers of the lines in a chunk need not all be the same: a chunk can contain lines from multiple pages.
The document is a sequence of chunks.
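
A rough sketch of the word-to-line grouping, to make the overlap rule concrete. This is a greedy approximation and not the exact "minimal tuple" definition: it sorts words by top edge and adds a word to the current line when more than a fraction f of its height overlaps that line. The `Word` and `Line` types and the function names are made up for illustration; a real OCR bounding box would need to be mapped into them.

```typescript
// Hypothetical types for illustration; not the actual OCR response shape.
interface Word { text: string; page: number; ymin: number; ymax: number; }
interface Line { page: number; y1: number; y2: number; words: Word[]; }

// Fraction of the word's height that falls inside the interval [y1, y2].
function overlapFraction(w: Word, y1: number, y2: number): number {
  const height = w.ymax - w.ymin;
  if (height <= 0) return 0;
  const overlap = Math.min(w.ymax, y2) - Math.max(w.ymin, y1);
  return Math.max(0, overlap) / height;
}

// Greedy grouping: walk the words top to bottom (per page) and extend the
// current line while the next word overlaps it by more than f.
function groupIntoLines(words: Word[], f = 0.5): Line[] {
  const lines: Line[] = [];
  const sorted = [...words].sort((a, b) => a.page - b.page || a.ymin - b.ymin);
  for (const w of sorted) {
    const last = lines[lines.length - 1];
    if (last && last.page === w.page && overlapFraction(w, last.y1, last.y2) > f) {
      last.words.push(w);
      last.y1 = Math.min(last.y1, w.ymin);
      last.y2 = Math.max(last.y2, w.ymax);
    } else {
      lines.push({ page: w.page, y1: w.ymin, y2: w.ymax, words: [w] });
    }
  }
  return lines;
}
```

Words within a line are left in top-edge order here; sorting them left to right would need the x coordinates as well.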

To display a chunk, the toDom or NodeView involves showing the image(s) formed by the union of all the (p, [y1 … y2]) intervals on the left, and the union of the individual lines on the right. (The latter could be continuous text without line breaks, i.e. a single <p>, in case of paragraphs?)

How these chunks are created: initially, from the OCR response, we identify lines on the page and put each individual line into its own chunk. Later, the user can select a contiguous sequence of lines, and decide to mark those lines as a chunk (paragraph).
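
A minimal sketch of that lifecycle, assuming plain data structures rather than actual ProseMirror nodes: every line starts in its own chunk, and a contiguous range of chunks can later be merged into one. The names `Line`, `Chunk`, `initialChunks`, and `mergeChunks` are hypothetical.

```typescript
// Hypothetical data model for illustration only.
interface Line { page: number; y1: number; y2: number; text: string; }
interface Chunk { lines: Line[]; label?: string; }

// Initially, each line identified on the page gets its own chunk.
function initialChunks(lines: Line[]): Chunk[] {
  return lines.map((line) => ({ lines: [line] }));
}

// Merge the contiguous chunks from index `from` to index `to` (inclusive)
// into a single chunk, returning a new array and leaving the input untouched.
function mergeChunks(chunks: Chunk[], from: number, to: number, label?: string): Chunk[] {
  if (from < 0 || to >= chunks.length || from > to) {
    throw new RangeError(`invalid range ${from}..${to}`);
  }
  const mergedLines: Line[] = [];
  for (const c of chunks.slice(from, to + 1)) {
    mergedLines.push(...c.lines);
  }
  const merged: Chunk = { lines: mergedLines };
  if (label !== undefined) merged.label = label;
  return [...chunks.slice(0, from), merged, ...chunks.slice(to + 1)];
}
```

Since the lines keep their own page numbers, merging the last line of one page with the first line of the next gives a chunk that spans pages, as allowed above.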

shreevatsa commented 7 months ago

for (let word of response.data.words) { console.log(word.text, word.bbox); }

where response is the Tesseract.RecognizeResult.

Can request with:

const response = await worker.recognize(url, undefined, {
    text: false,
    blocks: true,
    layoutBlocks: false,
    hocr: false,
    tsv: false,
    box: false,
    unlv: false,
    osd: false,
    pdf: false,
    imageColor: false,
    imageGrey: false,
    imageBinary: false,
    debug: false,
});
// Then inspect response.data

shreevatsa commented 6 months ago

I think the next thing is a way to join lines into non-trivial chunks (paragraphs), but for that I'll first start incorporating some of the prosemirror-example-setup.

shreevatsa commented 6 months ago

Adding stuff from prosemirror-example-setup automatically gave a "Join with above block" button, which seems to be useful for paragraphs (not yet tried).

shreevatsa commented 6 months ago

I think the current schema:

https://github.com/shreevatsa/chaya/blob/829a688474ffe9db0f2083df30e4ff63eba92295/main.ts#L107

which is roughly:

doc = chunk*
chunk = (line|heading)*
line = text*
heading = text*

is not very sensible. For example, it allows for chunks that contain some regular lines and some heading lines.

I think other options are:

Option 1

doc = (paragraph|heading)*
paragraph = line*
heading = line*
line = text*

or

Option 2

doc = chunk*
chunk = line*
line = text*

where chunk has the type as an attr (regular paragraph, heading, footnote, verse, etc).
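
For concreteness, Option 2 might look something like the following. This is only a sketch: the object is written in the shape that ProseMirror schema specs use (`content` expressions and `attrs` with defaults), but it is a plain object that has not been passed to `new Schema(...)` here. The attribute name `kind`, the `ChunkKind` values, and the `makeChunk` helper are assumptions for illustration.

```typescript
// Assumed set of chunk types; the real list is still open.
type ChunkKind = "paragraph" | "heading" | "footnote" | "verse";

// Option 2 as a ProseMirror-style spec object (plain data, not a Schema).
const option2Spec = {
  nodes: {
    doc: { content: "chunk*" },
    chunk: {
      content: "line*",
      // The chunk's type lives in an attribute instead of in separate node types.
      attrs: { kind: { default: "paragraph" as ChunkKind } },
    },
    line: { content: "text*" },
    text: {},
  },
};

// JSON shapes for a chunk node under this spec.
interface TextJSON { type: "text"; text: string; }
interface LineJSON { type: "line"; content?: TextJSON[]; }
interface ChunkJSON { type: "chunk"; attrs: { kind: ChunkKind }; content: LineJSON[]; }

// Hypothetical helper: build the JSON for one chunk from its line strings.
// Empty lines get no text child, since empty text nodes are not allowed.
function makeChunk(kind: ChunkKind, lines: string[]): ChunkJSON {
  return {
    type: "chunk",
    attrs: { kind },
    content: lines.map((text): LineJSON =>
      text.length > 0 ? { type: "line", content: [{ type: "text", text }] } : { type: "line" }),
  };
}
```

With this shape, a chunk has exactly one kind, so the mixed "some regular lines and some heading lines" case from the current schema cannot be expressed.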

shreevatsa commented 6 months ago

Grouping is there, so I'll call this working.

shreevatsa / chaya

Joining lines into regions (chunks) #7
