shreevatsa / chaya


Joining lines into regions (chunks) #7

Closed shreevatsa closed 2 months ago

shreevatsa commented 2 months ago

Right now, all the text from each page of OCR is processed into a single blob of text. We should instead use the line-level information and make smaller regions.

For now / MVP, it would be a good start to just get paragraphs into the output, rather than a single paragraph with all the text: https://github.com/shreevatsa/chaya/blob/d60d4d244b8530334504a3739bf91bf97138e5fb/main.ts#L308-L309 https://github.com/shreevatsa/chaya/blob/d60d4d244b8530334504a3739bf91bf97138e5fb/main.ts#L344-L346
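
As a rough illustration of what "paragraphs instead of one blob" could mean, here is a minimal sketch that splits a page's OCR lines into paragraphs wherever the vertical gap between consecutive lines is unusually large. The `OcrLine` shape, the `splitIntoParagraphs` name, and the gap threshold are all assumptions made for illustration; this is not what main.ts does.

```typescript
// Hypothetical shape for one OCR line; y0/y1 are the top and bottom
// pixel coordinates of its bounding box on the page image.
interface OcrLine {
  text: string;
  y0: number;
  y1: number;
}

// Split lines (assumed sorted top to bottom) into paragraphs: a new
// paragraph starts when the gap to the previous line exceeds
// gapFactor times the previous line's height.
function splitIntoParagraphs(lines: OcrLine[], gapFactor = 0.8): OcrLine[][] {
  const paragraphs: OcrLine[][] = [];
  let current: OcrLine[] = [];
  for (const line of lines) {
    const prev = current[current.length - 1];
    if (prev !== undefined) {
      const gap = line.y0 - prev.y1;
      const height = prev.y1 - prev.y0;
      if (gap > gapFactor * height) {
        paragraphs.push(current);
        current = [];
      }
    }
    current.push(line);
  }
  if (current.length > 0) paragraphs.push(current);
  return paragraphs;
}

// Made-up sample data: two closely spaced lines, then a large gap.
const sampleLines: OcrLine[] = [
  { text: "first line", y0: 10, y1: 30 },
  { text: "second line", y0: 34, y1: 54 },
  { text: "new paragraph", y0: 90, y1: 110 },
];
const paragraphTexts = splitIntoParagraphs(sampleLines).map((p) =>
  p.map((l) => l.text).join(" ")
);
```

A gap-based heuristic like this would likely need tuning per book (indentation or line length may be better paragraph signals), so it is only a starting point.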

shreevatsa commented 2 months ago

Let's look at what we'd done with the Google OCR response, in https://github.com/shreevatsa/ambuda/blob/line-by-line/ambuda/static/js/pm-editor/pm-editor.ts — writing up in https://docs.google.com/document/d/1isTEALjTOUZvUluE6WBe5y6oP4W2evoaAfOI8NSOdRY/edit

shreevatsa commented 2 months ago

I think 4f1efc3e236cd205848505b9e9b6682a81cab92c should address this. Will try later.

Edit: Seems to be working fine! So we'll call MVP done, and leave it for later to make individual lines.

shreevatsa commented 2 months ago

I think a schema like the following will work:


To display a chunk, the toDOM or NodeView shows, on the left, the image(s) formed by the union of all the (p, [y1 … y2]) intervals and, on the right, the text of the individual lines joined together. (The latter could be continuous text without line breaks, i.e. a single <p>, in the case of paragraphs?)


How these chunks are created: initially, from the OCR response, we identify lines on the page and put each individual line into its own chunk. Later, the user can select a contiguous sequence of lines, and decide to mark those lines as a chunk (paragraph).
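
The chunk lifecycle described above can be sketched with plain data, independent of ProseMirror. The `Line`, `Chunk`, and `Region` types and the function names are hypothetical; the sketch only assumes that each line carries its page number and the vertical extent of its bounding box.

```typescript
// Hypothetical data model: each line records the page it came from and
// the vertical extent [y1, y2] of its bounding box on that page.
interface Line {
  page: number;
  y1: number;
  y2: number;
  text: string;
}

interface Chunk {
  lines: Line[];
}

interface Region {
  page: number;
  y1: number;
  y2: number;
}

// Initially, every OCR line becomes its own single-line chunk.
function initialChunks(lines: Line[]): Chunk[] {
  return lines.map((line) => ({ lines: [line] }));
}

// Merge the contiguous chunks at indices [from, to] (inclusive) into one.
function joinChunks(chunks: Chunk[], from: number, to: number): Chunk[] {
  if (from < 0 || to >= chunks.length || from > to) {
    throw new RangeError(`invalid chunk range ${from}..${to}`);
  }
  const merged: Chunk = {
    lines: chunks
      .slice(from, to + 1)
      .reduce<Line[]>((acc, c) => acc.concat(c.lines), []),
  };
  return [...chunks.slice(0, from), merged, ...chunks.slice(to + 1)];
}

// Image side: union of the (page, [y1, y2]) intervals of a chunk's lines,
// merging intervals on the same page that overlap or touch.
function imageRegions(chunk: Chunk): Region[] {
  const sorted = chunk.lines
    .map(({ page, y1, y2 }) => ({ page, y1, y2 }))
    .sort((a, b) => a.page - b.page || a.y1 - b.y1);
  const regions: Region[] = [];
  for (const r of sorted) {
    const last = regions[regions.length - 1];
    if (last !== undefined && last.page === r.page && r.y1 <= last.y2) {
      last.y2 = Math.max(last.y2, r.y2);
    } else {
      regions.push(r);
    }
  }
  return regions;
}

// Text side: continuous text without line breaks, as for a single <p>.
function chunkText(chunk: Chunk): string {
  return chunk.lines.map((l) => l.text).join(" ");
}
```

Note that with this strict union, lines separated by whitespace on the page stay as separate image strips; taking the bounding span from the first line's y1 to the last line's y2 on each page would be a simpler alternative if a single strip per page is preferred.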

shreevatsa commented 2 months ago

for (const word of response.data.words) {
    console.log(word.text, word.bbox);
}

where response is the Tesseract.RecognizeResult.

Can request with:

const response = await worker.recognize(url, undefined, {
    text: false,
    blocks: true,
    layoutBlocks: false,
    hocr: false,
    tsv: false,
    box: false,
    unlv: false,
    osd: false,
    pdf: false,
    imageColor: false,
    imageGrey: false,
    imageBinary: false,
    debug: false,
});
// The results are then available in response.data.

shreevatsa commented 2 months ago

I think the next thing is a way to join lines into non-trivial chunks (paragraphs), but for that I'll first start incorporating some of the prosemirror-example-setup.

shreevatsa commented 2 months ago

Adding stuff from prosemirror-example-setup automatically gave a "Join with above block" button, which seems to be useful for paragraphs (not yet tried).

shreevatsa commented 2 months ago

I think the current schema:

https://github.com/shreevatsa/chaya/blob/829a688474ffe9db0f2083df30e4ff63eba92295/main.ts#L107

which is roughly:

doc = chunk*
chunk = (line|heading)*
line = text*
heading = text*

is not very sensible. For example, it allows for chunks that contain some regular lines and some heading lines.

I think other options are:

Option 1
doc = (paragraph|heading)*
paragraph = line*
heading = line*
line = text*

or

Option 2
doc = chunk*
chunk = line*
line = text*

where chunk has the type as an attr (regular paragraph, heading, footnote, verse, etc).
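
Option 2 might look roughly like the following as a ProseMirror-style node spec. This is a sketch under assumptions, not the schema in main.ts: the attr name `kind`, its allowed values, and the DOM rendering are invented for illustration. The `nodes` object follows the general shape that prosemirror-model's `Schema` constructor accepts (content expressions, `attrs` with defaults, `toDOM`), but it is written as a plain object with local stand-in types so that it stands alone.

```typescript
// Assumed set of chunk types; the real list may differ.
type ChunkKind = "paragraph" | "heading" | "footnote" | "verse";

// Minimal local stand-ins for the parts of a ProseMirror node spec used here.
type DomOutput = [string, Record<string, string>, 0];
interface SketchNode {
  attrs: Record<string, unknown>;
}
interface SketchNodeSpec {
  content?: string;
  attrs?: Record<string, { default: unknown }>;
  toDOM?: (node: SketchNode) => DomOutput;
}

const defaultKind: ChunkKind = "paragraph";

// Render a chunk as a <div> that records its kind; the 0 marks the
// "hole" where the chunk's child lines are rendered.
function chunkToDOM(node: SketchNode): DomOutput {
  return ["div", { class: "chunk", "data-kind": String(node.attrs["kind"]) }, 0];
}

function lineToDOM(): DomOutput {
  return ["div", { class: "line" }, 0];
}

// doc = chunk*, chunk = line*, line = text*, with the chunk type as an attr.
const nodes: Record<string, SketchNodeSpec> = {
  doc: { content: "chunk*" },
  chunk: {
    content: "line*",
    attrs: { kind: { default: defaultKind } },
    toDOM: chunkToDOM,
  },
  line: { content: "text*", toDOM: lineToDOM },
  text: {},
};
```

One consequence of this design is that changing a paragraph into a heading is only an attr change on the chunk, whereas under Option 1 it would mean replacing the node with one of a different type.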

shreevatsa commented 2 months ago

Grouping is there, so I'll call this working.