shreevatsa / chaya

Joining lines into regions (chunks) #7

Closed shreevatsa closed 2 months ago

shreevatsa commented 2 months ago

Right now, all the text from each page of OCR is processed into a single blob of text. We should instead use the line-level information and make smaller regions.

For now / MVP, it would be a good start to just get paragraphs into the output, rather than a single paragraph with all the text:

shreevatsa commented 2 months ago

Let's look at what we'd done with the Google OCR response, in — writing up in

shreevatsa commented 2 months ago

I think 4f1efc3e236cd205848505b9e9b6682a81cab92c should address this. Will try later.

Edit: Seems to be working fine! So we'll call MVP done, and leave it for later to make individual lines.

shreevatsa commented 2 months ago

I think a schema like the following will work:

To display a chunk, the toDOM or NodeView shows the image(s) formed by the union of all the (p, [y1 … y2]) intervals on the left, and the union of the individual lines on the right. (The latter could be continuous text without line breaks, i.e. a single <p>, in the case of paragraphs?)
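A rough sketch of the "union of intervals" idea, assuming each line carries a page number and a vertical extent. The field names `page`, `y1`, `y2`, `text` and both function names are hypothetical, not taken from the repository:

```javascript
// Hypothetical line shape: { page, y1, y2, text }.
// Returns the image strips to show on the left: per page, the union of
// all [y1, y2] intervals, with overlapping intervals merged.
function imageStrips(lines) {
  // Group the vertical intervals by page number.
  const byPage = new Map();
  for (const { page, y1, y2 } of lines) {
    if (!byPage.has(page)) byPage.set(page, []);
    byPage.get(page).push([y1, y2]);
  }
  const strips = [];
  const pages = [...byPage.keys()].sort((a, b) => a - b);
  for (const page of pages) {
    const intervals = byPage.get(page).sort((a, b) => a[0] - b[0]);
    let start = intervals[0][0];
    let end = intervals[0][1];
    for (const [y1, y2] of intervals.slice(1)) {
      if (y1 <= end) {
        // Overlaps (or touches) the current strip: extend it.
        end = Math.max(end, y2);
      } else {
        // Gap: close the current strip and start a new one.
        strips.push({ page, y1: start, y2: end });
        start = y1;
        end = y2;
      }
    }
    strips.push({ page, y1: start, y2: end });
  }
  return strips;
}

// Text to show on the right: the lines joined into continuous text,
// which would suit the single-<p> rendering for paragraphs.
function chunkText(lines) {
  return lines.map((line) => line.text).join(" ");
}
```

A chunk that spans a page break simply yields one strip per page.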

How these chunks are created: initially, from the OCR response, we identify lines on the page and put each individual line into its own chunk. Later, the user can select a contiguous sequence of lines, and decide to mark those lines as a chunk (paragraph).
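The two steps above could look roughly like this on plain data, before any editor is involved. The chunk shape `{ lines: [...] }` and the names `linesToChunks` and `joinChunks` are assumptions made for illustration:

```javascript
// Step 1: every OCR line starts out as its own single-line chunk.
function linesToChunks(lines) {
  return lines.map((line) => ({ lines: [line] }));
}

// Step 2: the user selects a contiguous run of chunks (indices from..to,
// inclusive) and marks it as one chunk. Returns a new array; the input
// is left unchanged.
function joinChunks(chunks, from, to) {
  if (from < 0 || to >= chunks.length || from > to) {
    throw new RangeError(`invalid selection ${from}..${to}`);
  }
  const merged = {
    lines: chunks.slice(from, to + 1).flatMap((chunk) => chunk.lines),
  };
  return [...chunks.slice(0, from), merged, ...chunks.slice(to + 1)];
}
```

For example, `joinChunks(linesToChunks(["a", "b", "c", "d"]), 1, 2)` gives three chunks, the middle one holding lines "b" and "c".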

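shreevatsa commented 2 months ago

// The iterated expression was lost from this snippet; the loops below assume
// the blocks → paragraphs → lines → words nesting of the Tesseract.js result.
for (let block of response.data.blocks)
  for (let paragraph of block.paragraphs)
    for (let line of paragraph.lines)
      for (let word of line.words) { console.log(word.text, word.bbox); }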

where response is the Tesseract.RecognizeResult.
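Since the goal of this issue is line-level regions rather than words, here is a sketch of collecting lines instead. It assumes the result nests as blocks → paragraphs → lines, and that each line has `text` and a `bbox` of the form `{ x0, y0, x1, y1 }`. The helper name `collectLines` is made up:

```javascript
// Flatten a recognize() result into a list of lines with their boxes.
// Assumption: response.data.blocks is an array (or null when blocks
// were not requested), nested as blocks -> paragraphs -> lines.
function collectLines(response) {
  const lines = [];
  for (const block of response.data.blocks ?? []) {
    for (const paragraph of block.paragraphs) {
      for (const line of paragraph.lines) {
        // OCR line text usually ends with a newline; trim it.
        lines.push({ text: line.text.trim(), bbox: line.bbox });
      }
    }
  }
  return lines;
}
```

Each entry's `bbox.y0` and `bbox.y1` would then supply the vertical interval for that line's region.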

Can request with:

response = await worker.recognize(url, undefined, {
    text: false,
    blocks: true,
    layoutBlocks: false,
    hocr: false,
    tsv: false,
    box: false,
    unlv: false,
    osd: false,
    pdf: false,
    imageColor: false,
    imageGrey: false,
    imageBinary: false,
    debug: false,
});

shreevatsa commented 2 months ago

I think the next thing is a way to join lines into non-trivial chunks (paragraphs), but for that I'll first start incorporating some of the prosemirror-example-setup.

shreevatsa commented 2 months ago

Adding stuff from prosemirror-example-setup automatically gave a "Join with above block" button, which seems to be useful for paragraphs (not yet tried).

shreevatsa commented 2 months ago

I think the current schema:

which is roughly:

doc = chunk*
chunk = (line|heading)*
line = text*
heading = text*

is not very sensible. For example, it allows for chunks that contain some regular lines and some heading lines.

I think other options are:

Option 1
doc = (paragraph|heading)*
paragraph = line*
heading = line*
line = text*


Option 2
doc = chunk*
chunk = line*
line = text*

where chunk has the type as an attr (regular paragraph, heading, footnote, verse, etc).
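As a sketch only, Option 2 might be written as a ProseMirror node spec along these lines. The attr name `type`, the default value, the class names, and the `CHUNK_TYPES` list are assumptions for illustration, not what the repository uses:

```javascript
// Hypothetical list of chunk types, taken from the examples above.
const CHUNK_TYPES = ["paragraph", "heading", "footnote", "verse"];

// Node specs for Option 2: doc = chunk*, chunk = line*, line = text*.
const option2Nodes = {
  doc: { content: "chunk*" },
  chunk: {
    content: "line*",
    // The kind of chunk lives in an attr instead of a separate node type.
    attrs: { type: { default: "paragraph" } },
    toDOM(node) {
      return ["div", { class: "chunk", "data-type": node.attrs.type }, 0];
    },
  },
  line: {
    content: "text*",
    toDOM() {
      return ["div", { class: "line" }, 0];
    },
  },
  text: {},
};

// Guard for code that sets the attr, so typos do not create new types.
function chunkAttrs(type) {
  if (!CHUNK_TYPES.includes(type)) {
    throw new Error(`unknown chunk type: ${type}`);
  }
  return { type };
}
```

Passing this to `new Schema({ nodes: option2Nodes })` from prosemirror-model should build the schema. Because every chunk contains only `line` nodes, a chunk cannot mix regular lines and heading lines, and changing a paragraph into a heading is an attr update rather than a node-type change.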

shreevatsa commented 2 months ago

Grouping is there, so I'll call this working.