Closed shreevatsa closed 3 months ago
Let's do a state machine.
Start state = nothing uploaded.
stateDiagram-v2
[*] --> pdf_picked
pdf_picked --> pdfjs_initialized
pdfjs_initialized --> pdfjs_rendered_1
pdfjs_rendered_1 --> pdfjs_rendered_2
pdfjs_rendered_2 --> pdfjs_rendered_3
pdfjs_rendered_3 --> [*]
Next comes the `.sc` file:
stateDiagram-v2
[*] --> pdf_picked
pdf_picked --> pdfjs_initialized
pdfjs_initialized --> pdfjs_rendered_1
pdfjs_rendered_1 --> pdfjs_rendered_2
pdfjs_rendered_2 --> pdfjs_rendered_3
pdfjs_rendered_3 --> [*]
pdf_picked --> sc_picked
pdf_picked --> sc_new
And the PM state:
stateDiagram-v2
[*] --> pdf_picked
pdf_picked --> pdfjs_initialized
pdfjs_initialized --> pdfjs_rendered_1
pdfjs_rendered_1 --> pdfjs_rendered_2
pdfjs_rendered_1 --> ocr_1
pdfjs_rendered_2 --> pdfjs_rendered_3
pdfjs_rendered_2 --> ocr_2
pdfjs_rendered_3 --> pdfjs_done
pdfjs_rendered_3 --> ocr_3
pdfjs_done --> pm_done
pdf_picked --> sc_new
sc_new --> ocr_1
ocr_1 --> ocr_2
ocr_2 --> ocr_3
ocr_3 --> ocr_done
ocr_done --> pm_done
pdf_picked --> sc_picked
sc_picked --> pm_done
pm_done --> [*]
I'm not sure the above is the only way: we may instead choose to start OCR only after all pages are rendered.
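To make the third diagram easier to check, here is a minimal sketch that transcribes its edges into a transition table. The names `start` and `end` (standing in for the `[*]` pseudo-states), `TRANSITIONS`, `canTransition` and `reachableFrom` are assumptions for illustration, not existing code. It treats the diagram as a plain graph of allowed edges and does not model the rendering and OCR tracks running in parallel.

```javascript
// Transition table transcribed from the third diagram above.
// "start" and "end" stand in for the [*] pseudo-states (assumed names).
const TRANSITIONS = {
  start: ["pdf_picked"],
  pdf_picked: ["pdfjs_initialized", "sc_new", "sc_picked"],
  pdfjs_initialized: ["pdfjs_rendered_1"],
  pdfjs_rendered_1: ["pdfjs_rendered_2", "ocr_1"],
  pdfjs_rendered_2: ["pdfjs_rendered_3", "ocr_2"],
  pdfjs_rendered_3: ["pdfjs_done", "ocr_3"],
  pdfjs_done: ["pm_done"],
  sc_new: ["ocr_1"],
  sc_picked: ["pm_done"],
  ocr_1: ["ocr_2"],
  ocr_2: ["ocr_3"],
  ocr_3: ["ocr_done"],
  ocr_done: ["pm_done"],
  pm_done: ["end"],
  end: [],
};

// True if the diagram has a direct edge from `from` to `to`.
function canTransition(from, to) {
  return (TRANSITIONS[from] || []).includes(to);
}

// Breadth-first search: every state reachable from `state`, including itself.
function reachableFrom(state) {
  const seen = new Set([state]);
  const queue = [state];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const next of TRANSITIONS[current] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return seen;
}
```

Starting OCR only after all pages are rendered would amount to dropping the three `pdfjs_rendered_N --> ocr_N` edges and adding one from `pdfjs_done` to `ocr_1`.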
It's more that there are a few separate tracks:
I think it makes sense to OCR the canvas element directly, as this is supported: https://github.com/naptha/tesseract.js/blob/master/docs/image-format.md
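A minimal sketch of what OCR on the canvas could look like, assuming tesseract.js is loaded in the page and its `Tesseract.recognize(image, lang)` call is used. The helper name `ocrCanvas` and the choice to pass the recognizer in as a parameter are assumptions made here so the function can be tried without the library.

```javascript
// Sketch: OCR a rendered page canvas directly, without first converting it
// to an image file. `recognize` is injected; in the browser it would be
// `Tesseract.recognize` from tesseract.js (assumed to be loaded).
async function ocrCanvas(canvas, recognize, lang = "eng") {
  // tesseract.js resolves to a result object whose `data.text` holds the
  // recognized text for the whole image.
  const result = await recognize(canvas, lang);
  return result.data.text;
}
```

In the page this would be called as `ocrCanvas(canvas, Tesseract.recognize)` once pdf.js has finished rendering that page's canvas.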
Made some changes and page is freezing.
See
The freezing was fixed (with 0f8eaa05669870c4c626c24b6477d206bb1b98ef) and the saving is probably fine; what's left is to load.
What do we load? Currently, the save (not very well thought-out) is `view.state.doc.toJSON()`. So we'd need to be able to create the PM from this doc?
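One way the round trip could work, as a sketch only: wrap the document JSON in a small envelope when saving, and validate it when loading. The envelope shape, the `version` field, and the names `serializeSc` / `deserializeSc` are all hypothetical; nothing here defines what the `.sc` format actually is.

```javascript
// Hypothetical `.sc` envelope: a version tag plus the ProseMirror document JSON.
const SC_FORMAT_VERSION = 1; // assumed; no file format has been defined yet

// `docJson` is what `view.state.doc.toJSON()` returns.
function serializeSc(docJson) {
  return JSON.stringify({ version: SC_FORMAT_VERSION, doc: docJson });
}

// Returns the document JSON, to be handed to ProseMirror's
// `Node.fromJSON(schema, docJson)` with the same schema used when saving.
function deserializeSc(text) {
  const parsed = JSON.parse(text);
  if (
    parsed === null ||
    typeof parsed !== "object" ||
    parsed.version !== SC_FORMAT_VERSION
  ) {
    throw new Error("Unrecognized .sc file");
  }
  return parsed.doc;
}
```

On load, the returned JSON would presumably go through `Node.fromJSON(schema, docJson)`, and the resulting node into `EditorState.create({ doc })` followed by `view.updateState(...)`, assuming the schema has not changed since the file was saved.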
I think I'll say that save/load are working, and close this. Can create new issues for OCR, and proceed from there.
Before getting further into the code, would be good to implement the feature to:
`.sc` file, and
Part of this is UI work, and part is the actual serialization / deserialization.