microsoft / genalog

Genalog is an open-source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.
https://microsoft.github.io/genalog/
MIT License
296 stars, 29 forks

Adding ability to extract a template CSS from a given PDF or image file #43

Open document-intelligence opened 2 years ago

document-intelligence commented 2 years ago

Genalog is great at generating a synthetic document from a given template, but coming up with a template is still a pain.

Wouldn't it be great if I could just point Genalog to a PDF or image and ask it to synthesize more documents like that?

In other words, can we add the functionality of extracting a CSS template out of a given PDF/image, to complete the cycle?

Thanks!

Document Intelligence

laserprec commented 2 years ago

Hi Ben! Thank you for your suggestions! I totally agree. This would be a valuable addition that boosts Genalog’s utility!

For this feature, I think Layout Parser looks promising for doing most of the heavy lifting of extracting layouts. However, it does not currently support exporting layouts in HTML format (as of late, it exports layout information in JSON and CSV). So there are some feature gaps to fill before Genalog can consume its output as HTML files.

I am not aware of any existing document layout standards that we could reuse or adopt to make this JSON-to-HTML conversion easy. I would love to get some suggestions if anyone reading this has experience in this matter.
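
To make the JSON-to-HTML gap concrete, here is a minimal sketch of what such a conversion could look like. The layout schema below (a `page` size plus `blocks` with `type`, `x`, `y`, `w`, `h`, and `text`) is a hypothetical format invented for illustration, not Layout Parser's actual export format, and `layout_to_html` is a hypothetical helper, not part of Genalog. It simply places each detected block as an absolutely positioned `div`.

```python
import json
from html import escape

# Hypothetical layout schema (an assumption, NOT Layout Parser's real export):
# a page size plus a list of blocks with pixel coordinates and optional text.
SAMPLE_LAYOUT = """
{
  "page": {"width": 800, "height": 1000},
  "blocks": [
    {"type": "title", "x": 50, "y": 40, "w": 700, "h": 60, "text": "Quarterly Report"},
    {"type": "text", "x": 50, "y": 120, "w": 340, "h": 400, "text": "Lorem & ipsum"}
  ]
}
"""


def layout_to_html(layout: dict) -> str:
    """Render each layout block as an absolutely positioned <div>."""
    page = layout["page"]
    parts = [
        "<!DOCTYPE html>",
        "<html><head><style>",
        f".page {{ position: relative; width: {page['width']}px; "
        f"height: {page['height']}px; }}",
        ".block { position: absolute; overflow: hidden; }",
        "</style></head><body>",
        '<div class="page">',
    ]
    for block in layout["blocks"]:
        # Inline style carries the block's bounding box.
        style = (
            f"left: {block['x']}px; top: {block['y']}px; "
            f"width: {block['w']}px; height: {block['h']}px;"
        )
        css_class = escape(block["type"], quote=True)
        text = escape(block.get("text", ""))  # escape &, <, > in extracted text
        parts.append(f'<div class="block {css_class}" style="{style}">{text}</div>')
    parts.append("</div></body></html>")
    return "\n".join(parts)


html_doc = layout_to_html(json.loads(SAMPLE_LAYOUT))
print(html_doc)
```

Absolute positioning keeps the sketch short, but it would probably not generalize well to content of varying length. A real template extractor would likely need to infer flow-based structure (columns, headers, paragraphs) instead.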

jgc128 commented 2 years ago

I'm pretty sure I saw some papers on converting a sketch to HTML code. I'll have a look. It would be a great addition to Genalog.