pdfliberation / python-hocrgeo

Python tool for converting hOCR files to geographic file formats
BSD 3-Clause "New" or "Revised" License

Create hocr parser #1

Open dcloud opened 10 years ago

dcloud commented 10 years ago

Found an existing one on GitHub, but it didn't work. Let's see if we can quickly write one of our own.

dcloud commented 10 years ago

Looking at https://gist.github.com/dcloud/9173113, fwiw

dcloud commented 10 years ago

Basics done in 9036cdc651f497e72383c597d23e19ec46095c0d.

jsfenfen commented 10 years ago

Note that spans for words are sometimes ocrx_word and sometimes just ocr_word -- in other words, the x is sometimes missing.
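A minimal sketch of how a parser could tolerate both spellings, using only the standard library's `html.parser`. The names `WORD_CLASSES`, `WordSpanParser`, `parse_bbox`, and `extract_words` are hypothetical and not taken from this repository. It also assumes that each word span carries a `title` attribute with an integer `bbox x0 y0 x1 y1` property.

```python
from html.parser import HTMLParser

# Assumption: both spellings mark word-level spans.
WORD_CLASSES = {"ocrx_word", "ocr_word"}


def parse_bbox(title):
    """Return the bbox from a title attribute as a tuple of ints, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordSpanParser(HTMLParser):
    """Collect (text, bbox) pairs from word-level spans in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, bbox) tuples
        self._bbox = None  # bbox of the word span currently being read
        self._text = []    # text fragments of that span
        self._depth = 0    # span nesting depth inside a word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # Nested span inside a word span: track depth only.
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if classes & WORD_CLASSES:
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if tag != "span" or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            text = "".join(self._text).strip()
            self.words.append((text, self._bbox))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = WordSpanParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Matching on a set of class names keeps the tolerated spellings in one place, so another variant could be added without touching the parsing logic.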

dcloud commented 10 years ago

Ah, I wasn't sure about that. Reopening. Do you know the difference (what the x means)?

jsfenfen commented 10 years ago

I dunno. I'm not sure whether it's intentional or a bug. But since I ran into this I've gotten more skeptical about how tight the spec is...

On Mon, Feb 24, 2014 at 12:22 PM, Daniel Cloud notifications@github.com wrote:

Ah, I wasn't sure about that. Reopening. Do you know the difference (what the x means)?


jsfenfen commented 10 years ago

Yeah, so fwiw ocrx_word might not be a formal part of the spec. This doc, https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0 (I'm not totally sure how authoritative it is), describes it as being part of the 'engine-specific markup'. Which gives me pause...
