petermr commented 4 years ago

PMR is currently debugging the PDF2SVG conversion. For a small number of papers, the alternatives are (a) to use an online service such as https://cloudconvert.com/pdf-to-svg and convert the files manually, or (b) to find another open-source package.

The cloudconvert output acts as a reference for the debugging.

Caveat: the SVG produced may need some transformations to normalize the coordinates (e.g. to screen units).
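As a rough illustration of what such a normalization involves, here is a minimal Python sketch that applies an SVG `matrix(a b c d e f)` transform to a single point. The function names `parse_matrix` and `apply_matrix` are hypothetical (they are not part of ami3 or cloudconvert), and the sketch assumes the transform is given in `matrix(...)` form only, not as `translate`, `scale` or `rotate`.

```python
import re


def parse_matrix(transform):
    """Parse an SVG 'matrix(a b c d e f)' string into six floats."""
    match = re.fullmatch(r"\s*matrix\(([^)]*)\)\s*", transform)
    if match is None:
        raise ValueError(f"not a matrix transform: {transform!r}")
    # SVG allows the numbers to be separated by whitespace and/or commas.
    values = [float(v) for v in re.split(r"[\s,]+", match.group(1).strip())]
    if len(values) != 6:
        raise ValueError("matrix() needs exactly six numbers")
    return tuple(values)


def apply_matrix(matrix, x, y):
    """Map (x, y) through the affine transform defined by SVG matrix(a b c d e f)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)
```

For example, `matrix(1 0 0 -1 0 792)` flips the y axis of a page 792 units high, so a point at y = 700 in PDF-style coordinates (origin at the bottom left) lands at y = 92 in screen-style coordinates (origin at the top left).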

cloudconvert

Creates codepoint-oriented output, but includes some transformation matrices. Not open source, so not a long-term solution. (We could write a compacting routine to make the SVG more tractable.)

pdf2svg

http://www.cityinthesky.co.uk/opensource/pdf2svg/ creates transformed paths but renders characters as paths, not codepoints. The paths could be a useful reference.

PDFBox-AMI

Still a little way to go. It is not picking up the graphics state at the right places. Might combine with pdf2svg (very messy).

petermr commented 4 years ago

"almost" correct character and path extraction

The latest commit has been used to create SVG files from lichtenburg19a (see https://github.com/petermr/ami3/tree/master/src/test/resources/org/contentmine/ami/pdf2svg2). For each page there are a PNG and an SVG. The SVG is "almost" correct in that it has all the characters and all the strokes, but occasionally it has failed to pick up the correct stroke or fill colour. If the colours are ignored then the data are good enough to extract the Caches from.
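One way to ignore the colours while testing extraction is to strip the colour attributes from the SVG before processing it. The following is only a sketch under assumptions: `strip_colours` is a hypothetical helper, not part of ami3, and it assumes colours are carried in plain `stroke` and `fill` attributes rather than inside a `style` attribute or a stylesheet.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
COLOUR_ATTRS = ("stroke", "fill")

# Keep the default SVG namespace on output instead of generated ns0: prefixes.
ET.register_namespace("", SVG_NS)


def strip_colours(svg_text):
    """Return the SVG with stroke/fill attributes removed from every element."""
    root = ET.fromstring(svg_text)
    for elem in root.iter():
        for attr in COLOUR_ATTRS:
            # pop() with a default does nothing if the attribute is absent.
            elem.attrib.pop(attr, None)
    return ET.tostring(root, encoding="unicode")
```

The geometry (`d`, `x`, `y`, `transform`) and the text content are left untouched, so character and path positions can still be compared against the PNG for the same page.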

I hope to fix the stroke/fill bug in the next 1-2 days, but data extraction could be tested now.

petermr commented 4 years ago

"Correct" character and path extraction

The latest commit of PDFBox-AMI can now extract all characters and paths "correctly" (i.e. not yet checked), with some known bugs:

bugs

rotated characters

Font size not set; transform matrix not correct; need to add the angle of rotation.
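For reference, the rotation angle and a scale factor can be recovered from the first two entries of a 2D affine matrix. This is a minimal Python sketch, not the PDFBox-AMI code: `rotation_and_scale` is a hypothetical name, and the sketch assumes the SVG convention `matrix(a b c d e f)` with no skew, so that `a = s*cos(theta)` and `b = s*sin(theta)`.

```python
import math


def rotation_and_scale(a, b):
    """Return (angle in degrees, scale) from the a and b entries of matrix(a b c d e f)."""
    # (a, b) is the image of the unit x vector, so its direction gives the
    # rotation and its length gives the scale along the text baseline.
    angle = math.degrees(math.atan2(b, a))
    scale = math.hypot(a, b)
    return angle, scale
```

With the y axis pointing down, as in SVG, a positive angle appears clockwise on screen. The scale would only correspond to a font size if the text is drawn at unit size before the matrix is applied, which is an assumption here.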

colours

Colours seem to lag behind or run ahead of the correct values; this seems to happen for both text and paths. UPDATE: nearly correct for paths; fill still needs fixing.

non-unicode characters

Need to add an "unknown glyph" for non-Unicode characters such as large Sigma, Pi, etc.
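A possible shape for this fallback, sketched in Python rather than in the project's own code: emit U+FFFD (the Unicode replacement character) whenever a glyph has no usable codepoint. The name `to_svg_text` is hypothetical, and treating the Basic Multilingual Plane Private Use Area (U+E000 to U+F8FF) as "unknown" is an assumption, not something stated in this issue.

```python
REPLACEMENT = "\uFFFD"  # Unicode REPLACEMENT CHARACTER


def to_svg_text(codepoint):
    """Return the character for a codepoint, or U+FFFD if no usable mapping exists."""
    if codepoint is None:
        # The font gave no Unicode mapping for this glyph.
        return REPLACEMENT
    if 0xE000 <= codepoint <= 0xF8FF:
        # Assumption: Private Use Area values carry no portable meaning.
        return REPLACEMENT
    return chr(codepoint)
```

A glyph that does map to Unicode, such as U+03A3 (Greek capital Sigma), passes through unchanged, while an unmapped large summation sign from a maths font becomes the replacement character and can be found later by searching the SVG for U+FFFD.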

petermr / ami3

Create correct SVG from ML papers #12
