Open petermr opened 4 years ago
The latest commit has been used to create SVG files from lichtenburg19a in https://github.com/petermr/ami3/tree/master/src/test/resources/org/contentmine/ami/pdf2svg2 For each page there is a PNG and an SVG. The SVG is "almost" correct: it has all the characters and all the strokes, but it occasionally fails to pick up the correct stroke or fill colour. If the colours are ignored, the data are good enough to extract the Caches from.
I hope to fix the stroke/fill bug in the next 1-2 days, but data extraction could be tested now.
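To test extraction while the colours are still unreliable, one option is to strip them before processing. The following is a minimal sketch, not part of ami3: the function name `strip_colours` is hypothetical, and it assumes colours are carried in plain `stroke` and `fill` attributes (colours inside a `style` attribute would need extra handling).

```python
import xml.etree.ElementTree as ET

# Keep the default SVG namespace on output instead of an "ns0:" prefix.
ET.register_namespace("", "http://www.w3.org/2000/svg")

# Assumption: colours appear only as these presentation attributes.
COLOUR_ATTRS = ("stroke", "fill")


def strip_colours(svg_text: str) -> str:
    """Return the SVG with stroke/fill attributes removed from every element."""
    root = ET.fromstring(svg_text)
    for elem in root.iter():
        for attr in COLOUR_ATTRS:
            # pop with a default so elements without the attribute are skipped
            elem.attrib.pop(attr, None)
    return ET.tostring(root, encoding="unicode")
```

Geometry (`d`, `x`, `y`) and text content are left untouched, so characters and strokes remain available for extraction.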
The latest commit of PDFBox-AMI can now extract all characters and paths "correctly" (i.e. I haven't checked), with some bugs:
Fontsize not set, transform matrix not correct, need to add angle of rotation.
Colours seem to lag behind or run ahead of the correct values; this seems to happen for both text and paths. UPDATE: nearly correct for paths, fill still needs fixing.
Need to add "unknown glyph" for non-Unicode chars such as large Sigma, Pi, etc.
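For the missing angle of rotation, the angle and scale can in principle be read off the linear part of the transform matrix. This is a sketch under stated assumptions, not the PDFBox-AMI code: `rotation_and_scale` is a hypothetical helper, the matrix is taken in the usual `[a b c d e f]` order where `x' = a*x + c*y + e` and `y' = b*x + d*y + f`, and skew is assumed to be absent.

```python
import math


def rotation_and_scale(a: float, b: float, c: float, d: float):
    """Decompose the linear part [a b c d] of a 2D transform.

    Returns (angle_degrees, scale_x, scale_y), assuming rotation plus
    scaling only (no skew).
    """
    # A pure rotation by theta has a = cos(theta), b = sin(theta).
    angle = math.degrees(math.atan2(b, a))
    # Lengths of the transformed unit vectors give the scale factors.
    scale_x = math.hypot(a, b)
    scale_y = math.hypot(c, d)
    return angle, scale_x, scale_y
```

Under the same assumptions, an effective font size could be estimated as the nominal size multiplied by `scale_y`. Because PDF has the y axis pointing up and SVG has it pointing down, the sign of the angle may need flipping after conversion.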
PMR is currently debugging the PDF2SVG conversion. The alternatives are (a) for a small number of papers, to use an online service such as https://cloudconvert.com/pdf-to-svg and convert the files manually, or (b) to find another open-source package.
The cloudconvert output acts as a reference for the debugging. Caveat: the SVG produced may need some transformations to normalize the coordinates (e.g. to screen units).
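The normalization mentioned in the caveat amounts to applying each element's transformation matrix to its coordinates. The sketch below is illustrative only: `parse_matrix` and `apply_matrix` are hypothetical names, and it handles only the SVG `matrix(a b c d e f)` form, not `translate`, `scale`, `rotate` or chained transforms.

```python
import re

_MATRIX_RE = re.compile(r"matrix\(([^)]*)\)")


def parse_matrix(transform: str):
    """Parse an SVG 'matrix(a b c d e f)' string into a 6-tuple of floats."""
    m = _MATRIX_RE.search(transform)
    if m is None:
        raise ValueError(f"no matrix() found in {transform!r}")
    # SVG allows whitespace and/or commas between the numbers.
    values = [float(v) for v in re.split(r"[\s,]+", m.group(1).strip())]
    if len(values) != 6:
        raise ValueError(f"expected 6 numbers, got {len(values)}")
    return tuple(values)


def apply_matrix(matrix, x: float, y: float):
    """Map (x, y) through the matrix: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)
```

For example, `matrix(1 0 0 -1 0 792)` flips the y axis about a page height of 792 units, so a point at `(100, 700)` maps to `(100, 92)`.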
cloudconvert
Creates codepoint-oriented output, but includes some transformation matrices. Not open source. Not a long-term solution. (Could write a compacting routine to make the SVG more tractable.)
pdf2svg
http://www.cityinthesky.co.uk/opensource/pdf2svg/ creates transformed paths, but outputs characters as paths, not codepoints. The paths could be a useful reference.
PDFBox-AMI
A little while to go. Not picking up the graphics state at the right places. Might combine with pdf2svg (very messy).