kermitt2 / pdfalto

PDF to XML ALTO file converter
GNU General Public License v2.0

Option for generating extracted svg graphics only #128

Open kermitt2 opened 3 years ago

kermitt2 commented 3 years ago

By default, pdfalto extracts both embedded bitmaps and vector graphics. The -noImage option disables extraction of both graphics types. However, we may still want the vector graphics extracted without the bitmap images: bitmap extraction can be time-consuming and is often not required by further processing (bitmap graphic objects with their coordinates are present in the ALTO file even when the bitmap itself is not extracted), whereas the SVG files are needed to further cluster the vector graphics.

Proposal:

    -noImage        -> unchanged, neither type of graphics is extracted
    -noBitmapImage  -> bitmap graphics are not extracted, but vector graphics still are
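
To make the intended semantics concrete, here is a minimal sketch of how the two flags could map to what gets extracted. It is not pdfalto's actual code: `GraphicsPlan` and `plan_graphics` are hypothetical names used only for illustration, and the rule that -noImage wins when both flags are given is an assumption, since the proposal does not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphicsPlan:
    """Hypothetical summary of which graphics get written to disk."""
    extract_bitmaps: bool  # embedded bitmap images
    extract_vectors: bool  # vector graphics, written as SVG files


def plan_graphics(no_image: bool = False, no_bitmap_image: bool = False) -> GraphicsPlan:
    """Map the two proposed flags to an extraction plan (illustrative only)."""
    if no_image:
        # -noImage: unchanged behaviour, no graphics of either type are extracted.
        # Assumption: this flag takes precedence if both flags are set.
        return GraphicsPlan(extract_bitmaps=False, extract_vectors=False)
    # -noBitmapImage: skip bitmaps only; SVG extraction stays enabled.
    return GraphicsPlan(extract_bitmaps=not no_bitmap_image, extract_vectors=True)
```

With neither flag set, the plan keeps the current default of extracting both types. In every case the bitmap objects and their coordinates would still be described in the ALTO output, as noted above.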