Closed mikegerber closed 2 years ago
Or the tool needs a new name? Let's discuss what makes sense, e.g. libraries for parsing vs. integration
Some initial ideas on what would be relevant in ALTO:
The other thing we should consider: ALTO and images (+ their metadata) concern document pages, not documents. So this is either
I am leaning towards treating this as a separate data source (indexed by document + page), as only this would allow us the most granular analysis down to page level (e.g. which pages are outliers concerning {insert feature here}), but again, let's discuss this further with the team.
I'm working on an "altotool", using the same techniques we use in modstool to stuff interesting stuff into a pandas DataFrame. This will be indexed by page, so any analysis to be done document-wise then needs to aggregate this in a meaningful way (e.g. merge processing info, build sums of line counts etc.)
For some of this stuff, some aggregating over the page needs to be done already, e.g. mean/median OCR confidence etc. (not that I think this info will be particularly useful, but we shall see).
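To make the page-to-document aggregation concrete, here is a minimal sketch and not the actual altotool code. The columns `document`, `page` and `wc_mean` and all values are invented for illustration; only `Layout_Page_TextLine-count` follows the naming that shows up later in this thread.

```python
import pandas as pd

# Hypothetical page-level data: one row per ALTO page.
# "document", "page" and "wc_mean" are assumed names for this sketch.
pages = pd.DataFrame(
    {
        "document": ["PPN1", "PPN1", "PPN1", "PPN2", "PPN2"],
        "page": [1, 2, 3, 1, 2],
        "Layout_Page_TextLine-count": [40, 38, 2, 25, 27],
        "wc_mean": [0.64, 0.70, 0.31, 0.82, 0.80],
    }
).set_index(["document", "page"])

# Aggregate to one row per document: sum the line counts,
# average the per-page confidence, and count the pages.
per_document = pages.groupby(level="document").agg(
    text_lines=("Layout_Page_TextLine-count", "sum"),
    wc_mean=("wc_mean", "mean"),
    pages=("wc_mean", "count"),
)
print(per_document)
```

Which aggregation is "meaningful" depends on the column: counts can be summed, while confidence values are probably better averaged or kept as a distribution.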
> * distribution of ALTO elements over a document

@cneud Please clarify this.
There are and have been many use cases for info extracted from ALTO which led me to work on https://github.com/cneud/alto-tools (feel free to reuse what can be).
> Please clarify this.

Think, e.g., of which pages within a document certain elements occur on, compared with the other pages.
> mean/median OCR confidence

When certain pages within a document are outliers wrt the confidence scores, this would be useful to identify and investigate, for example.
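One possible way to flag such outlier pages is a plain z-score per document. This is only a sketch: the function name `outlier_pages`, the threshold of 2.0 and the sample values are assumptions, not something the tool does.

```python
from statistics import mean, stdev


def outlier_pages(page_confidence, threshold=2.0):
    """Return pages whose confidence deviates from the document mean
    by more than `threshold` sample standard deviations."""
    values = list(page_confidence.values())
    if len(values) < 2:
        return []  # a standard deviation needs at least two pages
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all pages identical, nothing stands out
    return [
        page
        for page, value in page_confidence.items()
        if abs(value - mu) / sigma > threshold
    ]


# Hypothetical per-page mean word confidence for one document.
doc = {1: 0.64, 2: 0.66, 3: 0.63, 4: 0.65, 5: 0.21, 6: 0.64, 7: 0.66, 8: 0.65}
print(outlier_pages(doc))  # [5]
```

For short documents a z-score is not very robust, so a median-based measure might be the better choice in practice.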
And yeah it needs a name. While I am happy with the innovative name of "modstool", with ALTO functionality it's a bit different
A first version (branch `feat/alto`) extracts some of this information, e.g.:
Description_MeasurementUnit pixel
Description_OCRProcessing_ocrProcessingStep0_processingDateTime 2016-08-07
Description_OCRProcessing_ocrProcessingStep0_processingSoftware_softwareCreator ABBYY
Description_OCRProcessing_ocrProcessingStep0_processingSoftware_softwareName ABBYY FineReader Engine
Description_OCRProcessing_ocrProcessingStep0_processingSoftware_softwareVersion 11
Layout_Page_ID Page1
Layout_Page_PHYSICAL_IMG_NR 1
Layout_Page_HEIGHT 2436
Layout_Page_WIDTH 1404
Layout_Page_Page-count 1
Layout_Page_TopMargin-count 1
Layout_Page_LeftMargin-count 1
Layout_Page_RightMargin-count 1
Layout_Page_BottomMargin-count 1
Layout_Page_PrintSpace-count 1
Layout_Page_TextBlock-count 1
Layout_Page_Shape-count 1
Layout_Page_Polygon-count 1
Layout_Page_TextLine-count 40
Layout_Page_String-count 386
Layout_Page_SP-count 345
Layout_Page_HYP-count 8
alto_file alto/734008031/00000035.xml
Layout_Page_GraphicalElement-count NaN
Layout_Page_Illustration-count NaN
Layout_Page_ComposedBlock-count NaN
This includes some counts of elements (`*-count`) and also selected attribute values (e.g. `Layout_Page_HEIGHT`), more to come.
A bit of a stumbling block is the diversity of ALTO variants we have, so I am going to rework this not to use a fixed XML namespace.
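A sketch of what counting elements without a fixed namespace could look like, using only the standard library. The helpers `local_name` and `count_elements` and the tiny sample document are made up for illustration; the real implementation may work differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Minimal made-up ALTO-like page; "{ns}" is filled in below.
SAMPLE = (
    '<alto xmlns="{ns}"><Layout><Page><PrintSpace><TextBlock><TextLine>'
    '<String CONTENT="a"/><SP/><String CONTENT="b"/>'
    "</TextLine></TextBlock></PrintSpace></Page></Layout></alto>"
)


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def count_elements(xml_text):
    """Count elements by local name, whatever namespace they are in."""
    root = ET.fromstring(xml_text)
    return Counter(local_name(el.tag) for el in root.iter())


# Two of the namespaces that appear in the test data in this thread.
v2 = count_elements(SAMPLE.format(ns="http://www.loc.gov/standards/alto/ns-v2#"))
ccs = count_elements(SAMPLE.format(ns="http://schema.ccs-gmbh.com/ALTO"))
print(v2["String"], v2["SP"], v2 == ccs)
```

Comparing local names only means the same counting code can run over every ALTO variant, at the cost of also matching same-named elements from unrelated namespaces.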
From the first runs I estimate about 48h to run this over all of our (5 million?) ALTO files, which is fine with me.
Latest version in the feature branch now includes descriptive statistics on the word OCR confidence (`//alto:String/@WC` as an XPath expression):
Layout_Page_//alto:String/@WC-mean 0.639988
Layout_Page_//alto:String/@WC-median 0.6355
Layout_Page_//alto:String/@WC-std 0.137451
Layout_Page_//alto:String/@WC-min 0.22
Layout_Page_//alto:String/@WC-max 1
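For illustration, statistics like these can be computed with the standard library alone. This is a sketch, not the code in the branch: `wc_stats` is a hypothetical helper, and the `{*}` namespace wildcard needs Python 3.8 or newer. The two `WC` values are taken from the NER example further down.

```python
import statistics
import xml.etree.ElementTree as ET

SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hofes" WC="0.5019999743"/>
    <SP/>
    <String CONTENT="Pentling" WC="0.5337499976"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def wc_stats(xml_text):
    """Descriptive statistics over String/@WC for one page."""
    root = ET.fromstring(xml_text)
    # "{*}String" matches String in any (or no) namespace.
    values = [
        float(el.get("WC"))
        for el in root.findall(".//{*}String")
        if el.get("WC") is not None
    ]
    if not values:
        return {}
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # The sample standard deviation is undefined for a single value.
        "std": statistics.stdev(values) if len(values) > 1 else float("nan"),
        "min": min(values),
        "max": max(values),
    }


print(wc_stats(SAMPLE))
```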
Latest version now includes the column `alto_xmlns`, which is/translates to the ALTO version used.
Examples from my test data:
alto/PPN636777308/00000002.xml http://schema.ccs-gmbh.com/ALTO
alto/734008031/00000020.xml http://www.loc.gov/standards/alto/ns-v2#
alto/734008031/00000054.xml http://www.loc.gov/standards/alto/ns-v2#
alto/734008031/00000098.xml http://www.loc.gov/standards/alto/ns-v2#
alto/734008031/00000106.xml http://www.loc.gov/standards/alto/ns-v2#
...
alto/749782137/00000554.xml http://www.loc.gov/standards/alto/ns-v2#
alto/749782137/00000252.xml http://www.loc.gov/standards/alto/ns-v2#
alto/749782137/00000004.xml http://www.loc.gov/standards/alto/ns-v2#
alto/749782137/00000849.xml http://www.loc.gov/standards/alto/ns-v2#
alto/weird-ns/00000007.xml http://www.loc.gov/standards/alto/
Name: alto_xmlns, Length: 1314, dtype: object
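A value like this can be read off the root element's tag. The following is a minimal sketch with a hypothetical function (named after the column); how the tool actually derives the column may differ.

```python
import xml.etree.ElementTree as ET


def alto_xmlns(xml_text):
    """Return the namespace URI of the root element, or None if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{uri}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


print(alto_xmlns('<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"/>'))
print(alto_xmlns("<alto/>"))
```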
NER annotated ALTO at SBB looks like this:
There's an `alto:Tags` tag that contains the entities (`ns0` being ALTO here):
<ns0:Tags>
<ns0:NamedEntityTag ID="PER0" LABEL="Pentlings"/>
<ns0:NamedEntityTag ID="LOC1" LABEL="Pentling"/>
<ns0:NamedEntityTag ID="LOC2" LABEL="Hamm"/>
<ns0:NamedEntityTag ID="PER4" LABEL="Hofes Pentling"/>
<ns0:NamedEntityTag ID="LOC5" LABEL="Hofs Pentling"/>
<ns0:NamedEntityTag ID="LOC7" LABEL="Hilbeck"/>
<ns0:NamedEntityTag ID="PER8" LABEL="Hoff"/>
<ns0:NamedEntityTag ID="PER9" LABEL="L i b e r"/>
<ns0:NamedEntityTag ID="PER10" LABEL="Jhesu Christi"/>
</ns0:Tags>
`alto:String`s then reference these:
<ns0:String CONTENT="Hofes" HEIGHT="33" HPOS="914" TAGREFS="PER4" VPOS="1396" WC="0.5019999743" WIDTH="82"/>
<ns0:SP HPOS="997" VPOS="1398" WIDTH="21"/>
<ns0:String CONTENT="Pentling" HEIGHT="34" HPOS="1019" TAGREFS="PER4" VPOS="1398" WC="0.5337499976" WIDTH="129"/>
<ns0:SP HPOS="1149" VPOS="1407" WIDTH="19"/>
Latest master now counts the above NEs in `Tags_NamedEntityTag-count`.
We now count all Strings with TAGREFS in `Layout_Page_//alto:String[@TAGREFS]-count` (weird naming comes from the XPath expression used). Some tagged entities span multiple String elements, not sure if and what to do about that.
TAGREFS is also used in some ALTO files to reference `LayoutTag`s in `TextBlock` elements (not `String`s). So technically these counts could count reference tags that are not `NamedEntityTag`s.
However, I don't think it's currently worth the effort to check whether the `TAGREFS` actually reference NEs, so I would just leave it this way until we need this check. @labusch @cneud Opinions?
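Should the check become necessary, it could look roughly like this. This is a sketch under assumptions: `ne_reference_counts` is a hypothetical helper, the sample is cut down from the snippets above, and the `Nota` String referencing a `LayoutTag` is invented to show the difference. Counting distinct referenced IDs is also one possible answer to entities spanning several `String` elements.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Tags>
    <NamedEntityTag ID="PER4" LABEL="Hofes Pentling"/>
    <NamedEntityTag ID="LOC7" LABEL="Hilbeck"/>
    <LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="layouttag-marginalia"><TextLine>
      <String CONTENT="Hofes" TAGREFS="PER4"/>
      <SP/>
      <String CONTENT="Pentling" TAGREFS="PER4"/>
      <SP/>
      <String CONTENT="Hilbeck" TAGREFS="LOC7"/>
      <SP/>
      <String CONTENT="Nota" TAGREFS="layouttag-marginalia"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def ne_reference_counts(xml_text):
    """Count Strings with TAGREFS, those that reference a NamedEntityTag,
    and the number of distinct entities referenced."""
    root = ET.fromstring(xml_text)
    # "{*}" matches any namespace (Python 3.8+).
    ne_ids = {tag.get("ID") for tag in root.findall(".//{*}NamedEntityTag")}
    with_tagrefs = 0
    with_ne = 0
    referenced = set()
    for string in root.findall(".//{*}String"):
        # TAGREFS may hold several whitespace-separated IDs.
        refs = string.get("TAGREFS", "").split()
        if refs:
            with_tagrefs += 1
        hits = ne_ids.intersection(refs)
        if hits:
            with_ne += 1
            referenced |= hits
    return {
        "strings_with_tagrefs": with_tagrefs,
        "strings_referencing_ne": with_ne,
        "distinct_entities": len(referenced),
    }


print(ne_reference_counts(SAMPLE))
```

In this sample four Strings carry `TAGREFS`, three of them point at named entities, and those three cover two distinct entities because `PER4` spans two Strings.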
Language attributes are `LANG` and the deprecated `language`:
* https://www.loc.gov/standards/alto/v4/alto-4-3.xsd
* In my test data, only the deprecated `TextBlock/@language` attributes are used
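A quick way to see which of the two attributes a file uses is to tally both over all elements. This is a sketch: `language_values` and the sample values (`de`, `la`) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" language="de"><TextLine><String CONTENT="Hof"/></TextLine></TextBlock>
    <TextBlock ID="b2" language="de"><TextLine><String CONTENT="Liber" LANG="la"/></TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def language_values(xml_text):
    """Tally (attribute, value) pairs for LANG and the deprecated language."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        for attr in ("LANG", "language"):
            value = el.get(attr)
            if value is not None:
                counts[(attr, value)] += 1
    return counts


print(language_values(SAMPLE))
```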
Moved this to #18.
> `<LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>`

I've decided to ignore this for now.
Should this be in here? Or in "codename altotool"?
What info would be relevant? What would be metadata, what would be data (count words?)
[x] Include metadata from the `Description` section
[x] Include descriptive statistics for the `Layout` section etc.
[x] When that's done review the comments below for things we may have missed
[x] Test using all available versions of ALTO
[x] NER annotated ALTO should at least be identifiable
[x] Include ALTO version/namespace
[x] `<LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>`
[x] Any language infos?
[x] Update README that we now support ALTO