kba / transkribus-to-prima

Convert Transkribus PAGE-XML to standard PAGE-XML

Fix missing extensions #10

Closed bertsky closed 2 years ago

bertsky commented 2 years ago

Fixes #1, #2, #3 (but we could do more here), #4, #6 and #8.

Here's the current output of -h:

Usage: transkribus-fixer [OPTIONS] INPUT_FILE [OUTPUT_FILE]

  Transform (Transkribus PAGE) INPUT_FILE to (PRImA PAGE) OUTPUT_FILE under
  the chosen fixes.

Options:
  -f, --fixes [image_transform|metadata|reading_order|table|tag_property_link|textequiv|namespace]
                                  Fixes to apply. Repeatable [default: all].

                                  image_transform: Convert
                                  Page/@image(Rotation|Translation|Scaling) to
                                  Labels

                                  metadata: Remove any
                                  Metadata/TranskribusMetadata

                                  reading_order: Convert ???

                                  table: Convert each TableRegion/TableCell
                                  into a TableRegion/TextRegion, writing
                                  row/col index/span as new TableCellRole
                                  accordingly

                                  tag_property_link: Remove Tag, Property and
                                  Link elements wherever they appear

                                  textequiv: Convert any
                                  //TextEquiv/UnicodeAlternative into
                                  additional ../TextEquiv/Unicode

                                  namespace: Also convert PAGE namespace
                                  version from 2013 to 2019.

  -I, --prefer-imgurl             use TranskribusMetadata/@imgUrl for
                                  @imageFilename if available

  -V, --validate                  Validate output against schema.
  -h, --help                      Show this message and exit.
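
To illustrate what the two removal fixes (metadata and tag_property_link) amount to, here is a minimal sketch using only the standard library's ElementTree. It is not the code from this PR: the helper name `remove_elements` and the sample document are made up for illustration, and it assumes that Tag, Property and Link live in the PAGE 2013 namespace like the rest of the file.

```python
import xml.etree.ElementTree as ET

# PAGE 2013 namespace (assumed to be what the Transkribus export declares)
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def remove_elements(root, local_names):
    """Remove every descendant of root whose local name is listed (hypothetical helper)."""
    targets = {"{%s}%s" % (NS, name) for name in local_names}
    # Materialize both lists first so that removal does not disturb iteration
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in targets:
                parent.remove(child)


# Reduced, made-up sample; real exports carry many more attributes and regions
SAMPLE = f"""<PcGts xmlns="{NS}">
  <Metadata>
    <Creator>demo</Creator>
    <TranskribusMetadata docId="1" pageNr="1"/>
  </Metadata>
  <Page imageFilename="p1.jpg" imageWidth="10" imageHeight="10">
    <TextRegion id="r1">
      <Property key="k" value="v"/>
      <Coords points="0,0 1,0 1,1 0,1"/>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)
remove_elements(root, ["TranskribusMetadata"])      # roughly the metadata fix
remove_elements(root, ["Tag", "Property", "Link"])  # roughly the tag_property_link fix

# Local names of everything that is left, in document order
remaining = [el.tag.split("}")[1] for el in root.iter()]
print(remaining)
```

The other fixes (table, image_transform, textequiv) need to create new elements rather than just drop them, so they do not reduce to a helper like this.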

bertsky commented 2 years ago

Note: I also tried implementing the namespace fixer at the etree level instead of as string postprocessing, but lxml does not seem to provide an easy mechanism for changing namespaces. There are no namespace nodes or anything similar, and if you replace a node.tag, you automatically get a new nsprefix like ns0 in the nsmap, which you cannot control or override; etree.cleanup_namespaces only partially helps. So I gave up on that.
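
For reference, the string-postprocessing route can be sketched as a plain substring replacement on the serialized document. This is only an illustration under assumptions, not necessarily what the PR does: `convert_namespace` is a hypothetical name, the sample document is made up, and the two URIs are the published PAGE 2013-07-15 and 2019-07-15 namespaces.

```python
import xml.etree.ElementTree as ET

NS_2013 = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS_2019 = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def convert_namespace(xml_text: str) -> str:
    """Swap the PAGE namespace URI in serialized XML (hypothetical helper)."""
    # A substring replacement also rewrites xsi:schemaLocation, which repeats the URI
    return xml_text.replace(NS_2013, NS_2019)


# Made-up minimal document in the 2013 namespace
page_2013 = (
    f'<PcGts xmlns="{NS_2013}" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    f'xsi:schemaLocation="{NS_2013} {NS_2013}/pagecontent.xsd">'
    '<Page imageFilename="p1.jpg" imageWidth="10" imageHeight="10"/>'
    "</PcGts>"
)

page_2019 = convert_namespace(page_2013)

# Re-parse to confirm the result is still well-formed and uses the new namespace
root = ET.fromstring(page_2019)
print(root.tag)
```

Working on the string leaves prefixes and the default namespace declaration exactly as they were, which is the part that is hard to control at the etree level.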

kba commented 2 years ago

Arg, this was closed because I force-pushed. But I still have the original branch, so I can apply it later.