hzsk / HZSK-CLARIN-Services


TCF output of web services not recognised by WebLicht as TCF #9

Open berndmoos opened 5 years ago

berndmoos commented 5 years ago

... and this makes it impossible to really use the TCF services on converted ISO/TEI data.

I suspect the reason is the mime type. The metadata for isotei2tcf (https://corpora.uni-hamburg.de/hzsk/de/islandora/object/webservice:isotei2tcfconverter-0.9/datastream/CMDI) specifies the following as output:

application/xml;format-variant=weblicht-tcf

This is what we wanted, but didn't get (see issue #6). In https://github.com/hzsk/HZSK-CLARIN-Services/blob/2bc7e9ee2f4c5de79a8401a6f2f4cb76b4ee6839/src/main/java/de/uni_hamburg/converters/IsoTeiConverter.java#L287, @Produces is given as:

text/tcf+xml

I think this is what the metadata should use as mime type for the output. Can somebody change that?

Likewise, in...

https://corpora.uni-hamburg.de/hzsk/de/islandora/object/webservice:tcf2isoteiconverter-0.9/datastream/CMDI

... the input mime type should change.
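The mismatch described above can be made concrete with a small comparison. This is only a sketch: the two strings are the ones quoted in this issue, the helper names `parse_media_type` and `same_media_type` are made up for illustration, and strict equality of type and parameters is an assumption about how a consumer might compare the declared and the actual type, not a description of what WebLicht really does.

```python
def parse_media_type(value):
    """Split a media type string into (type/subtype, {parameter: value})."""
    parts = [p.strip() for p in value.split(";")]
    essence = parts[0].lower()  # type and subtype are case-insensitive
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, _, val = part.partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
    return essence, params


def same_media_type(declared, produced):
    """Assumed rule: type/subtype and all parameters must agree."""
    return parse_media_type(declared) == parse_media_type(produced)


# Values quoted in this issue
cmdi_output = "application/xml;format-variant=weblicht-tcf"  # CMDI metadata
produces = "text/tcf+xml"                                     # @Produces in the Java code

print(parse_media_type(cmdi_output))
print(same_media_type(cmdi_output, produces))  # False: metadata and service disagree
```

Under this assumption the two declarations never match, which would be consistent with the proposal to put text/tcf+xml into the metadata.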

flammie commented 5 years ago

Should be text/tcf+xml now.

berndmoos commented 5 years ago

Waiting for the change to take effect... Stay tuned.

berndmoos commented 5 years ago

The change does not seem to be recognised by WebLicht. The monitoring page says that "7 services were retained" at the last harvest (https://weblicht.sfs.uni-tuebingen.de/harvester/resources/report). I suspect some action has to be taken so that the services are updated instead of just retained.

berndmoos commented 5 years ago

Asked a question on the list...

berndmoos commented 5 years ago

The output mime type is changed now...

... and it is the same as for other services with TCF as an output...

... but WebLicht still does not offer other services with TCF as input.

berndmoos commented 5 years ago

I guess the TCF converter is somehow underspecified in the CMDI. We will maybe need to add lang etc., see http://weblicht.sfs.uni-tuebingen.de/comet/editor.jsp?id=1541449788338

berndmoos commented 5 years ago

Or better: http://weblicht.sfs.uni-tuebingen.de/comet/api/resources/cmd?id=1541450030475

flammie commented 5 years ago

The links have expired. I added the lang parameter de, but I didn't force re-indexing yet.

berndmoos commented 5 years ago

This one should be a model for specifying the output parameters:

http://weblicht.sfs.uni-tuebingen.de/fedora/objects/WLWS:3/datastreams/CMDI/content

flammie commented 5 years ago

Ok, I copy-pasted that for a test.

berndmoos commented 5 years ago

I think it would be more efficient if HZSK could test the changes directly. Here's a recipe for testing:

(0) Modify CMDI and wait until WebLicht has harvested it (should take around 2h according to Tübingen)
(1) Go to WebLicht at https://weblicht.sfs.uni-tuebingen.de/
(2) Start, login, start
(3) Choose "Upload a file" and pick an EXMARaLDA Basic Transcription (*.exb) - I use RudiVoellerWutausbruch.exb
(4) Pick the appropriate segmentation algorithm and language - in my case: "hiat" and "deutsch"
(5) Check "Show tools with status: development"
(6) Add service "IDS, HZSK: EXMARaLDA to ISO/TEI converter" to the chain
(7) Add service "IDS, HZSK: ISO/TEI to TCF" to the chain

What we want is that WebLicht then offers TCF-based services for the next step. Currently, no services are offered.

flammie commented 5 years ago

Excellent idea. I've played around a bit now. I think it might be the language thing, but I still can't get the languages to work: with other chains the boxes contain languages, but here it just goes from deutsch to nothing to unknown, even though I copied the input and output parameters. I will continue experimenting...

flammie commented 5 years ago

The language didn't fix it (alone), but adding version or "text" did.

berndmoos commented 5 years ago

Better, but not quite there yet. What WebLicht now offers is a bunch of tokenizers, although the TCF is already tokenized. We'll probably have to add "sentences" and "tokens" to the output as well...

flammie commented 5 years ago

Now the output is "text", "sentences" and "tokens", and IMS morphology works, at least for a trivial small file.
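The behaviour seen in the last few comments fits a simple model in which a chain builder offers a service only when the layers it needs are already declared as available. The sketch below is a toy illustration of that idea, not WebLicht's actual algorithm: the catalogue, the service names, their layer requirements and the rule that a service adding nothing new is not offered are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    requires: frozenset  # layers the service needs as input
    produces: frozenset  # layers the service adds to the output


def offered(services, available):
    """Names of services whose inputs are satisfied and that add a new layer."""
    return [
        s.name
        for s in services
        if s.requires <= available and not s.produces <= available
    ]


# Hypothetical catalogue, for illustration only
CATALOGUE = [
    Service("tokenizer", frozenset({"text"}), frozenset({"sentences", "tokens"})),
    Service("morphology", frozenset({"text", "tokens"}), frozenset({"morphology"})),
]

# No output layers declared in the metadata: nothing is offered
print(offered(CATALOGUE, frozenset()))

# Only "text" declared: only the tokenizer is offered
print(offered(CATALOGUE, frozenset({"text"})))

# "text", "sentences" and "tokens" declared: morphology becomes available
print(offered(CATALOGUE, frozenset({"text", "sentences", "tokens"})))
```

In this model, declaring too few output layers in the CMDI makes the converter look like it produces less than it really does, which matches the progression from no services to tokenizers only to morphology.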

berndmoos commented 5 years ago

It also works for my favourite test files, so I'd venture to say, this issue can be closed. However, there is a similar issue in the mirror operation, so I am opening a mirror issue: issue #10