proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Interoperability with Salt, Pepper and ANNIS #85

Open proycon opened 4 years ago

proycon commented 4 years ago

This issue came up in discussions with @luutuntin who was looking for a search and retrieval tool capable of handling FoLiA. There is some FoLiA support in both Blacklab and MTAS, but both may not sufficiently cover all of FoLiA's expressive abilities (tree handling in particular).

ANNIS is another well-developed and interesting solution, but right now there is no FoLiA support. ANNIS relies on a conversion tool called Pepper to support a great variety of input formats. Pepper in turn uses a low-level graph-based model called Salt as its intermediate model, which in turn can export to a variety of formats again (including ANNIS' format).

To enhance interoperability, it would be a good idea to implement conversion from FoLiA to the Salt model (and possibly vice versa, but with much lower priority).

To write such a converter we could:

1) Implement it as an extension to Pepper. However, Pepper and Salt are both Java-based, and we have no proper Java-based FoLiA library (I'm very reluctant to start one; we already have extensive libraries for Python, C++ and Rust).
2) Implement it as a standalone tool, possibly serialising to SaltXML. This lets us leverage an existing FoLiA library (although we lose the benefit of the Salt library) and keeps things a bit simpler.

Update: we are picking option 2

luutuntin commented 4 years ago

Information about SaltXML (XMI) format

Most relevant:

Useful:

proycon commented 4 years ago

Here's an additional salt example, converted from the TCF v0.4 example on their website: https://download.anaproy.nl/tcf04-karin-wl.salt

luutuntin commented 4 years ago

And this is an example of document-structure (vs corpus-structure), I suppose.

proycon commented 4 years ago

This comment tracks the current state of the folia2salt implementation in foliatools. Not all is a priority and some may not be implemented for the time being:

proycon commented 4 years ago

I think I have a decent converter implementation now. The big question is whether my resulting Salt XML is actually valid and can be parsed by Pepper. Testing that will be the next step (Pepper seems to have a Salt Validator, so that should help). In order to do that, though, I can't get around writing an sCorpusGraph to a file called saltProject.salt .

The next step after that is to see if Pepper's ANNIS conversion is actually usable (or other conversions for that matter). I'll leave that part up to @luutuntin if you don't mind :)

I'm certainly not expecting lossless conversion from this to all of the output formats Pepper supports. That is hard to achieve through an intermediate format without knowing the specifics of the input and output formats.

proycon commented 4 years ago

Well, now that the conversion is done I'm trying to get things to validate and process with Pepper, and hopefully fix anything I got wrong in my converter. This proves to be much more difficult than I had anticipated, as I can't even get Pepper to import Salt XML properly. I'm a bit stuck at this point.

I tried building a conversion/validation workflow with three steps, a SaltXML importer, a SaltValidator and a DoNothingExporter. It doesn't look like any documents get processed (it says 0 of 4, how it gets the number '4' is a mystery to me as there is only one document in my test corpus).

--------------------------- pepper job status ---------------------------
id:                     'la7st384
active documents:       0 of 4
status:                 initializing
- no documents found to display progress -
-------------------------------------------------------------------------

+----------------------------------- step 1 -----------------------------------+
|importer:      SaltXMLImporter                                                |
|path:          file:/home/proycon/exp/pepper/saltin/                          |
|corpus index:  0                                                              |
|properties:                                                                   |
|               pepper.after.reportCorpusGraph:false                                 |
|               pepper.after.tokenize:   false                                 |
|                                                                              |
+----------------------------------- step 2 -----------------------------------+
|manipulator:   SaltValidator                                                  |
|path:          null                                                           |
|properties:                                                                   |
|               pepper.after.reportCorpusGraph:false                                 |
|               pepper.after.tokenize:   false                                 |
|                                                                              |
+----------------------------------- step 3 -----------------------------------+
|exporter:      DoNothingExporter                                              |
|path:          file:/home/proycon/exp/pepper/saltout/                         |
|properties:                                                                   |
|               pepper.after.reportCorpusGraph:false                                 |
|               pepper.after.tokenize:   false                                 |
|                                                                              |
+------------------------------------------------------------------------------+

--------------------------- pepper job status ---------------------------
id:                     'la7st384
active documents:       0 of 4
status:                 ended
- no documents found to display progress -
-------------------------------------------------------------------------

Unfortunately, there's not really any validation information to go by yet, so I set out to test a similar Pepper pipeline by reimporting Salt XML that Pepper itself had produced (converted from a TCF source). I get almost exactly the same output (0 of 4 documents)...

My initial test corpus (one document) outputted by the new converter: https://download.anaproy.nl/foliasalt.tar.gz

proycon commented 4 years ago

^--- I cross-posted the issue, with some further context, to the pepper issue tracker.

ghost commented 4 years ago

I just looked into https://download.anaproy.nl/foliasalt.tar.gz and found that saltProject.salt is a sDocumentStructure instead of sCorpusStructure:

<?xml version='1.0' encoding='utf-8'?>
<saltCommon:SaltProject xmlns:sDocumentStructure="sDocumentStructure" xmlns:xmi="http://www.omg.org/XMI" xmlns:saltCore="saltCore" xmlns:saltCommon="saltCommon" xmlns:sCorpusStructure="sCorpusStructure" xmi:version="2.0">

proycon commented 4 years ago

It just declares the sDocumentStructure namespace with an identical prefix (which indeed isn't really used in this context, but its presence should be irrelevant); the root tag itself is in the saltCommon namespace. The way Salt uses XML namespaces is a bit weird though: they only pertain to some elements and they are not proper URIs. There is no default XML namespace set in any of the examples.
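To make the namespace point concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree on the root element quoted above. The only change is that the element is self-closed so the snippet parses on its own; that is an assumption for illustration, not how the real file looks.

```python
import xml.etree.ElementTree as ET

# Root element as quoted above, self-closed here so that it parses standalone.
ROOT = (
    b"<?xml version='1.0' encoding='utf-8'?>\n"
    b'<saltCommon:SaltProject xmlns:sDocumentStructure="sDocumentStructure" '
    b'xmlns:xmi="http://www.omg.org/XMI" xmlns:saltCore="saltCore" '
    b'xmlns:saltCommon="saltCommon" xmlns:sCorpusStructure="sCorpusStructure" '
    b'xmi:version="2.0"/>'
)

root = ET.fromstring(ROOT)

# ElementTree reports qualified names as "{namespace}localname".
namespace, _, localname = root.tag[1:].partition("}")
print(namespace, localname)

# Declaring the sDocumentStructure prefix does not put the root in that namespace.
print(root.attrib)
```

The root resolves to the saltCommon namespace regardless of which other prefixes are declared alongside it, which is why the unused sDocumentStructure declaration should not matter to a namespace-aware parser.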

ghost commented 4 years ago

Thank you.

luutuntin commented 4 years ago

When I compare the foliasalt corpus and other examples, foliasalt doesn't have xmlns:xsi, and therefore uses xmi:type, instead of xsi:type. Does this difference matter?

proycon commented 4 years ago

Good point! That's definitely a mistake on my part. These are precisely the things I'd hope a good validator would catch. I'll fix it.

I don't think it's the root cause of the pepper issue because that one also fails if I try to reimport the TCF->Salt corpus.
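Until a validator catches this, a small check along the following lines could flag elements that carry xmi:type where xsi:type is expected. This is only a sketch: the sample fragment is hypothetical and not taken from the real corpus, and the element and type names in it are illustrative assumptions. The two namespace URIs are the standard XMI and XML Schema instance ones.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/XMI"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def find_xmi_typed(xml_bytes):
    """Return local names of elements that have xmi:type but no xsi:type."""
    root = ET.fromstring(xml_bytes)
    offenders = []
    for element in root.iter():
        has_xmi = f"{{{XMI_NS}}}type" in element.attrib
        has_xsi = f"{{{XSI_NS}}}type" in element.attrib
        if has_xmi and not has_xsi:
            # Strip a "{namespace}" prefix if the element has one.
            offenders.append(element.tag.rpartition("}")[2])
    return offenders


# Hypothetical fragment for illustration only.
SAMPLE = b"""<sDocumentStructure:SDocumentGraph
    xmlns:sDocumentStructure="sDocumentStructure"
    xmlns:xmi="http://www.omg.org/XMI"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmi:version="2.0">
  <nodes xsi:type="sDocumentStructure:STextualDS"/>
  <nodes xmi:type="sDocumentStructure:SToken"/>
</sDocumentStructure:SDocumentGraph>"""

print(find_xmi_typed(SAMPLE))
```

Only the second nodes element is reported, since the first already uses xsi:type.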

proycon commented 4 years ago

OK, that did help! We have some progress! The original error is gone (for now) and I now get a Java traceback, so it's definitely trying to parse more. The feedback isn't very verbose, unfortunately, so it'll be a bit tricky to pinpoint exactly where the culprit is.

full stack trace:
org.corpus_tools.pepper.modules.exceptions.PepperModuleException: Failed to import corpus by module. Nested exception was:
        at org.corpus_tools.pepper.core.PepperJobImpl.importCorpusStructures(PepperJobImpl.java:594)
        at org.corpus_tools.pepper.core.PepperJobImpl.convert(PepperJobImpl.java:930)
        at org.corpus_tools.pepper.cli.PepperStarter.convert(PepperStarter.java:534)
        at org.corpus_tools.pepper.cli.PepperStarter.main(PepperStarter.java:1437)
Caused by: org.corpus_tools.salt.exceptions.SaltResourceException: Cannot find a target node '//@nodes.1' for relation.
        at org.corpus_tools.salt.util.internal.persistence.SaltXML10Handler.startElement(SaltXML10Handler.java:247)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:510)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanStartElement(XMLDocumentFragmentScannerImpl.java:1397)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2710)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
        at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
        at org.corpus_tools.salt.util.SaltUtil.loadObjects(SaltUtil.java:483)
        at org.corpus_tools.salt.util.SaltUtil.load(SaltUtil.java:434)
        at org.corpus_tools.salt.util.SaltUtil.loadCorpusGraph(SaltUtil.java:720)
        at org.corpus_tools.salt.util.SaltUtil.loadCorpusGraph(SaltUtil.java:687)
        at org.corpus_tools.salt.common.impl.SCorpusGraphImpl.load(SCorpusGraphImpl.java:372)
        at org.corpus_tools.pepper.modules.coreModules.SaltXMLImporter.importCorpusStructure(SaltXMLImporter.java:106)
        at org.corpus_tools.pepper.core.ModuleControllerImpl$1.run(ModuleControllerImpl.java:245)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

proycon commented 4 years ago

Ok, further progress, the one above is solved too. I missed a few xsi:type attributes. As long as I get parsing tracebacks now I can hopefully pinpoint and fix it.
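For errors like "Cannot find a target node '//@nodes.1'", one possible pre-check is to confirm that every positional reference points at a node that exists. The sketch below assumes a layout with child elements named nodes and edges, and source/target attributes of the form //@nodes.N; only the reference form itself is confirmed by the error message above, the rest is an assumption. It would not have caught the missing xsi:type attributes that were the actual cause here; it only rules out dangling indexes.

```python
import re
import xml.etree.ElementTree as ET

# Positional reference form seen in the error message: //@nodes.<index>
NODE_REF = re.compile(r"^//@nodes\.(\d+)$")


def dangling_references(xml_bytes):
    """Return (edge position, attribute, reference) for unresolvable references."""
    root = ET.fromstring(xml_bytes)
    node_count = len(root.findall("nodes"))
    problems = []
    for position, edge in enumerate(root.findall("edges")):
        for attribute in ("source", "target"):
            reference = edge.get(attribute, "")
            match = NODE_REF.match(reference)
            if match is None or int(match.group(1)) >= node_count:
                problems.append((position, attribute, reference))
    return problems


# Hypothetical fragment: one node, but the edge targets a second one.
SAMPLE = b"""<sDocumentStructure:SDocumentGraph
    xmlns:sDocumentStructure="sDocumentStructure"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <nodes xsi:type="sDocumentStructure:SToken"/>
  <edges xsi:type="sDocumentStructure:STextualRelation"
         source="//@nodes.0" target="//@nodes.1"/>
</sDocumentStructure:SDocumentGraph>"""

print(dangling_references(SAMPLE))
```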

luutuntin commented 4 years ago

Great. We are moving steadily.

proycon commented 4 years ago

I solved a few parser errors and now I'm back at the same '0 of 4' situation we started with. But at least now I can be assured that it did some parsing (even though the output doesn't really show that). I'll try some conversion (e.g. annis) to see how that looks.

PS: I updated https://download.anaproy.nl/foliasalt.tar.gz with the new results.

proycon commented 4 years ago

The annis conversion seems to work although pepper does raise one exception for which the cause is unclear to me:

Exception in thread "pool-5-thread-1" org.corpus_tools.salt.exceptions.SaltException: An exception occured while traversing the graph 'salt:/foliacorpus/example.deep' with path 'null'. because of null.
        at org.corpus_tools.salt.core.impl.GraphTraverserModule$Traverser.run(GraphTraverserModule.java:486)
        at org.corpus_tools.salt.core.impl.GraphTraverserModule.traverse(GraphTraverserModule.java:173)
        at org.corpus_tools.salt.core.impl.SGraphImpl.traverse(SGraphImpl.java:241)
        at org.corpus_tools.salt.core.impl.SGraphImpl.traverse(SGraphImpl.java:232)
        at org.corpus_tools.peppermodules.annis.SSpanningRelation2ANNISMapper.run(SSpanningRelation2ANNISMapper.java:82)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NullPointerException
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.writeNodeTabEntry(SRelation2ANNISMapper.java:742)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.mapSNode(SRelation2ANNISMapper.java:715)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.mapSNode(SRelation2ANNISMapper.java:496)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.nodeReached(SRelation2ANNISMapper.java:310)
        at org.corpus_tools.peppermodules.annis.SSpanningRelation2ANNISMapper.nodeReached(SSpanningRelation2ANNISMapper.java:162)
        at org.corpus_tools.salt.core.impl.GraphTraverserModule$Traverser.run(GraphTraverserModule.java:391)
        ... 7 more

I tried some other conversions too:

Unfortunately the error messages are often too cryptic and make no clear reference to the actual salt input that failed. The Salt validator in Pepper also didn't lead to any output, so I assume it considers everything okay.

proycon commented 4 years ago

A first version of folia2salt is now released as part of foliatools v2.3.0. It should still be considered highly experimental, though.

parkervg commented 3 years ago

Hi,

I took the example foliasalt document (https://download.anaproy.nl/foliasalt.tar.gz) and ran it through a simple Pepper workflow file that converts SaltXML to ANNIS, salt_to_annis.pepper.zip.

Running that, we still get the same ominous '0 of 4' message, in addition to some other error logs:

Exception in thread "pool-4-thread-1" org.corpus_tools.salt.exceptions.SaltException: An exception occured while traversing the graph 'salt:/foliacorpus/example.deep' with path 'null'. because of null.
        at org.corpus_tools.salt.core.impl.GraphTraverserModule$Traverser.run(GraphTraverserModule.java:486)
        at org.corpus_tools.salt.core.impl.GraphTraverserModule.traverse(GraphTraverserModule.java:173)
        at org.corpus_tools.salt.core.impl.SGraphImpl.traverse(SGraphImpl.java:241)
        at org.corpus_tools.salt.core.impl.SGraphImpl.traverse(SGraphImpl.java:232)
        at org.corpus_tools.peppermodules.annis.SSpanningRelation2ANNISMapper.run(SSpanningRelation2ANNISMapper.java:82)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.NullPointerException
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.writeNodeTabEntry(SRelation2ANNISMapper.java:742)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.mapSNode(SRelation2ANNISMapper.java:715)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.mapSNode(SRelation2ANNISMapper.java:496)
        at org.corpus_tools.peppermodules.annis.SRelation2ANNISMapper.nodeReached(SRelation2ANNISMapper.java:310)
        at org.corpus_tools.peppermodules.annis.SSpanningRelation2ANNISMapper.nodeReached(SSpanningRelation2ANNISMapper.java:162)
        at org.corpus_tools.salt.core.impl.GraphTraverserModule$Traverser.run(GraphTraverserModule.java:391)
        ... 7 more
replaced invalid ANNIS identifier FoLiA::pos::https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn with FoLiA%3A%3Apos%3A%3Ahttps%3A%2F%2Fraw%2Egithubusercontent%2Ecom%2Fproycon%2Ffolia%2Fmaster%2Fsetdefinitions%2Ffrog-mbpos-cgn
replaced invalid ANNIS identifier FoLiA::token::https://raw.githubusercontent.com/LanguageMachines/uctodata/folia1.4/setdefinitions/tokconfig-nld.foliaset.ttl with FoLiA%3A%3Atoken%3A%3Ahttps%3A%2F%2Fraw%2Egithubusercontent%2Ecom%2FLanguageMachines%2Fuctodata%2Ffolia1%2E4%2Fsetdefinitions%2Ftokconfig-nld%2Efoliaset%2Ettl
replaced invalid ANNIS identifier feature/head with feature%2Fhead
replaced invalid ANNIS identifier feature/spectype with feature%2Fspectype
replaced invalid ANNIS identifier FoLiA::lemma::https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mblem-nl with FoLiA%3A%3Alemma%3A%3Ahttps%3A%2F%2Fraw%2Egithubusercontent%2Ecom%2Fproycon%2Ffolia%2Fmaster%2Fsetdefinitions%2Ffrog-mblem-nl
replaced invalid ANNIS identifier feature/pvtijd with feature%2Fpvtijd
replaced invalid ANNIS identifier feature/pvagr with feature%2Fpvagr
replaced invalid ANNIS identifier feature/wvorm with feature%2Fwvorm
replaced invalid ANNIS identifier feature/vztype with feature%2Fvztype
replaced invalid ANNIS identifier feature/npagr with feature%2Fnpagr
replaced invalid ANNIS identifier feature/lwtype with feature%2Flwtype
replaced invalid ANNIS identifier feature/naamval with feature%2Fnaamval
replaced invalid ANNIS identifier feature/positie with feature%2Fpositie
replaced invalid ANNIS identifier feature/numtype with feature%2Fnumtype
replaced invalid ANNIS identifier feature/conjtype with feature%2Fconjtype
replaced invalid ANNIS identifier feature/getal with feature%2Fgetal
replaced invalid ANNIS identifier feature/genus with feature%2Fgenus
replaced invalid ANNIS identifier feature/ntype with feature%2Fntype
replaced invalid ANNIS identifier feature/graad with feature%2Fgraad
replaced invalid ANNIS identifier feature/buiging with feature%2Fbuiging
replaced invalid ANNIS identifier feature/vwtype with feature%2Fvwtype
replaced invalid ANNIS identifier feature/persoon with feature%2Fpersoon
replaced invalid ANNIS identifier feature/pdtype with feature%2Fpdtype
replaced invalid ANNIS identifier feature/status with feature%2Fstatus
replaced invalid ANNIS identifier feature/getal-n with feature%2Fgetal-n

We have an Annis deployment up at https://annis.ling.brandeis.edu/annis-gui/ that you can check out; the Annis corpus from the resulting Pepper conversion is loaded up as foliacorpus. Despite it being loaded up without an error, you can see that the node annotations are malformed:

[Screenshot: Screen Shot 2021-01-11 at 1 27 10 PM]

I'll try doing some digging into why this happens, but wanted to add a log here with all these resources in case you have any ideas for fixing this behavior.

proycon commented 3 years ago

Thanks for the feedback. As you can see, the conversion is still very experimental. I encode things pretty verbosely in Salt, and use its namespace functionality extensively to encode both the annotation types and the FoLiA sets, as I want to preserve as much data as possible. Salt is not very prescriptive, so I took some liberties without knowing exactly how they would translate in further conversion steps. There is even some duplication in the data: you might already have enough information if you just look at the 'simplified' annotations that are in the salt namespace with names such as pos and lemma, which more or less summarize some of the more complex fields. You can even instruct folia2salt to output only these simplified annotations and omit all others (but you will lose information, and there may be clashes if there are, for example, multiple pos tags in multiple sets).
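To illustrate the kind of clash I mean, here is a rough sketch. It assumes identifiers of the shape FoLiA::type::set, as they appear in the Pepper log above; the second pos set URL is hypothetical, and the functions are illustrative only and not part of folia2salt.

```python
from collections import defaultdict


def simplify(qualified_name):
    """Reduce 'FoLiA::pos::<set URL>' to the bare annotation type 'pos'."""
    parts = qualified_name.split("::", 2)
    if len(parts) == 3 and parts[0] == "FoLiA":
        return parts[1]
    return qualified_name


def clashes(qualified_names):
    """Group names by simplified form and keep groups with more than one member."""
    groups = defaultdict(list)
    for name in qualified_names:
        groups[simplify(name)].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


NAMES = [
    "FoLiA::pos::https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn",
    "FoLiA::pos::https://example.org/another-pos-set",  # hypothetical second set
    "FoLiA::lemma::https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mblem-nl",
]

print(clashes(NAMES))
```

Both pos annotations collapse onto the same simplified name, so only one of them could survive in a simplified-only export.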

So I wouldn't say the annotations are 'malformed', but some of these identifiers don't translate well to ANNIS and it seems Pepper URL-encoded them. I agree they're not very interpretable for end users this way. Does the simplified annotation show up in the interface too? (I admit I know virtually nothing about ANNIS itself.)
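Judging purely from the log lines above, the replacement looks like percent-encoding of everything except ASCII letters, digits and hyphens. The sketch below reproduces those log lines under that assumption; which characters Pepper's ANNIS module really treats as valid is not something I have verified, and encode_identifier is a hypothetical helper. It also shows that the original identifier can be recovered with a standard URL decode.

```python
from urllib.parse import unquote


def encode_identifier(identifier):
    """Percent-encode everything except ASCII letters, digits and '-'.

    The set of characters left untouched is inferred from the log output,
    not taken from Pepper's source.
    """
    encoded = []
    for char in identifier:
        if char.isascii() and (char.isalnum() or char == "-"):
            encoded.append(char)
        else:
            encoded.extend(f"%{byte:02X}" for byte in char.encode("utf-8"))
    return "".join(encoded)


POS = (
    "FoLiA::pos::https://raw.githubusercontent.com/proycon/folia/"
    "master/setdefinitions/frog-mbpos-cgn"
)

print(encode_identifier("feature/head"))
print(encode_identifier(POS))
print(unquote(encode_identifier(POS)) == POS)
```

Since the encoding is reversible, the full type and set information is still present in the ANNIS data even though it is hard to read in the interface.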