Unable to import corpus in relANNIS format

lddubeau commented 8 years ago

Versions

Atomic 0.2.1 (this is the latest stable release at time of writing)

$ java -version
openjdk version "1.8.0_72-internal"
OpenJDK Runtime Environment (build 1.8.0_72-internal-b05)
OpenJDK 64-Bit Server VM (build 25.72-b05, mixed mode)

This is running on Debian testing.

Steps to reproduce

Download the GUM corpus in relANNIS format. Unzip it somewhere.
Unzip the Atomic package, cd into it.
./atomic
Select the menu File > Import Corpus.
A dialog will come up. In the pane that presents import methods select Corpus Import that appears under Atomic. Click Next.
Double-click RelANNISImporter among the choices of Pepper modules.
Double-click relANNIS version 3.1 among the formats, which is the only choice.
Set "Target path is a" to Directory, and select the directory of the GUM corpus you unzipped in the first step. (That's the directory that contains the .tab files.) Click Next.
Click Next again without setting a property.
Give a unique project name. For instance, test1. Click Finish.

Expected results

Ideally, I'd expect the corpus to be imported.

If somehow I'm doing something I should not, then I'd expect Atomic to guide me towards using it properly, and I'd expect some informative error message to be provided by the GUI.

Actual results

There is a "Progress Information" dialog that comes up but never reaches completion. The corpus is not imported. The GUI does not help towards a resolution.

The console shows:

 [INFO ] 2015-12-09 10:45:21.884 [main] de.uni_jena.iaa.linktype.atomic.core.utils.AtomicProjectUtils: Added Atomic Project Nature to IProject.
Exception in thread "PepperModuleController[RelANNISImporter]" de.hu_berlin.german.korpling.saltnpepper.pepperModules.relannis.exceptions.RelANNISModuleException: Cannot load corpus structure. Nested Exception is: Content is not allowed in prolog..
        at de.hu_berlin.german.korpling.saltnpepper.pepperModules.relannis.RelANNISImporter.importCorpusStructure(RelANNISImporter.java:88)
        at de.hu_berlin.german.korpling.saltnpepper.pepper.pepperFW.impl.PepperModuleControllerImpl.realImportCorpusStructure(PepperModuleControllerImpl.java:572)
        at de.hu_berlin.german.korpling.saltnpepper.pepper.pepperFW.impl.PepperModuleControllerImpl.run(PepperModuleControllerImpl.java:427)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.emf.ecore.resource.Resource$IOWrappedException: Content is not allowed in prolog.
        at org.eclipse.emf.ecore.xmi.impl.XMLLoadImpl.load(XMLLoadImpl.java:195)
        at org.eclipse.emf.ecore.xmi.impl.XMLResourceImpl.doLoad(XMLResourceImpl.java:240)
        at org.eclipse.emf.ecore.resource.impl.ResourceImpl.load(ResourceImpl.java:1505)
        at org.eclipse.emf.ecore.resource.impl.ResourceImpl.load(ResourceImpl.java:1284)
        at de.hu_berlin.german.korpling.saltnpepper.pepperModules.relannis.RelANNISImporter.importCorpusStructure(RelANNISImporter.java:85)
        ... 3 more
Caused by: org.xml.sax.SAXParseExceptionpublicId: file:/home/ldd/src/corpus-tools/annis/annis-kickstarter-3.3.6/GUM_relANNIS/corpus.tab; systemId: file:/home/ldd/src/corpus-tools/annis/annis-kickstarter-3.3.6/GUM_relANNIS/corpus.tab; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
        at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1437)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:999)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
        at org.eclipse.emf.ecore.xmi.impl.XMLLoadImpl.load(XMLLoadImpl.java:175)
        ... 7 more
[INFO ] 2015-12-09 10:50:15.638 [main] de.uni_jena.iaa.linktype.atomic.core.utils.AtomicProjectUtils: Added Atomic Project Nature to IProject.
Exception in thread "PepperModuleController[RelANNISImporter]" de.hu_berlin.german.korpling.saltnpepper.pepperModules.relannis.exceptions.RelANNISModuleException: Some error occurs while traversing corpus structure graph. Nested Exception is: An exception occured while traversing graph 'null' .
        at de.hu_berlin.german.korpling.saltnpepper.pepperModules.relannis.RelANNIS2SaltMapper.mapRACorpusGraph2SCorpusGraph(RelANNIS2SaltMapper.java:166)
        at de.hu_berlin.german.korpling.saltnpepper.pepperModules.relannis.RelANNISImporter.importCorpusStructure(RelANNISImporter.java:93)
        at de.hu_berlin.german.korpling.saltnpepper.pepper.pepperFW.impl.PepperModuleControllerImpl.realImportCorpusStructure(PepperModuleControllerImpl.java:572)
        at de.hu_berlin.german.korpling.saltnpepper.pepper.pepperFW.impl.PepperModuleControllerImpl.run(PepperModuleControllerImpl.java:427)
        at java.lang.Thread.run(Thread.java:745)
Caused by: de.hu_berlin.german.korpling.saltnpepper.salt.graph.modules.exceptions.GraphModuleException: An exception occured while traversing graph 'null' 
        at de.hu_berlin.german.korpling.saltnpepper.salt.graph.modules.GraphTraverserObject.run(GraphTraverserObject.java:266)
        ... 1 more
Caused by: de.hu_berlin.german.korpling.saltnpepper.salt.graph.exceptions.GraphInsertException: Cannot add the given label 'de.hu_berlin.german.korpling.saltnpepper.salt.saltCore.impl.SMetaAnnotationImpl@aee9dded (namespace: null, name: version, value: 1.0.1)' object to LabelableElement 'salt:/GUM', because a label with this QName already exists: version
        at de.hu_berlin.german.korpling.saltnpepper.salt.graph.impl.LabelableElementImpl.addLabel(LabelableElementImpl.java:116)
        at de.hu_berlin.german.korpling.saltnpepper.salt.graph.impl.IdentifiableElementImpl.addLabel(IdentifiableElementImpl.java:140)
        at de.hu_berlin.german.korpling.saltnpepper.salt.saltCore.accessors.SMetaAnnotatableElementAccessor.addSMetaAnnotation(SMetaAnnotatableElementAccessor.java:38)
        at de.hu_berlin.german.korpling.saltnpepper.salt.saltCore.impl.SNodeImpl.addSMetaAnnotation(SNodeImpl.java:368)
        at de.hu_berlin.german.korpling.saltnpepper.pepperModules.relannis.RelANNIS2SaltMapper.mapRACorpus2SMetaAnnotation(RelANNIS2SaltMapper.java:225)
        at de.hu_berlin.german.korpling.saltnpepper.pepperModules.relannis.RelANNIS2SaltMapper.mapRACorpus2SCorpusSDocument(RelANNIS2SaltMapper.java:203)
        at de.hu_berlin.german.korpling.saltnpepper.pepperModules.relannis.RelANNIS2SaltMapper.nodeReached(RelANNIS2SaltMapper.java:578)
        at de.hu_berlin.german.korpling.saltnpepper.salt.graph.modules.GraphTraverserObject.depthFirstRec(GraphTraverserObject.java:292)
        at de.hu_berlin.german.korpling.saltnpepper.salt.graph.modules.GraphTraverserObject.run(GraphTraverserObject.java:259)
        ... 1 more
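The second trace ends in a GraphInsertException because a label named version already exists on salt:/GUM. One way to check whether the corpus metadata itself lists the same name twice is sketched below in Python. It assumes that corpus_annotation.tab has four tab-separated columns (corpus reference, namespace, name, value); that layout is an assumption, not something confirmed in this report, and the function name is hypothetical. The namespace column is deliberately ignored, on the guess that the importer drops it (the message shows "namespace: null").

```python
import csv
from collections import Counter


def find_duplicate_annotation_names(lines):
    """Return sorted (corpus_ref, name) pairs that occur more than once.

    ASSUMPTION: each row has four tab-separated columns:
    corpus_ref, namespace, name, value.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        corpus_ref, _namespace, name, _value = row[:4]
        counts[(corpus_ref, name)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical rows for illustration only, not copied from GUM.
sample = [
    "0\tNULL\tversion\t1.0.1\n",
    "0\tNULL\tversion\t2\n",
    "0\tNULL\tlang\ten\n",
]
print(find_duplicate_annotation_names(sample))
```

To run it on a real file, pass an open file object instead of the sample list, for example open("corpus_annotation.tab", encoding="utf-8", newline="").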

Observations

I have no experience with ANNIS or Atomic, or with the relANNIS format. I did read the documentation here. Importing the GUM corpus into ANNIS worked correctly on the first try. I pretty much used the method I described above, but with the modifications appropriate for ANNIS: I unzipped GUM somewhere, then I pointed ANNIS to the directory that contains all the .tab files, clicked the button to import and it worked! Then I tried the equivalent in Atomic and got nowhere fast. I've tried a bunch of variations, just in case:

Import corpus.tab instead of the directory. (As I said, I'm not familiar with the format, so I don't know if this makes sense at all. I've worked with other corpus formats that split a corpus among many files but have a file that serves as the "main" file of the corpus. A file named corpus.tab seemed a good candidate for this, so I tried.)
Import the zip file that contains the corpus instead of the directory. (Again, in other contexts, and other formats, this made sense.)
Just to see what would happen, I tried to load a CoNLL corpus (generated from the MASC 3.0 corpus using ANCTool 3.0.2).

Nothing above worked and the error messages left on the console were generally not helpful.
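For what it is worth, the first trace shows an XML parser reading corpus.tab and failing at line 1, column 1 with "Content is not allowed in prolog". That is the kind of error an XML parser gives when its input does not begin with XML markup at all. The sketch below reproduces the same class of failure in Python. It uses Python's own parser rather than the Xerces parser from the trace, so the message wording differs, and the sample row is hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_parse_error_position(text):
    """Return (line, column) of the first XML parse error, or None if text parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position
    return None


# Hypothetical first row of a tab-separated table, not copied from GUM.
tab_row = "0\tGUM\tCORPUS\n"
print(xml_parse_error_position(tab_row))
```

The parser rejects the input at the start of the first line, which suggests the failure is about the file not being XML rather than about anything further down in its content.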

Surely there are things I tried that would appear obviously wrong from the point of view of someone who already knows how to load a relANNIS corpus into Atomic. I'd expect though that anything obviously wrong would be prevented by the GUI. For instance, if it is obviously wrong to try to load a relANNIS corpus as a single file, the option to load as a single file should just not be present on the screen. Or if only .tab files can be loaded, then I should not be able to select a .zip file.

Finally, I tried loading GUM in the PAULA format, and this worked. That only adds to my puzzlement, because I unzipped the zip file and pointed Atomic to the top directory it created, which is essentially what I did with relANNIS. So I don't know why this worked.

FlorianZipser commented 8 years ago

Thank you for your very detailed report on that.

Looking at the exception stack trace, my guess is that it could be an encoding problem. The following line looks a bit like that:

Caused by: org.xml.sax.SAXParseExceptionpublicId: file:/home/ldd/src/corpus-tools/annis/annis-kickstarter-3.3.6/GUM_relANNIS/corpus.tab; systemId: file:/home/ldd/src/corpus-tools/annis/annis-kickstarter-3.3.6/GUM_relANNIS/corpus.tab; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

But nevertheless, you got us. The RelANNISImporter has been marked as deprecated and probably does not work with the relANNIS version of the GUM corpus. We are sorry about that. It will be removed entirely in the next versions. I would suggest using the PAULAImporter, but you found that solution on your own ;-). The CoNLL format is a bit tricky, since there are a lot of CoNLL formats out there, which differ in the number of columns. Unfortunately, that means it depends on the version of the CoNLL format whether Atomic (or Pepper) can deal with it. Could you send your CoNLL corpus to us at the following address:

saltnpepper@lists.hu-berlin.de

Thank you and best regards Florian

sdruskat commented 8 years ago

Let me add that the Atomic 1.0 release candidate will include only those Pepper modules that are also available in standalone Pepper, which means that deprecated modules won't be included. Also, it will include Pepper 3.x and Salt 3.x (currently Atomic uses 1.8) as well as provide bugfixes for showstoppers like #54.

lddubeau commented 8 years ago

On Wed, 2015-12-09 at 11:28 -0800, Florian Zipser wrote:

Looking at the exception stack trace, my guess is that it could be an encoding problem.

If you mean encoding like character encoding, I'm not seeing it:

$ file *
component.tab:         ASCII text
corpus_annotation.tab: HTML document, ASCII text, with very long lines
corpus.tab:            ASCII text
edge_annotation.tab:   ASCII text
example_queries.tab:   ASCII text
node_annotation.tab:   UTF-8 Unicode text
node.tab:              UTF-8 Unicode text
rank.tab:              ASCII text
resolver_vis_map.tab:  ASCII text
text.tab:              UTF-8 Unicode text, with very long lines

Those files that contain UTF-8 do not seem to start with anything unusual:

$ file * | grep UTF-8 | cut -d: -f1 | xargs -t -n1 od -N 8 -t x1
od -N 8 -t x1 node_annotation.tab 
0000000 30 09 47 55 4d 09 63 6c
0000010
od -N 8 -t x1 node.tab 
0000000 30 09 30 09 39 09 74 6f
0000010
od -N 8 -t x1 text.tab 
0000000 30 09 47 55 4d 2e 47 55
0000010

The sequence 30 09 is the character 0 followed by a tab. If the files started with a byte order mark, I'd say that could be the problem but I don't see anything amiss there.
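The same byte order mark check can be written as a short Python sketch, in case it is useful for testing other corpora. The helper names are hypothetical; the bytes in the example are the first eight bytes of node_annotation.tab as shown by od above.

```python
import codecs


def starts_with_utf8_bom(first_bytes: bytes) -> bool:
    """True if the given leading bytes begin with the UTF-8 byte order mark (EF BB BF)."""
    return first_bytes.startswith(codecs.BOM_UTF8)


def file_has_utf8_bom(path: str) -> bool:
    """Read only the first three bytes of a file and test them for a UTF-8 BOM."""
    with open(path, "rb") as handle:
        return starts_with_utf8_bom(handle.read(3))


# First eight bytes of node_annotation.tab, taken from the od output above.
node_annotation_head = bytes.fromhex("30 09 47 55 4d 09 63 6c")
print(starts_with_utf8_bom(node_annotation_head))
```

This only rules out a UTF-8 byte order mark. It says nothing about other encoding problems further into the files.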

But nevertheless, you got us. The RelANNISImporter has been marked as deprecated and probably does not work with the relANNIS version of the GUM corpus. We are sorry about that. It will be removed entirely in the next versions.

Good to know.

I would suggest using the PAULAImporter, but you found that solution on your own ;-).

Indeed!

The CoNLL format is a bit tricky, since there are a lot of CoNLL formats out there, which differ in the number of columns. Unfortunately, that means it depends on the version of the CoNLL format whether Atomic (or Pepper) can deal with it. Could you send your CoNLL corpus to us at the following address: saltnpepper@lists.hu-berlin.de

I'll pass on CoNLL because I don't actually plan to use this format, so I have no incentive to get it working and would not be of much help in reaching a resolution here. For all I know, maybe ANCTool 3.0.2 has a bug. It was more of a "let's see if this works" move on my part than anything else.

If my plans change, I'll be sure to send you the data.

Now, the format I am really interested in, beyond all those I've already mentioned, is GrAF. I see Pepper has an importer for it, but I did not see it in Atomic, which is perhaps due to Atomic using an older version of Pepper? (By the way, this is why I was trying to load plugins and filed this, now closed, issue report: https://github.com/infraling/atomic/issues/52)

Thanks, Louis
