Open lddubeau opened 8 years ago
Thank you for your very detailed report on that.
Looking at the exception stack trace, my guess is that it could be an encoding problem. The following line looks a bit like that:
Caused by: org.xml.sax.SAXParseExceptionpublicId: file:/home/ldd/src/corpus-tools/annis/annis-kickstarter-3.3.6/GUM_relANNIS/corpus.tab; systemId: file:/home/ldd/src/corpus-tools/annis/annis-kickstarter-3.3.6/GUM_relANNIS/corpus.tab; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
But nevertheless you got us. The RelANNISImporter was set as deprecated and probably does not work with the relANNIS version of the GUM corpus. We are sorry about that. In the next versions it will be removed entirely. I would propose to use the PAULAImporter, but you found that solution on your own ;-). Concerning the CoNLL format, this is a bit tricky, since there are a lot of CoNLL formats out there, which differ in the number of columns. That means, unfortunately, that it depends on the version of the CoNLL format whether Atomic (or Pepper) can deal with it. Could you send your CoNLL corpus to us at the following address:
saltnpepper@lists.hu-berlin.de
Thank you and best regards Florian
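Since the CoNLL variants differ in the number of columns, one quick way to see what a given file looks like is to count the columns on each token line. The following is only a minimal sketch, assuming tab-separated columns, blank lines between sentences and '#' comment lines (as in CoNLL-U); the function name conll_column_counts is made up for illustration and is not part of Pepper or Atomic.

```shell
# Print each distinct column count found in a tab-separated file,
# followed by the number of lines that have that many columns.
# Assumption: blank lines separate sentences and lines starting
# with '#' are comments; both are skipped.
conll_column_counts() {
    awk -F '\t' '
        /^[ \t]*$/ { next }          # skip blank separator lines
        /^#/       { next }          # skip comment lines
        { counts[NF]++ }             # tally lines per column count
        END { for (n in counts) print n, counts[n] }
    ' "$1" | sort -n
}

# Usage (hypothetical file name):
#   conll_column_counts my_corpus.conll
```

A file that follows one variant consistently prints a single line; a CoNLL-U file, for instance, should report 10 columns. Several lines of output suggest mixed or malformed rows.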
Let me add that the Atomic 1.0 release candidate will include only those Pepper modules that are also available in standalone Pepper, which means that deprecated modules won't be included. Also, it will include Pepper 3.x and Salt 3.x (currently Atomic uses 1.8) as well as provide bugfixes for showstoppers like #54.
On Wed, 2015-12-09 at 11:28 -0800, Florian Zipser wrote:
Looking at the exception stack trace, my guess is that it could be an encoding problem.
If you mean encoding like character encoding, I'm not seeing it:
$ file *
component.tab: ASCII text
corpus_annotation.tab: HTML document, ASCII text, with very long lines
corpus.tab: ASCII text
edge_annotation.tab: ASCII text
example_queries.tab: ASCII text
node_annotation.tab: UTF-8 Unicode text
node.tab: UTF-8 Unicode text
rank.tab: ASCII text
resolver_vis_map.tab: ASCII text
text.tab: UTF-8 Unicode text, with very long lines
Those files that contain UTF-8 do not seem to start with anything unusual:
$ file * | grep UTF-8 | cut -d: -f1 | xargs -t -n1 od -N 8 -t x1
od -N 8 -t x1 node_annotation.tab
0000000 30 09 47 55 4d 09 63 6c
0000010
od -N 8 -t x1 node.tab
0000000 30 09 30 09 39 09 74 6f
0000010
od -N 8 -t x1 text.tab
0000000 30 09 47 55 4d 2e 47 55
0000010
The sequence 30 09 is the character 0 followed by a tab. If the files started with a byte order mark, I'd say that could be the problem but I don't see anything amiss there.
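The manual od check above can be wrapped into a small helper that tests every .tab file for a UTF-8 byte order mark (bytes EF BB BF), which is one common cause of "Content is not allowed in prolog" when a file is fed to an XML parser. This is a rough sketch only; the function names has_utf8_bom and list_bom_files are invented for this example.

```shell
# Succeed (exit status 0) if the file starts with the UTF-8 byte
# order mark EF BB BF, fail otherwise.
has_utf8_bom() {
    first3=$(head -c 3 "$1" | od -An -t x1 | tr -d ' \n')
    [ "$first3" = "efbbbf" ]
}

# Print the path of every .tab file in the given directory that
# starts with a UTF-8 BOM. Prints nothing if none do.
list_bom_files() {
    for f in "$1"/*.tab; do
        [ -f "$f" ] || continue      # no match: glob stays literal
        if has_utf8_bom "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage (directory name as in this thread):
#   list_bom_files GUM_relANNIS
```

With the files shown above, this would be expected to print nothing, since each of them starts with 30 09 rather than EF BB BF.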
But nevertheless you got us. The RelANNISImporter was set as deprecated and probably does not work with the relANNIS version of the GUM corpus. We are sorry about that. In the next versions it will be removed entirely.
Good to know.
I would propose to use the PAULAImporter, but you found that solution on your own ;-).
Indeed!
Concerning the CoNLL format, this is a bit tricky, since there are a lot of CoNLL formats out there, which differ in the number of columns. That means, unfortunately, that it depends on the version of the CoNLL format whether Atomic (or Pepper) can deal with it. Could you send your CoNLL corpus to us at the following address: saltnpepper@lists.hu-berlin.de
I'll pass on CoNLL because I don't actually plan to use this format, so I have no impetus to get it working and would not be particularly helpful in getting to a resolution here. For all I know, maybe ANCTool 3.0.2 has a bug. It was more of a "let's try to see if this works" move on my part than anything else.
If my plans change, I'll be sure to send you the data.
Now, the format I am really interested in, beyond all those I've already mentioned, is GrAF. I see Pepper has an importer for it but I did not see it in Atomic, which is perhaps due to Atomic using an older version of Pepper? (By the way, this is why I was trying to load plugins and filed this now-closed issue report: https://github.com/infraling/atomic/issues/52)
Thanks, Louis
Versions
Atomic 0.2.1 (this is the latest stable release at time of writing)
This is running on Debian testing.
Steps to reproduce
1. Unzip the GUM corpus (relANNIS version) somewhere.
2. Start ./atomic.
3. Select File > Import Corpus.
4. Select Corpus Import, which appears under Atomic. Click Next.
5. Select RelANNISImporter among the choices of Pepper modules.
6. Select relANNIS version 3.1 among the formats, which is the only choice.
7. Set "Target path is a" to Directory, and select the directory for the GUM corpus you've unzipped somewhere in the first step. (That's the directory that contains the .tab files.) Click Next.
8. Click Next again without setting a property.
9. Enter test1. Click Finish.
Expected results
Ideally, I'd expect the corpus to be imported.
If somehow I'm doing something I should not, then I'd expect Atomic to guide me towards using it properly, and I'd expect some informative error message to be provided by the GUI.
Actual results
There is a "Progress Information" dialog that comes up but never reaches completion. The corpus is not imported. The GUI does not help towards a resolution.
The console shows:
Observations
I have no experience with ANNIS or Atomic, or with the relANNIS format. I did read the documentation here. Importing the GUM corpus into ANNIS worked correctly on the first try. I pretty much used the method I described above, but with the modifications appropriate for ANNIS: I unzipped GUM somewhere, then I pointed ANNIS to the directory that contains all the .tab files, clicked the button to import and it worked! Then I tried the equivalent in Atomic and got nowhere fast. I've tried a bunch of variations, just in case, including selecting corpus.tab instead of the directory. (As I said, I'm not familiar with the format, so I don't know if this makes sense at all. I've worked with other corpus formats that split a corpus among many files but have a file that serves as the "main" file of the corpus. A file named corpus.tab seemed a good candidate for this, so I tried.)
Nothing above worked and the error messages left on the console were generally not helpful.
Surely there are things I tried that would appear obviously wrong from the point of view of someone who already knows how to load a relANNIS corpus into Atomic. I'd expect though that anything obviously wrong would be prevented by the GUI. For instance, if it is obviously wrong to try to load a relANNIS corpus as a single file, the option to load as a single file should just not be present on the screen. Or if only .tab files can be loaded, then I should not be able to select a .zip file.
Finally, I tried loading GUM in the PAULA format, and this worked, but it only adds to my puzzlement because I unzipped the zip file and pointed Atomic to the top directory that was created from unzipping the zip, which is essentially what I did with relANNIS. So I don't know why this worked.
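To illustrate the kind of up-front check and informative error message asked for above, here is a rough sketch of a path validation done before any import starts. It is not Atomic's or Pepper's actual behaviour: the function name check_relannis_dir is invented, and the list of expected files is simply copied from the file listing earlier in this thread, so treating all of them as required is an assumption.

```shell
# File names taken from the `file *` listing of the GUM relANNIS
# directory in this thread. Assumption: all of them are required.
EXPECTED_TAB_FILES="component.tab corpus.tab corpus_annotation.tab edge_annotation.tab example_queries.tab node.tab node_annotation.tab rank.tab resolver_vis_map.tab text.tab"

# Check that the given path is a directory containing the expected
# .tab files. Prints a specific error on stderr and returns 1 if not.
check_relannis_dir() {
    if [ ! -d "$1" ]; then
        echo "error: '$1' is not a directory; select the directory that contains the .tab files" >&2
        return 1
    fi
    missing=""
    for name in $EXPECTED_TAB_FILES; do
        [ -f "$1/$name" ] || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "error: '$1' is missing:$missing" >&2
        return 1
    fi
    echo "ok: '$1' looks like a relANNIS directory"
}

# Usage (directory name as in this thread):
#   check_relannis_dir GUM_relANNIS
```

Pointing this at corpus.tab or at a .zip file fails immediately with a message that says what to select instead, rather than leaving a progress dialog that never completes.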