uwnlp / neuralccg

Codebase for Global Neural CCG Parsing with Optimality Guarantees
Apache License 2.0

instructions for ccgbank data gathered from disc? #3

Open johnvblazic opened 7 years ago

johnvblazic commented 7 years ago

Hi,

Our university has the disc copies of the CCG bank, and I don't have access to the online version of the data. I pulled the data from the call signature that appears in the link, and what I've gathered appears to be in the same format as the sample provided in the link. I can't tell from the code or the README what the directory structure of "ccgbank_1_1" should be. So far, I've tried putting the "data" directory that I found in the ccgbank download in that directory, and I have also tried putting the AUTO/HTML/LEX/PARG/RAW directories directly in the ccgbank_1_1 directory.

Any guidance you could provide would be extremely helpful.

I'm consistently getting the following error:

12:54:10 | ERROR | c.g.k.p.core.Stage | Job failed.
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967) ~[na:1.8.0_121]
    at edu.uw.easysrl.corpora.CCGBankDependencies.getDependencyParseCCGBank(CCGBankDependencies.java:386) ~[EasySRL-d69cb6e7d99595372df8dda65b7e975b21f18c37.jar:na]
    at edu.uw.easysrl.corpora.CCGBankDependencies.getDependencyParses(CCGBankDependencies.java:364) ~[EasySRL-d69cb6e7d99595372df8dda65b7e975b21f18c37.jar:na]
    at edu.uw.easysrl.corpora.CCGBankDependencies.loadCorpus(CCGBankDependencies.java:349) ~[EasySRL-d69cb6e7d99595372df8dda65b7e975b21f18c37.jar:na]
    at edu.uw.neuralccg.task.CCGBankReaderTask.parseStream(CCGBankReaderTask.java:19) ~[classes/:na]
    at edu.uw.neuralccg.task.CCGBankReaderTask.run(CCGBankReaderTask.java:34) ~[classes/:na]
    at com.github.kentonl.pipegraph.core.Stage.run(Stage.java:195) ~[pipegraph-bb781b4c3496e98c337a030d98b81f31490ab0f4.jar:na]
    at com.github.kentonl.pipegraph.runner.AsynchronousPipegraphRunner.run(AsynchronousPipegraphRunner.java:43) [pipegraph-bb781b4c3496e98c337a030d98b81f31490ab0f4.jar:na]
    at com.github.kentonl.pipegraph.runner.AsynchronousPipegraphRunner.lambda$null$1(AsynchronousPipegraphRunner.java:61) [pipegraph-bb781b4c3496e98c337a030d98b81f31490ab0f4.jar:na]
    at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_121]

johnvblazic commented 7 years ago

I've checked the pipegraph logs, the pipegraph code, the EasySRL code, and the .conf files for the neuralccg project, but I can't find any reference to the file path that it is failing on, other than /data/ccgbank_1_1 in the .conf file.

kentonl commented 7 years ago

The files should be set up such that the famous Pierre Vinken example (first sentence of the dev set) can be found via this path: neuralccg/data/ccgbank_1_1/data/AUTO/00/wsj_0001.auto
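A quick way to check the layout is sketched below. This is only an illustration, not part of neuralccg: the class name CcgbankLayoutCheck and its methods are made up for this example, and the default argument assumes the program is run from the neuralccg directory. It simply tests whether wsj_0001.auto sits at the relative path given above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of neuralccg or EasySRL.
public class CcgbankLayoutCheck {
    // Location of the first sentence file relative to the ccgbank_1_1 directory,
    // taken from the path quoted in this thread.
    static final Path FIRST_AUTO_FILE = Paths.get("data", "AUTO", "00", "wsj_0001.auto");

    /** Returns where wsj_0001.auto is expected under the given ccgbank_1_1 directory. */
    static Path expectedAutoFile(Path ccgbankRoot) {
        return ccgbankRoot.resolve(FIRST_AUTO_FILE);
    }

    /** True if the expected file exists and is a regular file. */
    static boolean layoutLooksRight(Path ccgbankRoot) {
        return Files.isRegularFile(expectedAutoFile(ccgbankRoot));
    }

    public static void main(String[] args) {
        // Assumption: run from the neuralccg directory unless a path is passed in.
        Path root = Paths.get(args.length > 0 ? args[0] : "data/ccgbank_1_1");
        Path expected = expectedAutoFile(root);
        String status = layoutLooksRight(root) ? "found:   " : "missing: ";
        System.out.println(status + expected.toAbsolutePath());
    }
}
```

If this prints "missing", the directories copied from the disc are probably nested one level too deep or too shallow. If it prints "found", the layout matches and the problem is more likely in the file contents.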

Does this match something you've tried?

johnvblazic commented 7 years ago

Yeah, that was where I started, and I've been trying permutations since. The demo works just fine; I'm currently trying to get the training module running with the following command:

./run.sh experiments/train.conf train 8080

johnvblazic commented 7 years ago

Is there any way I can find the file path it is failing on?

kentonl commented 7 years ago

It looks like the failure is happening here: https://github.com/kentonl/EasySRL/blob/maven/src/edu/uw/easysrl/corpora/CCGBankDependencies.java#L386

There is likely a mismatch between the content of the online version and the disc version of CCGBank. You can debug and/or apply temporary fixes by cloning the maven branch of https://github.com/kentonl/EasySRL. After running mvn install with local edits, neuralccg should use the updated code.
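One way to find the offending file while debugging is to run the loading step one file at a time and record which files throw. The sketch below shows that pattern only. FindFailingFiles and the placeholder parser are hypothetical and do not call any EasySRL API; in a local edit, the placeholder would be replaced by whatever per-file call is being investigated.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical debugging helper, not part of neuralccg or EasySRL.
public class FindFailingFiles {
    /**
     * Runs parseOneFile on every regular file under root (in sorted order)
     * and returns the files for which it threw a RuntimeException.
     */
    static List<Path> failingFiles(Path root, Consumer<Path> parseOneFile) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<Path> failed = new ArrayList<>();
        for (Path file : files) {
            try {
                parseOneFile.accept(file);
            } catch (RuntimeException e) {
                // StringIndexOutOfBoundsException is a RuntimeException, so it lands here.
                System.err.println("FAILED on " + file + ": " + e);
                failed.add(file);
            }
        }
        return failed;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: run from the neuralccg directory unless a path is passed in.
        Path root = Paths.get(args.length > 0 ? args[0] : "data/ccgbank_1_1");
        // Placeholder parser (hypothetical): swap in the real per-file loading call.
        // It mimics a substring call that throws when an expected delimiter is missing.
        Consumer<Path> placeholder = file -> {
            String name = file.getFileName().toString();
            name.substring(0, name.indexOf('.')); // throws if the name has no '.'
        };
        System.out.println(failingFiles(root, placeholder).size() + " file(s) failed");
    }
}
```

Collecting every failure instead of stopping at the first one makes it easier to see whether the disc version differs in a single file or systematically.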