microth / PathLSTM

Neural SRL model

ArrayIndexOutOfBoundsException with parse_fn.sh #22

Closed beneyal closed 5 years ago

beneyal commented 7 years ago

I got the error above in two cases: when there are empty lines in the input file (so I got rid of them), and again immediately after getting "ERROR: sentence length mismatches token number in Stanford annotation". Maybe it has something to do with one of the words in that sentence being "voila" with an accented letter "a".

Is there a flag I can pass so that the pipeline will silently ignore such errors? On the same note, I have 23M sentences to label. Do you think it's better to split them into N files and run N parse_fn.sh processes, or should I stick with my current single file of 23M sentences?

Thanks!

microth commented 7 years ago

> I got the error above in two cases: when there are empty lines in the input file (so I got rid of them), and again immediately after getting "ERROR: sentence length mismatches token number in Stanford annotation". Maybe it has something to do with one of the words in that sentence being "voila" with an accented letter "a".

I have not been able to replicate the latter problem. Can you give the full sentence?
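For the first case reported above (empty lines in the input file), the blank lines can be stripped before the file is passed to the pipeline. This is only a minimal sketch: the file names are hypothetical, and it assumes one sentence per line.

```shell
# Scratch directory with a toy input file (hypothetical names and content).
workdir=$(mktemp -d)
printf 'First sentence .\n\nSecond sentence .\n   \nThird sentence .\n' > "$workdir/input.txt"

# Keep only lines that contain at least one non-whitespace character.
grep -v '^[[:space:]]*$' "$workdir/input.txt" > "$workdir/input.clean.txt"
```

The cleaned file, not the original, would then be given to the parser.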

> Is there a flag I can pass so that the pipeline will silently ignore such errors?

There is no flag, but all errors are written to STDERR, so you can suppress them by redirecting STDERR to /dev/null, like this: "sh ... 2> /dev/null".
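To illustrate the redirection, here is a small sketch. The function run_pipeline is a hypothetical stand-in for the real command, since the actual arguments are elided above. It also shows a variant that keeps the errors in a log file instead of discarding them, which may help when tracking down problem sentences later.

```shell
# Hypothetical stand-in for the real pipeline command: it writes one
# result line to stdout and one error line to stderr.
run_pipeline() {
    echo "parsed sentence"
    echo "ERROR: sentence length mismatches token number in Stanford annotation" >&2
}

workdir=$(mktemp -d)

# Discard all error messages, as suggested above.
run_pipeline > "$workdir/out_quiet.txt" 2> /dev/null

# Alternative: keep stdout and stderr in separate files so the errors
# can be inspected afterwards.
run_pipeline > "$workdir/out.txt" 2> "$workdir/errors.log"
```

In both cases the normal output is unaffected; only the destination of the error messages changes.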

> On the same note, I have 23M sentences to label. Do you think it's better to split them into N files and run N parse_fn.sh processes, or should I stick with my current single file of 23M sentences?

Personally, I would split the file into 23 (or more) smaller files. Otherwise, a single process might run for about a month (a rough guess).
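The splitting approach could be sketched roughly as follows. The toy input, the chunk size, and the run_parser function are all assumptions for illustration; run_parser stands in for the real per-file call to parse_fn.sh, whose arguments are not shown in this thread.

```shell
workdir=$(mktemp -d)
input="$workdir/sentences.txt"

# Toy input standing in for the large file: 10 lines, one sentence each.
seq 1 10 | sed 's/^/sentence /' > "$input"

# Split into chunks of 4 lines each: chunk_aa, chunk_ab, chunk_ac.
# For 23M sentences, something like -l 1000000 would give 23 chunks.
split -l 4 "$input" "$workdir/chunk_"

# Hypothetical stand-in for the per-file parser call; it just upper-cases
# its input so that the sketch runs without the real pipeline.
run_parser() {
    tr 'a-z' 'A-Z' < "$1" > "$1.out"
}

# Start one background process per chunk, each with its own error log,
# then wait for all of them to finish.
for chunk in "$workdir"/chunk_??; do
    run_parser "$chunk" 2> "$chunk.err" &
done
wait

# Concatenate the per-chunk outputs in chunk order.
cat "$workdir"/chunk_??.out > "$workdir/all.out"
```

How many processes can usefully run at once depends on the machine. It seems likely that each process loads its own copy of the models, so available memory may be the limiting factor rather than the number of cores.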

Cheers, Michael