Closed by beneyal 5 years ago
I got the error above in two cases: when there are empty lines in the input file (so I got rid of them), and again immediately after getting "ERROR: sentence length mismatches token number in Stanford annotation". Maybe it has something to do with one of the words in that sentence being "voila" with an accented letter "a".
I have not been able to replicate the latter problem. Can you give the full sentence?
Is there a flag I can pass so that the pipeline will silently ignore such errors?
There is no flag, but all errors are written to STDERR, so you can ignore them by redirecting to /dev/null, like this: "sh ... 2> /dev/null".
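A minimal sketch of the redirection described above. The function `noisy_step` is a hypothetical stand-in for the real pipeline command (whose arguments are elided in this thread); it only illustrates that `2> /dev/null` drops STDERR while STDOUT still reaches the output file. Redirecting STDERR to a log file instead keeps the messages for later inspection.

```shell
#!/bin/sh
# Hypothetical stand-in for the real pipeline command: writes one result
# line to STDOUT and one error line to STDERR.
noisy_step() {
    echo "parsed sentence"
    echo "ERROR: sentence length mismatches token number in Stanford annotation" >&2
}

out_dir=$(mktemp -d)   # scratch directory for this demo

# Discard errors entirely; only STDOUT ends up in results.txt.
noisy_step > "$out_dir/results.txt" 2> /dev/null

# Alternative: keep the errors in a log file so that the failing
# sentences can be looked at afterwards.
noisy_step > "$out_dir/results_logged.txt" 2> "$out_dir/errors.log"
```

With 23M sentences, the log-file variant may be preferable, since it still lets you count how many sentences were skipped.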
On the same note, I have 23M sentences to label. Do you think it's better to split them into N files and run N processes of parse_fn.sh, or should I stick with my current single file of 23M sentences?
Personally, I would split the file into 23 (or more) smaller files. Otherwise, the process might be running for a month(?).
Cheers, Michael
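The split-and-parallelize suggestion could be sketched roughly as follows. This is an illustration under assumptions, not the project's own tooling: `split_by_lines` and `run_chunk` are hypothetical helpers, the toy input stands in for the real sentence file, and `run_chunk` merely copies its input because the actual arguments of parse_fn.sh are not given in this thread.

```shell
#!/bin/sh
# Split a one-sentence-per-line file into at most N chunks, process each
# chunk in its own background process, then join the outputs in order.

split_by_lines() {
    input=$1; n=$2; prefix=$3
    # grep -c '' counts every line, including a final unterminated one.
    total=$(grep -c '' "$input")
    # Ceiling division, so that at most n chunks are produced.
    per_chunk=$(( (total + n - 1) / n ))
    [ "$per_chunk" -gt 0 ] || per_chunk=1
    split -l "$per_chunk" "$input" "$prefix"
}

run_chunk() {
    # Placeholder job (assumption): replace the cat call with the real
    # parse_fn.sh invocation. Errors go to a per-chunk log file.
    cat "$1" > "$1.out" 2> "$1.err"
}

work_dir=$(mktemp -d)
# Toy input: 10 "sentences", one per line.
printf '%s\n' s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 > "$work_dir/sentences.txt"

split_by_lines "$work_dir/sentences.txt" 3 "$work_dir/chunk_"

for chunk in "$work_dir"/chunk_??; do
    run_chunk "$chunk" &    # one background process per chunk
done
wait                        # block until every chunk has finished

# The glob expands in sorted order (chunk_aa, chunk_ab, ...), which
# matches the original sentence order.
cat "$work_dir"/chunk_??.out > "$work_dir/all.out"
```

In practice the number of chunks is probably best kept close to the number of available CPU cores. If each process loads its own copy of the parser models (an assumption worth checking), memory rather than CPU may turn out to be the limiting factor.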