Both scripts are intended to do the same thing on OntoNotes data; there are two-ish differences:
Tokenizer. The script in their repo uses the tokenizer from the official BERT repo; the one here uses `transformers`. As far as I know, both are BERT tokenizers, so the output should be identical. I believe I checked that they were identical on OntoNotes.
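If you want to double-check that for yourself, here's a minimal sanity check, assuming the official BERT repo's `tokenization.py` is on your `PYTHONPATH` and you have a matching vocab file (the model name and vocab path below are placeholders, not from either repo):

```python
# Sanity check that the two BERT tokenizers agree on a sample string.
# Assumes github.com/google-research/bert is importable and vocab.txt
# matches the model; "bert-base-cased" is an assumed choice.
from transformers import BertTokenizer
import tokenization  # the official BERT repo's tokenizer module

hf_tok = BertTokenizer.from_pretrained("bert-base-cased")
bert_tok = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=False)

text = "OntoNotes preprocessing should be deterministic."
assert hf_tok.tokenize(text) == bert_tok.tokenize(text)
```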
Segmenting. Their script segments by token boundaries (128, 256, 384, 512). The one here retains that ability but can also segment by sentence boundaries. Unfortunately, the choice was hardcoded, and the current copy is the one that segments by sentences ([1, 3, 5, 10]). To reproduce some of the results in the paper (Table 4), you would need to segment this way; a sketch of the idea follows.
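To make the sentence-based mode concrete, here's a hedged sketch (the function and the 512-subtoken cap below are illustrative assumptions, not the script's actual code):

```python
def segment_by_sentences(sentences, k, max_subtokens=512):
    """Group every k tokenized sentences into one segment.

    sentences: list of sentences, each a list of subtokens.
    Overlong segments are simply truncated here for brevity;
    the real script may handle overflow differently.
    """
    segments = []
    for i in range(0, len(sentences), k):
        segment = [tok for sent in sentences[i:i + k] for tok in sent]
        segments.append(segment[:max_subtokens])
    return segments

# One output per segment size, mirroring the [1, 3, 5, 10] setting:
doc = [["John", "saw", "Mary", "."], ["He", "waved", "."], ["She", "waved", "back", "."]]
per_size = {k: segment_by_sentences(doc, k) for k in [1, 3, 5, 10]}
```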
There's a benign typo in initializing a default language variable: it's spelled `self.langauge` instead of `self.language`. This shouldn't affect anything, since `document_state.language` is correctly set later.
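Concretely, the typo amounts to something like this (a sketch, not the exact source):

```python
class DocumentState:
    def __init__(self):
        # The misspelled attribute: written once here, never read.
        self.langauge = "english"

# Later in the pipeline, the correctly spelled attribute is assigned,
# and that is the one actually used downstream:
document_state = DocumentState()
document_state.language = "english"
```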
So, the easiest preprocessing would be to run their script to get token-count-based segmented output (4 sets of (train, dev, test) files for English) and to run the one here to get sentence-count-based segmented output (another 4 sets of (train, dev, test) files). Alternatively, you could modify the script here by toggling `sentences=True` to `False` at L193 and changing `[1, 3, 5, 10]` to numbers of tokens; see the sketch below.
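If you go the modification route, the token-based mode looks roughly like this (a sketch under assumptions; the real logic around L193 of the script may break segments differently):

```python
def segment_by_tokens(sentences, max_len):
    """Greedily pack whole sentences into segments of at most max_len subtokens.

    sentences: list of sentences, each a list of subtokens. A single sentence
    longer than max_len is kept intact here; the real script may split it.
    """
    segments, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            segments.append(current)
            current = []
        current.extend(sent)
    if current:
        segments.append(current)
    return segments

# One output per segment length, mirroring the (128, 256, 384, 512) setting:
doc = [["John", "saw", "Mary", "."], ["He", "waved", "."]]
per_len = {n: segment_by_tokens(doc, n) for n in [128, 256, 384, 512]}
```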
Hopefully this is clear; feel free to ask any follow-ups!
Okay, this is more or less what I figured. Thank you!
I do have one follow-up for now, which concerns `minimize_json.py`. It appears to be similar to `minimize.py`, except that it expects JSON as input rather than CoNLL. I just wanted to know what these JSON files are supposed to be, and whether we need to run this script as well or whether `minimize.py` suffices.
I don't think `minimize_json.py` was needed for this project/paper. It can be useful for other datasets (since not all datasets come in CoNLL format), and it was occasionally useful for debugging too.
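In case it helps if you do poke at the JSON side: the minimized files in this family of coref repos are jsonlines, one document per line. Roughly (a sketch; the field names follow the e2e-coref/BERT-coref lineage and aren't guaranteed to match `minimize_json.py` exactly):

```python
import json

# Read a minimized jsonlines file (one JSON document per line); field names
# are assumed from e2e-coref-style preprocessing, not verified here.
with open("dev.english.jsonlines") as f:
    for line in f:
        doc = json.loads(line)
        print(doc["doc_key"])        # document id
        # doc["sentences"]: list of segments, each a list of subtokens
        # doc["speakers"]:  speaker strings, parallel to the subtokens
        # doc["clusters"]:  coref clusters as [start, end] subtoken spans
```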
Great, thanks!
Hi Patrick,
(tagging @WeiweiGu1998, who's also interested) I was hoping to get clarification on your preprocessing steps for OntoNotes. You refer the reader to this repo for preprocessing, which requires running the `setup_training.sh` script there. This in turn calls `minimize.py`. I gather we should be using the `minimize.py` script that you provide, not the one in that repo. Is that right? In other words, `setup_training.sh` should really be calling your `minimize.py` rather than theirs. Is this correct?
In general, any additional clarity on exactly what you intend for preprocessing would be appreciated.