pitrack / incremental-coref

Code for "Moving on from OntoNotes: Coreference Resolution Model Transfer" and "Incremental Neural Coreference Resolution in Constant Memory"
Apache License 2.0

Clarification on Preprocessing #2

Closed wgantt closed 3 years ago

wgantt commented 3 years ago

Hi Patrick,

(tagging @WeiweiGu1998, who's also interested) I was hoping to get clarification on your preprocessing steps for OntoNotes. You refer the reader to the mandarjoshi90/coref repo for preprocessing, which requires running its setup_training.sh script; that in turn calls minimize.py. I gather we should be using the minimize.py script that you provide, not the one in that repo. In other words, setup_training.sh should really be calling

python /path/to/incremental-coref/conversion_scripts/minimize.py $vocab_file $data_dir $data_dir false "bert"

rather than

python /path/to/mandarjoshi90/coref/minimize.py $vocab_file $data_dir $data_dir false

Is this correct?

In general, any additional clarity on exactly what you intend for preprocessing would be appreciated.

pitrack commented 3 years ago

Both scripts are intended to do the same thing on OntoNotes data; there are two real differences (plus one benign typo).

  1. Tokenizer. The script in their repo uses the tokenizer from the official BERT repo, while the one here uses transformers. As far as I know, both are BERT tokenizers, so the output should be identical; I believe I verified that they were identical on OntoNotes (a quick spot-check is sketched after this list).

  2. Segmenting. Their script segments by token boundaries (128, 256, 384, 512). The one here retains that ability but can also segment by sentence boundaries. Unfortunately this choice was hardcoded, and the current copy is the one that segments by sentences ([1, 3, 5, 10]). To reproduce some of the results in the paper (Table 4), you would need to segment this way.

  3. There's a benign typo when initializing the default language variable: it's spelled self.langauge instead of self.language. This shouldn't affect anything, since document_state.language is set correctly later.
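
(If you want to convince yourself of point 1, something like the following spot-check works. This isn't code from either repo; it assumes the official tokenization.py is importable, and the model name and vocab path are placeholders.)

```python
# Hypothetical spot-check, not from either repo: compare the transformers BERT
# tokenizer against the official BERT repo's FullTokenizer on a few sentences.
import tokenization                      # official BERT repo's tokenization.py (assumed on the path)
from transformers import BertTokenizer   # tokenizer used by the minimize.py here

hf_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # placeholder model name
orig_tokenizer = tokenization.FullTokenizer(
    vocab_file="vocab.txt",              # placeholder path to the matching vocab
    do_lower_case=False,
)

samples = ["Mr. Smith went to Washington .", "The quick brown fox jumped over it ."]
for sentence in samples:
    assert hf_tokenizer.tokenize(sentence) == orig_tokenizer.tokenize(sentence), sentence
print("Tokenizations match on these samples.")
```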

So, the easiest preprocessing would be to run their script to get the token-based segmented output (4 sets of (train, dev, test) files for English) and run the one here to get the sentence-based segmented output (another 4 sets of (train, dev, test) files). Alternatively, you could modify the script here by toggling sentences=True to False at L193 and changing [1, 3, 5, 10] to numbers of tokens.
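
For intuition, here's a rough sketch of the two segmentation modes. This is not the actual minimize.py code (which also adds special tokens, tracks speakers, and handles overly long sentences), just an illustration of the idea:

```python
# Rough illustration of the two segmentation strategies, not the repo's code.
# `sentences` is one document's sentences, each a list of subtokens.

def segment_by_sentences(sentences, sents_per_segment):
    """Sentence-boundary mode: every `sents_per_segment` sentences form one
    segment (run once for each of 1, 3, 5, 10)."""
    segments = []
    for i in range(0, len(sentences), sents_per_segment):
        segments.append([tok for sent in sentences[i:i + sents_per_segment] for tok in sent])
    return segments

def segment_by_tokens(sentences, max_segment_len):
    """Token-boundary mode: pack whole sentences into segments of at most
    `max_segment_len` subtokens (run once for each of 128, 256, 384, 512)."""
    segments, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > max_segment_len:
            segments.append(current)
            current = []
        current = current + sent
    if current:
        segments.append(current)
    return segments
```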

Hopefully this is clear, feel free to ask any follow-ups!

wgantt commented 3 years ago

Okay, this is more or less what I figured. Thank you!

I do have one follow-up for now, which concerns minimize_json.py. It appears to be similar to minimize.py, except that it expects JSON as input rather than CoNLL. I just wanted to know what these JSON files are supposed to be, and whether we need to run this script as well or whether minimize.py suffices.

pitrack commented 3 years ago

I don't think minimize_json.py was needed for this project/paper. It can be useful for other datasets (since not all datasets come in CoNLL format), and it was occasionally useful for debugging too.
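
(In case it's useful context: the output of these minimize scripts is a jsonlines file where each line is one document. The sketch below shows roughly what such a record looks like; the field names follow the usual coref preprocessing format and are not taken from this repo, so treat it as an approximation.)

```python
# Illustrative only: a minimal example of the kind of jsonlines record the
# standard coref preprocessing produces (field names may differ slightly here).
import json

example_doc = {
    "doc_key": "bn/cnn/01/cnn_0101_0",   # placeholder document id
    "sentences": [["[CLS]", "John", "said", "he", "was", "tired", ".", "[SEP]"]],
    "speakers": [["[SPL]", "spk1", "spk1", "spk1", "spk1", "spk1", "spk1", "[SPL]"]],
    "clusters": [[[1, 1], [3, 3]]],      # subtoken spans: "John" and "he" corefer
    "sentence_map": [0, 0, 0, 0, 0, 0, 0, 0],   # subtoken index -> sentence index
    "subtoken_map": [0, 0, 1, 2, 3, 4, 5, 5],   # subtoken index -> original word index
}

with open("example.english.jsonlines", "w") as f:   # placeholder filename
    f.write(json.dumps(example_doc) + "\n")
```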

wgantt commented 3 years ago

Great, thanks!