ianporada / coref-reeval

A controlled reevaluation of coreference resolution models

It is not possible to run encoder-decoder using Oracle #2

Closed shaked571 closed 1 month ago

shaked571 commented 2 months ago

I am opening this as a new issue because, although it is related to the finetuning problem, I believe it deserves a separate thread. Feel free to close it and continue in the original one.

I tried to run it, but I get a formatting error when I run with:

python main.py --model_path oracle \
               --max_input_size 3000 \
               --output_dir output_oracle \
               --split test \
               --batch_size 4 \
               --dataset_name preco \
               --no_pound_symbol \
               --subset 10 \
               --subset_start 0

It seems like the preco dataset doesn't fit the schema and is missing the field 'coref_spans' (as are all the other open datasets). Unfortunately, I don't have access to the default dataset (I raised a request via Hugging Face), so when I execute load_dataset('coref-data/conll2012_indiscrim', 'english_v4') it fails and I get an exception as well.
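For reference, this is roughly how I'm checking which fields a dataset exposes (a minimal sketch with the datasets library; "coref-data/preco" is my assumption for the public PreCo mirror, substitute whichever dataset you actually load):

from datasets import load_dataset

# Load a small slice and inspect its columns; the oracle code apparently
# expects a 'coref_spans' field that the open datasets don't provide.
ds = load_dataset("coref-data/preco", split="test[:5]")  # assumed dataset name
print(ds.column_names)
print(list(ds[0].keys()))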

How exactly should I run main.py?

Alternatively, can you share code that transforms OntoNotes data (from any format) into the link-append format? I'm guessing you wouldn't open up the OntoNotes data because of its license, but since I have it myself, I would be able to convert it to the required format (I believe you called it llm-coref/mt5-coref-ontonotes).

P.S. Sorry for the delayed answer; I needed to set up an A100 machine to get things running.

Originally posted by @shaked571 in https://github.com/ianporada/coref-reeval/issues/1#issuecomment-2094253451

ianporada commented 2 months ago

This should now be fixed in the most recent commit.

We originally wrote the code using the OntoNotes format from https://huggingface.co/datasets/conll2012_ontonotesv5, but since then I've reformatted all datasets to use a consistent format so that the code can work with any dataset. The reformatted datasets are available at https://huggingface.co/collections/coref-data/indiscriminate-identity-coreference-65a7f336c46ce42ef5655570

I had updated the inference code to work with this new format but had never updated the oracle generation code. I've now updated it, so oracle generation should work (I just tested it on my end for preco). A GPU is not necessary for oracle generation.

You're right: because OntoNotes is under an LDC license, we cannot redistribute it and therefore cannot make coref-data/conll2012_indiscrim public, but you can request gated access since you already have access to the dataset.

In any event, I plan to release all the code for reformatting the datasets. I'll leave this issue open and add a comment here when that code is public.
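Once gated access is granted, loading the dataset looks roughly like this (a minimal sketch using the standard huggingface_hub login, nothing specific to this repo):

from huggingface_hub import login
from datasets import load_dataset

# Authenticate with a Hugging Face access token from an account
# that has been granted gated access to the dataset.
login()

# Load the reformatted (indiscrim) OntoNotes data.
ds = load_dataset("coref-data/conll2012_indiscrim", "english_v4", split="test")
print(ds)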

shaked571 commented 2 months ago

Thanks! I see you already gave me permissions, so I will try it ASAP.

I think the code that lets you reformat the data would be very useful.

By allowing people to transform their own data into one consistent format, it would make it easier to compare the models apples to apples.

In my case, for example, it is even harder in the current setting because my paper deals with a language that doesn't even have a coref dataset. So I'm just using these SOTA models to supply a baseline, which was not very controlled, because every codebase chose different hyperparameter tuning that I had to follow (with minor fixes, because my dataset is much smaller than OntoNotes).

If researchers had a one-stop shop to evaluate their new datasets, it would be great.

shaked571 commented 2 months ago

Hey @ianporada,

Is there any progress on uploading the code, or should I maybe try to implement it myself? I still have some time to finish the annotations, so I waited, but I would like to run it within a couple of weeks. I just want to know how you would estimate the timeline. Thanks in advance for your effort!

ianporada commented 2 months ago

Sorry @shaked571, I didn't realize you were directly waiting on this! I've made the repo public. A description of the final format is available at https://github.com/ianporada/coref-data/blob/main/README.md, and here is the conversion script for conll2012: https://github.com/ianporada/coref-data/blob/main/preprocessing/indiscrim_conversion/conll2012.py
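Once you've converted your own data, one way to sanity-check it against the reference format is to compare schemas (a sketch; load_from_disk and the local path are assumptions for however you saved your conversion):

from datasets import load_dataset, load_from_disk

# Reference schema from the gated reformatted OntoNotes dataset.
reference = load_dataset("coref-data/conll2012_indiscrim", "english_v4", split="validation")

# Your own converted dataset, saved locally (hypothetical path).
mine = load_from_disk("my_converted_dataset")

print(reference.features)  # the target indiscrim schema
print(mine.features)       # should match field for field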

shaked571 commented 2 months ago

Thanks a lot!

Yes, I want to add the model to my paper, and it's the only one I haven't implemented yet!

Hope your paper gets accepted to the upcoming conferences :)

ianporada commented 1 month ago

I'll try to make an example jupyter notebook of all the steps to make it easier to follow.

ianporada commented 1 month ago

I just had some time to make a quick notebook showing how to generate the training data: https://colab.research.google.com/drive/11VQh2Hyq7Qz1yJFNSWT_grYxA7W0mTWT?usp=sharing. Does this make it clearer? I can also show how to then finetune the model on this data.
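For what it's worth, the released checkpoint itself can be loaded directly with transformers (a minimal sketch; the input string must already be in the link-append format that the notebook constructs, and the placeholder below is not a valid input):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the released mT5 coreference model named earlier in this thread.
tokenizer = AutoTokenizer.from_pretrained("llm-coref/mt5-coref-ontonotes")
model = AutoModelForSeq2SeqLM.from_pretrained("llm-coref/mt5-coref-ontonotes")

# Placeholder: replace with a real link-append formatted document.
inputs = tokenizer("<link-append formatted input>", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))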

shaked571 commented 1 month ago

Sorry, I forgot to answer! Yes, it makes sense. (I reopened and closed the issue after I figured out what I did wrong.)

Thanks for all the help