vdobrovolskii / wl-coref

This repository contains the code for the EMNLP-2021 paper "Word-Level Coreference Resolution"
MIT License

Share the finetuned model to the world #43

Closed LifeIsStrange closed 1 year ago

LifeIsStrange commented 1 year ago

Hi @vdobrovolskii, friendly ping :) Thank you for this excellent project, which I would like to use. Unfortunately, if I understand correctly, the pretrained model you link for download is insufficient and manual training is necessary even to run prediction on arbitrary text. If so, it would be very kind of you to make the finetuned model available for download! You only have to train once (and probably already have the finetuned model) and upload it to Google Drive, a torrent, Dropbox, or any free cloud service.

My hardware doesn't have enough RAM to do the training :/

vdobrovolskii commented 1 year ago

Hi! Could you please elaborate on what the difference should be between the model that is already available for downloading and the one that you would like to have?

LifeIsStrange commented 1 year ago

Hi :) I just want to be able to predict, i.e. use the finetuned model on new text from the web. If I run

python predict.py roberta sample_input.jsonlines output.jsonlines

it errors with: FileNotFoundError: [Errno 2] No such file or directory: 'data/english_train_head.jsonlines'

So for some reason the pretrained (finetuned?) model requires me to run the preparation step. I tried to download the dataset like you said, here: https://catalog.ldc.upenn.edu/LDC2013T19. I created an account and logged in, but I can't find the download button lol. There is no clickable link, unless I'm blind :) There doesn't seem to be a way to download; is the webpage broken?

Beyond that issue, if I understand what you're implying, the model available for download is already finetuned and I don't need to launch the train script. If so, my bad, I just didn't understand why having the OntoNotes corpus is necessary for prediction.

UPDATE: I don't have an organization, so it seems I can't access the dataset... But I shouldn't need the dataset with a finetuned model, which is exactly the point of my issue. So why am I hitting FileNotFoundError: [Errno 2] No such file or directory: 'data/english_train_head.jsonlines'?

BTW, thanks a lot for your availability, you are amazing!

vdobrovolskii commented 1 year ago

Ok, I understand! Let me see what I can do (it should be enough to modify the initialization code)

vdobrovolskii commented 1 year ago

I pushed a change that addresses this issue. Now you shouldn't need to have OntoNotes to use predict.py.

LifeIsStrange commented 1 year ago

Welcome back :) I pulled your change and now I get the following error:

 py -3.7 predict.py roberta sample_input.jsonlines output.jsonlines
Loading roberta-large...                          
Using tokenizer kwargs: {'add_prefix_space': True}
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.43G/1.43G [13:43<00:00, 1.73MB/s]
Bert successfully loaded.
Loading from data\roberta_(e20_2021.05.02_01.16)_release.pt...
Traceback (most recent call last):
  File "predict.py", line 63, in <module>
    "bert_scheduler", "general_scheduler"})
  File "C:\Users\steph\PycharmProjects\wl-coref\coref\coref_model.py", line 201, in load_weights
    self.trainable[key].load_state_dict(state_dict)
  File "C:\Users\steph\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for RobertaModel:
        Unexpected key(s) in state_dict: "embeddings.position_ids".

vdobrovolskii commented 1 year ago

just worked for me...

(wl_coref) vladimir@MSI:~/wl-coref$ python predict.py roberta sample_input.jsonlines output.jsonlines
Loading roberta-large...
Using tokenizer kwargs: {'add_prefix_space': True}
Bert successfully loaded.
Loading from data/roberta_(e20_2021.05.02_01.16)_release.pt...
Loaded bert
Loaded we
Loaded rough_scorer
Loaded pw
Loaded a_scorer
Loaded sp
100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.13s/docs]

I see you're on Windows, I wonder if it's related.

What is your transformers version?

transformers==3.2.0
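
For reference, if the mismatch comes from a transformers version difference, a common generic workaround is to drop the unexpected buffer from the checkpoint before loading it. The snippet below is only a sketch: it assumes the checkpoint stores one state_dict per module under keys like "bert" (as the traceback suggests) and is not code from this repo.

import torch

# Hedged sketch: strip the "embeddings.position_ids" buffer that some
# transformers versions save and others do not expect, then re-save.
ckpt_path = "data/roberta_(e20_2021.05.02_01.16)_release.pt"
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Assumption: the bert weights live under the "bert" key of the checkpoint.
bert_state = checkpoint.get("bert", {})
bert_state.pop("embeddings.position_ids", None)

torch.save(checkpoint, ckpt_path)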

LifeIsStrange commented 1 year ago

Re @vdobrovolskii: I have recloned the project; it appears I had some mixed Python 3.10 dependencies in cache or something like that.

Now I no longer have that error, but I have a new one later in the process:

PS C:\Users\steph\PycharmProjects\wl-coref-2> py -3.7 predict.py roberta sample_input.jsonlines output.jsonlines
Loading roberta-large...
Using tokenizer kwargs: {'add_prefix_space': True}
Bert successfully loaded.
Loading from data\roberta_(e20_2021.05.02_01.16)_release.pt...
Loaded bert
Loaded we
Loaded rough_scorer
Loaded pw
Loaded a_scorer
Loaded sp
  0%|                                                                                                                                                          | 0/2 [01:00<?, ?docs/s]
Traceback (most recent call last):
  File "predict.py", line 71, in <module>
    result = model.run(doc)
  File "C:\Users\steph\PycharmProjects\wl-coref-2\coref\coref_model.py", line 220, in run
    words, cluster_ids = self.we(doc, self._bertify(doc))
  File "C:\Users\steph\PycharmProjects\wl-coref-2\coref\coref_model.py", line 354, in _bertify
    attention_mask, device=self.config.device))
  File "C:\Users\steph\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\steph\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\modeling_roberta.py", line 674, in forward
    return_dict=return_dict,
  File "C:\Users\steph\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\steph\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\modeling_roberta.py", line 280, in forward
    output_attentions,
  File "C:\Users\steph\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\steph\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\modeling_roberta.py", line 197, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: cublas runtime error : the GPU program failed to execute at C:/w/1/s/windows/pytorch/aten/src/THC/THCBlas.cu:368

Edit: I currently have torch 1.4.0+cu92; I'm going to try a newer CUDA version, given that I have an RTX 3000 GPU.

vdobrovolskii commented 1 year ago

You can try installing a more stable version of torch (that would be 1.8.2, I think)

Also, you can try changing the device to cpu in config.toml to confirm that it's working.
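
For example, a quick sanity check with plain PyTorch (just an illustrative sketch, not code from this repo) can tell you whether the installed torch build can drive the GPU at all; cu92 builds predate the RTX 30xx architecture, which would explain the cublas failure.

import torch

# Minimal sketch: check whether the installed torch build can use the GPU.
print(torch.__version__)                  # e.g. 1.4.0+cu92
print(torch.version.cuda)                 # CUDA version torch was built against
print(torch.cuda.is_available())          # False -> set device to "cpu" in config.toml
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the detected GPU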

LifeIsStrange commented 1 year ago

Re: like you said, as a workaround I replaced "cuda:0" with "cpu" and it works!

Although I'm not sure how to interpret the results...

For example, it seems to be working for ["Bob", "is", "kind", "but", "he", "is", "dangerous", "."]: I get "word_clusters": [[0, 4]], which links the entities Bob and he, a correct coreference resolution. But I don't understand what "span_clusters": [[[0, 1], [4, 5]]] is supposed to mean.

Also, for the default example you give, I don't understand the result. For ["Hi", ",", "my", "name", "is", "Tom", ".", "I", "am", "five", "."] I get "word_clusters": [[2, 7]], which seems to link "my" with "I". IMHO that is incorrect; the coreferences should be "Tom" and "I", no?

as for ["Because", "Joseph", "her", "husband", "was", "faithful", "to", "the", "law,", "and", "yet", "did", "not", "want", "to", "expose", "her", "to", "public", "disgrace,", "he", "had", "in", "mind", "to", "divorce", "her", "quietly", "."] I do understand the word_clusters output

But my most important question is: what do you think is the most accurate document_id to use for, e.g., Wikipedia pages? Would it be wb (web data) for maximum accuracy? Also, do you believe the model, which is trained on CoNLL, generalizes to out-of-domain standard English such as Wikipedia pages? Would the 81% accuracy be preserved? Would it be even higher, assuming CoNLL is made of harder inputs than average text? Worse? Any opinion? Again, thanks a lot for the fix :)

vdobrovolskii commented 1 year ago

span_clusters shows the same entities as word_clusters, but instead of indices of words, you will get the indices of word spans.

The [[2, 7]] result links my and I, which is expected. Tom here is not referring to an entity, rather, it just names it. See the OntoNotes annotation guidelines to dive deeper into this.

The span_clusters for the same example will be [[[2, 3], [7, 8]]], which means the span "from word 2 up to (but not including) word 3" links to the span "from word 7 up to (but not including) word 8", i.e. my and I.
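
To make the indexing concrete, here is a tiny illustrative snippet (it only assumes that word indices point into the token list and that spans are half-open [start, end) intervals, as in the example above):

tokens = ["Hi", ",", "my", "name", "is", "Tom", ".", "I", "am", "five", "."]

word_clusters = [[2, 7]]
for cluster in word_clusters:
    print([tokens[i] for i in cluster])                    # ['my', 'I']

span_clusters = [[[2, 3], [7, 8]]]
for cluster in span_clusters:
    print([tokens[start:end] for start, end in cluster])   # [['my'], ['I']]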

For Wikipedia, I'd say wb should be the best match, even though it's not a 100% fit. The performance on out-of-domain data will inevitably drop, but it is difficult to say by how much; one would need to consider the differences between the domains. You can also try some domain adaptation techniques on your data; the exact techniques will depend on whether or not you have annotated data in your target domain.

LifeIsStrange commented 1 year ago

Thank you, I understand now :)

Tom here is not referring to an entity, rather, it just names it

Hmm, that's bad news for me... I wanted to use coreference resolution to expand pronouns into nouns in text, which would allow me to further analyze the text for semantic parsing.

I can use named entity recognition to detect that Tom is a person, but I still have to manually connect it to the coreferences.

You can also try some domain adaptation techniques on your data

What would such an unsupervised(?) technique be, assuming I don't have labels and am kind of a noob at ML techniques?

vdobrovolskii commented 1 year ago

Hmm, that's bad news for me... I wanted to use coreference resolution to expand pronouns into nouns in text, which would allow me to further analyze the text for semantic parsing.

It should not normally be an issue, unless you have a lot of texts with those "I am X"/"He is X" constructions in direct speech and no later references to the same entity using "X". If it is an issue, you can do some post-processing of the text: I'd imagine you could parse it with spaCy and then manually add such mentions to your entities whenever you encounter something like "ENTITY + to be + X".
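
A rough sketch of that post-processing idea, assuming spaCy's English dependency labels ("attr" for the predicate nominal, "nsubj" for the subject); the function name and the exact labels are assumptions, so treat it as a starting point rather than a drop-in solution:

import spacy

nlp = spacy.load("en_core_web_sm")

def copular_mentions(text):
    """Collect (subject, predicate nominal) pairs from constructions
    like "I am Tom" or "my name is Tom"."""
    doc = nlp(text)
    pairs = []
    for token in doc:
        # "attr" marks the predicate nominal of a copula; its head
        # should be a form of "be".
        if token.dep_ == "attr" and token.head.lemma_ == "be":
            subjects = [c for c in token.head.children if c.dep_ == "nsubj"]
            for subj in subjects:
                pairs.append((subj.text, token.text))
    return pairs

print(copular_mentions("Hi, my name is Tom. I am five."))
# Expected to include ("name", "Tom") for "my name is Tom".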

What would such an unsupervised(?) technique be, assuming I don't have labels and am kind of a noob at ML techniques?

I'd say start with this paper; then you'll be able to find your way once you understand what options are available for your particular case.

vdobrovolskii commented 1 year ago

Also, if you use spaCy, you may want to use their implementation of a coreference resolution system, which is based on the approach from this repo as well: https://spacy.io/api/coref