grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)
Apache License 2.0

Use GEC with latest transformer, allennlp modules #98

Closed: Jiltseb closed this issue 2 years ago

Jiltseb commented 3 years ago

I want to use GEC with the latest transformers version (v4.4.2). However, this causes several module errors in gector that seem difficult to fix. I also tried allennlp v1.5.0 but ran into errors.

Note: there is no issue getting GEC to work with the versions specified in requirements.txt. It's just that I want to use it in a virtual environment with the latest transformers/allennlp versions.

Any help is highly appreciated! @skurzhanskyi

abhinavdayal commented 3 years ago

I also tried with the latest versions. It seems a lot of the code uses deprecated functionality that needs to be rewritten; see the sketch below for one instance.
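
One concrete example of what breaks (a sketch with a hypothetical indexer, not gector's actual file): the @overrides decorator checks at class-definition time that the parent class still defines the method, so hooks that newer allennlp removed, like the indexer's pad_token_sequence, make the module fail as soon as it is imported:

```python
from overrides import overrides
from allennlp.data.token_indexers import TokenIndexer


class MyIndexer(TokenIndexer):
    # In allennlp 0.8.x, TokenIndexer declared pad_token_sequence, so this
    # was fine; in allennlp >= 1.0 the method is gone and @overrides raises
    # an error the moment the class is defined, before anything runs.
    @overrides
    def pad_token_sequence(self, tokens, desired_num_tokens, padding_lengths):
        return tokens
```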

Jiltseb commented 3 years ago

@skurzhanskyi any news on this? We are still stuck with this.

skurzhanskyi commented 3 years ago

Hi @Jiltseb We have plans to update transformers this month.

ierezell commented 3 years ago

Hi, any update on this?

I had to change the code to make it fit with the new allennlp (I can do a PR if needed), but I'm still facing many issues while loading models or running predictions.

I tried all 3 pretrained models and cannot make any of them work...

Thanks in advance

skurzhanskyi commented 3 years ago

Hi, there's a branch with transformers==4.2.2. You can check it here: https://github.com/grammarly/gector/tree/update_transformers_support_fasttokenizers

At the same time, the pretrained models produce poor output with this code; we're in the middle of retraining them.

ierezell commented 3 years ago

Hi @skurzhanskyi, I just tried it, but unfortunately I got the same errors...

I'm trying to use the GecBERTModel class directly to integrate it into my code. I have a higher version of allennlp, but I modified the imports to work with it; however, most of the errors come from missing keys or bad loading of the models.
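
For reference, I'm constructing it roughly the way predict.py does (the paths and model name here are placeholders, not my exact values):

```python
from gector.gec_model import GecBERTModel

model = GecBERTModel(
    vocab_path="data/output_vocabulary",       # vocabulary shipped in the repo
    model_paths=["path/to/bert_0_gector.th"],  # one of the pretrained checkpoints
    model_name="bert",
    special_tokens_fix=0,
    is_ensemble=0,
)

# handle_batch takes a batch of tokenized sentences and returns the
# corrected batch plus the number of edits applied.
batch = ["How ar you my firend ?".split()]
corrected, total_updates = model.handle_batch(batch)
```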

I will wait for the new release.

Have a great day

ierezell commented 3 years ago

Hi, sorry to post again...

I managed to make it work with transformers 4.6.1 and allennlp 2.6.0.

However, the output of handle_batch(my_string_sentence.split()) doesn't correct anything... For example:

```python
>>> handle_batch("How ar you my firend ?".split())
[['How', 'ar', 'you', 'my', 'firend', '?']] 0
```
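
For anyone reproducing this: GecBERTModel also filters edits below min_error_probability and biases predictions toward keeping tokens via confidence (argument names as in predict.py), so it's worth ruling those out first. A hypothetical sanity check:

```python
# Hypothetical check: disable confidence filtering so any predicted
# edit gets applied (argument names as in predict.py).
model = GecBERTModel(
    vocab_path="data/output_vocabulary",
    model_paths=["path/to/bert_0_gector.th"],
    model_name="bert",
    special_tokens_fix=0,
    min_error_probability=0.0,  # edit-probability filter off
    confidence=0.0,             # no extra bias toward the $KEEP tag
)
print(model.handle_batch(["How ar you my firend ?".split()]))
```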

To do this, I removed some @overrides decorators and added this function:

```python
from typing import Dict, List

import torch


def as_padded_tensor_dict(
    self,
    tokens: Dict[str, List[int]],
    padding_lengths: Dict[str, int],
) -> Dict[str, torch.Tensor]:
    # NOTE: padding_lengths is ignored here, so this only works when
    # every sentence in the batch already has the same length.
    return {
        "input_ids": torch.tensor(tokens["bert"]),
        "offsets": torch.tensor(tokens["bert-offsets"]),
    }
```

in tokenizer_indexers.py; it seems to replace the old pad_token_sequence.
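
One caveat with my version above: it ignores padding_lengths, so a batch with mixed sentence lengths won't stack into tensors of one shape. Something closer to the allennlp contract (still assuming the "bert" / "bert-offsets" keys) would pad first, e.g. with allennlp's pad_sequence_to_length:

```python
from typing import Dict, List

import torch
from allennlp.common.util import pad_sequence_to_length


def as_padded_tensor_dict(
    self,
    tokens: Dict[str, List[int]],
    padding_lengths: Dict[str, int],
) -> Dict[str, torch.Tensor]:
    # Pad every key to the length allennlp computed for the batch,
    # then tensorize, so all instances stack to the same shape.
    return {
        "input_ids": torch.LongTensor(
            pad_sequence_to_length(tokens["bert"], padding_lengths["bert"])
        ),
        "offsets": torch.LongTensor(
            pad_sequence_to_length(
                tokens["bert-offsets"], padding_lengths["bert-offsets"]
            )
        ),
    }
```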

I also had to remove the mask in the seq2labels_model, as it was always true and not of the same size...
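
If it helps, a safer alternative to deleting the mask might be rebuilding it from the padded ids (a sketch, assuming pad id 0):

```python
import torch


def build_mask(input_ids: torch.Tensor) -> torch.Tensor:
    # Positions holding a real wordpiece become 1, padding becomes 0;
    # the result has exactly the same shape as input_ids.
    return (input_ids != 0).long()
```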

I know I did some hackish things and was hoping it would work, as I don't know the codebase.

I hope this helps you and that we can have a really nice open-source, state-of-the-art grammar corrector (which we can train in other languages) :)

Have a great day

Jiltseb commented 3 years ago

@skurzhanskyi Any update on the release with the new retrained models?

skurzhanskyi commented 2 years ago

Hi @Jiltseb Sorry for the late reply. Unfortunately, we've had problems getting the same quality of models with the branch code, so we cannot move to it completely. In case you don't need the pretrained models, you can try using this branch.

skurzhanskyi commented 2 years ago

Hi @Jiltseb @Ierezell @abhinavdayal We have great news: we just merged https://github.com/grammarly/gector/pull/133, a new GECToR version that supports the latest transformers & torch. There are also new pretrained models (BERT, RoBERTa, XLNet). The scores are slightly different but still comparable.