segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Failed to Adapt to your own corpus via LoRA #123

Closed 12eue closed 2 months ago

12eue commented 4 months ago

Hello, I tried to train a LoRA on my own data, but the steps in the "Adapt to your own corpus via LoRA" section of README.md did not work. The problems include, but are not limited to, the following:

  1. After cloning the code, there is no segment-any-text directory and no adapters directory, so the second line ("cd segment-any-text") and the fifth line ("cd adapters") cannot be executed.
  2. Running pip install -r requirements.txt in a Python 3.8 environment installs packages that conflict with each other.
  3. Executing "python3 wtpsplit/train/train_lora.py configs/lora/lora_dummy_config.json" fails because the intrinsic.py file is missing.

Is there any way to fix this?

markus583 commented 4 months ago

Hi, thanks for raising this! This occurred when porting the research codebase into the library. To fix your issues:

  1. I updated the README.md accordingly. Please use this workflow to install the dependencies:
    git clone https://github.com/segment-any-text/wtpsplit
    cd wtpsplit
    pip install -r requirements.txt
    pip install adapters==0.2.1 --no-dependencies
    cd ..
  2. Please use Python 3.9. With the mentioned workflow, this should work now. I just tested it in WSL.
  3. I fixed some imports. Please update to the recent version (from source).
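
As a quick sanity check before training, the interpreter and adapters versions mentioned above can be verified with a short script. This is only a sketch: the helper names installed_version and python_matches are made up for illustration, and the expected values are simply the ones named in this thread (Python 3.9, adapters 0.2.1).

```python
import sys
from importlib import metadata
from typing import Optional, Tuple

# Values taken from this thread; only this combination was reported as tested.
EXPECTED_PYTHON = (3, 9)
EXPECTED_ADAPTERS = "0.2.1"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def python_matches(version_info: Tuple[int, ...],
                   expected: Tuple[int, int] = EXPECTED_PYTHON) -> bool:
    """Compare only the major and minor components of the interpreter version."""
    return tuple(version_info[:2]) == expected


if __name__ == "__main__":
    print("Python OK:", python_matches(sys.version_info))
    print("adapters OK:", installed_version("adapters") == EXPECTED_ADAPTERS)
```

If either line prints False, the environment differs from the one described in the workflow above, which may or may not matter for a given setup.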

Also, lora_dummy_config.json should of course be changed. There are some templates we used successfully in our paper in configs/lora.

Hope this helps!

12eue commented 4 months ago

@markus583 I tried the method you mentioned above, but there are still several problems. For example, "from wtpsplit.evaluation.intrinsic_pairwise import generate_pairs, generate_k_mers, proc" in "evaluate.py" raises an error saying that "generate_pair" does not exist in intrinsic_pairwise.py. Even if I delete the code that is irrelevant to me to get past these problems, initializing the SubwordXLMForTokenClassification object still fails because it has no init_adapters method. [screenshot]

markus583 commented 3 months ago

I see, does this happen when you run wtpsplit/train/train_lora.py? Currently a bit pressed with other projects but I'll get back to this soon

12eue commented 3 months ago

> I see, does this happen when you run wtpsplit/train/train_lora.py? Currently a bit pressed with other projects but I'll get back to this soon

Yes, I got the above error when running wtpsplit/train/train_lora.py. In the end, I made the following adjustments to make the program run normally:

  1. Modify the SubwordXLMRobertaModel class in wtpsplit/models.py: add self.__class__.__name__ = 'XLMRobertaModel' in the init function so that the adapters init method can run through. Note that this modification only works with adapters version 0.2.1; I tried version 0.2.2 and it gave an error. This is not a good solution, just a temporary fix for my problem. I hope to see you update the code.
  2. In the wtpsplit/train/adaptertrainer.py file, because I am not running on TPU, the following code will not be executed:
    if is_torch_tpu_available(check_device=False):
        import torch_xla.core.xla_model as xm  # noqa: F401
        import torch_xla.debug.metrics as met  # noqa: F401
        import torch_xla.distributed.parallel_loader as pl  # noqa: F401

    When executing the evaluation_loop function, an error is reported because the variable 'xm' is not defined. I temporarily deleted the relevant code so that my run could proceed.

  3. The method of loading a LoRA described in README.md is not correct. The style_or_domain and language parameters must be specified; otherwise the code does not enter the branch that loads the LoRA.
  4. The import 'from wtpsplit.evaluation.intrinsic_pairwise import generate_pairs, generate_k_mers, process_logits_k_mers' in wtpsplit/train/evaluate.py raises an error. I removed 'generate_pairs' from it to continue running.
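
For point 2, one way to avoid the undefined name without deleting code is to always bind xm, using None when torch_xla is absent, and to guard every use. The sketch below is an assumption about how this could be done, not what the repository does; optional_import and sync_if_tpu are hypothetical helpers.

```python
import importlib
import importlib.util


def optional_import(module_name: str):
    """Return the imported module, or None if its top-level package is missing."""
    top_level = module_name.split(".")[0]
    if importlib.util.find_spec(top_level) is None:
        return None
    return importlib.import_module(module_name)


# On a CPU/GPU machine torch_xla is usually absent, so xm is None instead of undefined.
xm = optional_import("torch_xla.core.xla_model")


def sync_if_tpu(step_fn):
    """Hypothetical stand-in for a TPU-only branch inside an evaluation loop."""
    result = step_fn()
    if xm is not None:
        xm.mark_step()  # only reached when torch_xla was actually imported
    return result
```

With this pattern the name xm always exists, so a missing guard shows up as an AttributeError on None at the exact call site rather than a NameError.
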
markus583 commented 3 months ago

Thanks for the detailed info! Really appreciate it. Currently I don't have access to a decent enough CUDA-powered device so I can't fully test this. But I pushed a version that should fix many of the mentioned issues.

As for the specifics:

  1. Quite surprised by this. I did not need this. Where exactly did you face this error?
  2. Should be fixed, but as I said, don't have access to a GPU.
  3. Indeed, thanks. I updated the code such that it also enters the loop when a lora_path is provided. So the README holds and it works as expected.
  4. Thanks again, I overlooked this one! Fixed now.
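
The condition in point 3 can be sketched roughly as follows. This is a hypothetical predicate, not the actual wtpsplit code: the function name should_load_lora is made up, and only the parameter names style_or_domain, language, and lora_path come from this thread.

```python
from typing import Optional


def should_load_lora(style_or_domain: Optional[str] = None,
                     language: Optional[str] = None,
                     lora_path: Optional[str] = None) -> bool:
    """Hypothetical predicate mirroring the branch condition discussed above."""
    # Earlier behaviour as reported: both names were required to enter the branch.
    has_named_adapter = style_or_domain is not None and language is not None
    # Behaviour after the fix as described: a local path alone is also sufficient.
    has_local_adapter = lora_path is not None
    return has_named_adapter or has_local_adapter
```

Under this sketch, passing only lora_path enters the loading branch, while passing only one of style_or_domain or language does not.
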
12eue commented 3 months ago

Regarding the first point: the error occurred when I trained a LoRA on my own data by executing python3 wtpsplit/train/train_lora.py configs/lora/lora_dummy_config.json. The configuration file I used is based on lora.lyrics.json in the configs/lora directory. The specific error is as follows: [screenshot]

train_lora.py reports an error when executing the line adapters.init(backbone). Looking into the model.py file of the adapters package, the init function wraps the model and finally executes model.init_adapters(model.config, adapters_config). [screenshot]

During the execution process, I found that because SubwordXLMForTokenClassification and SubwordXLMRobertaModel are not instances of the ModelAdaptersMixin class of the adapters package, the model is not wrapped, and the method is not found when the init_adapters function is finally executed.
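
To illustrate the failure mode, here is a minimal sketch with stand-in classes. It is not the actual adapters code: it assumes the wrapping step looks models up by class name, which is what the rename workaround suggests, and AdaptersMixin, MIXIN_REGISTRY, and this init function are all hypothetical.

```python
class AdaptersMixin:
    """Stand-in for the mixin that provides init_adapters (hypothetical)."""

    def init_adapters(self, config, adapters_config=None):
        self.adapters_ready = True


class XLMRobertaModel:
    """Stand-in for a model class the registry knows about."""
    config = {"hidden_size": 8}


class SubwordXLMRobertaModel(XLMRobertaModel):
    """Stand-in for a custom subclass whose name is not registered."""


# Name-keyed lookup: only listed class names get the mixin attached.
MIXIN_REGISTRY = {"XLMRobertaModel": AdaptersMixin}


def init(model):
    """Toy version of the wrap-then-init flow described above."""
    mixin = MIXIN_REGISTRY.get(model.__class__.__name__)
    if mixin is not None and not isinstance(model, mixin):
        # Rebuild the class with the mixin first so its methods resolve first.
        model.__class__ = type(model.__class__.__name__, (mixin, model.__class__), {})
    model.init_adapters(model.config)  # AttributeError if nothing was wrapped
```

In this toy version, init(SubwordXLMRobertaModel()) raises AttributeError because the name is not in the registry, while setting SubwordXLMRobertaModel.__name__ = "XLMRobertaModel" first makes the lookup succeed, which matches the behaviour I observed with the workaround.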

I am not sure whether there is a problem with the initialization and configuration of train_lora.py when training the model, or there is a problem with the definition of SubwordXLMForTokenClassification and SubwordXLMRobertaModel in the model.py file in the wtpsplit package. I hope you can help check these two files to see if there is anything that needs to be updated in the repository.

markus583 commented 3 months ago

I see, thank you for your clear and detailed explanation. While looking through an older version of the research code, I realized that I used a custom version of the adapters library that adds a key for SubwordXLMForTokenClassification, which makes init_adapters work again. In this repo, of course, we do not want to clone the adapters library in full. So what we do in wtpsplit/__init__.py is to monkey patch (L #509) so that the init_adapters method works. I have not yet done this here for the adaptation. I will get back to this in September when I have both more time and a GPU to test this. Until then, you can try out this method or continue using the fix you mentioned.
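
Roughly, the idea of adding a key at import time can be sketched like this. The mapping and the register_alias helper are hypothetical placeholders and do not reflect the real internals of the adapters library or the actual patch in wtpsplit/__init__.py.

```python
# Stand-in for a third-party, module-level mapping (hypothetical names and values).
MODEL_MIXIN_MAPPING = {"XLMRobertaModel": "XLMRobertaMixin"}


def register_alias(mapping: dict, new_name: str, existing_name: str) -> None:
    """Reuse the entry of an already supported class for a custom class name."""
    if new_name not in mapping:
        mapping[new_name] = mapping[existing_name]


# Run once at import time, before any model is initialised.
register_alias(MODEL_MIXIN_MAPPING, "SubwordXLMRobertaModel", "XLMRobertaModel")
```

Compared with renaming the class, this leaves the model's own class name untouched and keeps the change in one place.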

markus583 commented 2 months ago

I just pushed the corresponding fix and tested it on GPU, following the tutorial in the README. It works as expected now, so I will close this issue. If anything still comes up, feel free to comment here and I will re-open it.