Open alexandreteles opened 3 months ago
Hi @alexandreteles, thank you for your interest in our project.
In fact, we have already released the entire data collection pipeline and scripts at https://github.com/microsoft/LLMLingua/tree/main/experiments/llmlingua2/data_collection. You can build your own compressor on top of them. The open-sourcing of the dataset itself has been delayed by the review process. Once it's approved, we will release it at this HF link.
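For readers wondering what training data for a custom compressor might look like, here is a minimal sketch. It assumes the compressor is trained as a per-word keep/drop classifier and that you already have pairs of original and compressed text. The function name `label_words` and the greedy in-order matching are illustrative assumptions, not code from the repository; the scripts linked above may align words differently.

```python
def label_words(original: str, compressed: str) -> list[tuple[str, int]]:
    """Label each word of `original` with 1 (kept) or 0 (dropped).

    Hypothetical helper: assumes `compressed` keeps a subset of the
    original words in their original order, and matches them greedily.
    """
    orig_words = original.split()
    comp_words = compressed.split()
    labels = []
    j = 0  # index of the next compressed word still waiting for a match
    for word in orig_words:
        if j < len(comp_words) and word.lower() == comp_words[j].lower():
            labels.append((word, 1))  # word survives compression
            j += 1
        else:
            labels.append((word, 0))  # word was removed by compression
    return labels


pairs = label_words(
    "the meeting will start at noon today",
    "meeting start noon",
)
for word, keep in pairs:
    print(f"{word}\t{keep}")
```

Word-level labels like these could then be fed to a standard token-classification fine-tuning setup. Real data would likely need more tolerant matching than this, for example when the compressed text changes word forms or punctuation.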
Describe the issue
Greetings,
Are there any plans to release instructions, or at least the dataset format, so we can fine-tune llmlingua-2-xlm-roberta-large-meetingbank or the base xlm-roberta-large into a custom compressor? If not, can you at least give some general instructions on how we could approach this? Of course, having a pipeline ready to simply plug in the data and fine-tune the models would be amazing for simplicity's sake, but it would be nice to have more general, practical information on the process.
Thank you!