write functions to tokenize text and annotations

ncsa / empirical-ip-law

This repository is for the IC-377 research


write functions to tokenize text and annotations #5

Closed WeihaoGe1009 closed 1 week ago

WeihaoGe1009 commented 1 month ago

Write functions that use AutoTokenizer to process texts with models from Hugging Face.
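A minimal sketch of what such a function could look like, assuming the `transformers` package is installed. The helper names `load_tokenizer` and `tokenize_texts` and the 512 default are hypothetical, chosen for illustration and not taken from the repository.

```python
from typing import Any, Callable, Dict, List

# Assumed default length; 512 is the unit discussed later in this thread.
DEFAULT_MAX_LENGTH = 512


def load_tokenizer(model_name: str) -> Any:
    """Load a Hugging Face tokenizer by checkpoint name (hypothetical helper)."""
    # Imported lazily so tokenize_texts can be exercised without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)


def tokenize_texts(
    tokenizer: Callable[..., Dict[str, Any]],
    texts: List[str],
    max_length: int = DEFAULT_MAX_LENGTH,
) -> Dict[str, Any]:
    """Tokenize a batch of texts, padding or truncating each to max_length."""
    return tokenizer(
        texts,
        padding="max_length",  # every row gets the same length
        truncation=True,       # longer texts are cut at max_length
        max_length=max_length,
    )
```

Typical use would be `tokenize_texts(load_tokenizer(name), texts)` with whichever checkpoint name is being tested. Llama tokenizers usually ship without a padding token, so padding would likely need one set first (for example `tokenizer.pad_token = tokenizer.eos_token`).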

WeihaoGe1009 commented 1 month ago

Prototype written. Need to install more packages and perform some test runs, and need to test with different models.

- [ ] bert-basic
  - [ ] create tokens
  - [ ] process format to match shapes for train_test_split
  - [ ] dataset stored
- [ ] legal-bert
  - [ ] create tokens
  - [ ] process format to match shapes for train_test_split
  - [ ] dataset stored
- [ ] llama-3.1-8B
  - [ ] create tokens
  - [ ] process format to match shapes for train_test_split
  - [ ] dataset stored
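For the "match shapes" step, one possible approach is to pad every example's token IDs and annotation labels to a common length so they stack into equally sized rows. This is a sketch only: `pad_example` and `build_arrays` are hypothetical names, `pad_id=0` is an assumption about the tokenizer, and `ignore_label=-100` is an assumption that mirrors the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Sequence, Tuple


def pad_example(
    input_ids: Sequence[int],
    labels: Sequence[int],
    max_len: int,
    pad_id: int = 0,           # assumed padding token ID
    ignore_label: int = -100,  # assumed label for padded positions
) -> Tuple[List[int], List[int], List[int]]:
    """Truncate or pad one example so IDs, mask, and labels all have max_len items."""
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    ids = list(input_ids)[:max_len]
    labs = list(labels)[:max_len]
    n_pad = max_len - len(ids)
    mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, mask, labs + [ignore_label] * n_pad


def build_arrays(
    examples: Sequence[Tuple[Sequence[int], Sequence[int]]],
    max_len: int,
) -> Tuple[List[List[int]], List[List[int]], List[List[int]]]:
    """Pad every (input_ids, labels) pair and collect them into three parallel lists."""
    all_ids, all_masks, all_labels = [], [], []
    for input_ids, labels in examples:
        ids, mask, labs = pad_example(input_ids, labels, max_len)
        all_ids.append(ids)
        all_masks.append(mask)
        all_labels.append(labs)
    return all_ids, all_masks, all_labels
```

Because the three lists have the same number of rows, they could then be passed together to scikit-learn's `train_test_split` (for example `train_test_split(ids, masks, labels, test_size=0.2)`), which splits all of them with the same indices.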

WeihaoGe1009 commented 1 month ago

Writing a general function with AutoTokenizer and AutoModelForCausalLM. Not sure whether model-specific tokenizer and model classes would work better; starting with the general function at this stage. LLMTokenizater does not recognize the llama-3.1-8B model.

WeihaoGe1009 commented 1 week ago

bert-basic has a limited token size and cannot handle the "unmatched tokens" issue between input and output when testing tokens with a small prompting example. After discussion with David Bianchi, this looks like a TensorFlow issue. Need to look into it so that we can still use legal-bert later.

WeihaoGe1009 commented 1 week ago

Things to do next:

- test long-bert models for tokenization
- explore ways to tokenize the texts in 512-token units
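One way the 512-token idea could be explored is to tokenize the full text once and then slice the resulting IDs into fixed-size windows, optionally overlapping so context is not lost at the boundaries. The sketch below is plain Python over a list of token IDs; `chunk_token_ids` and its parameters are hypothetical names, not existing project code.

```python
from typing import List, Sequence


def chunk_token_ids(
    token_ids: Sequence[int],
    chunk_size: int = 512,
    overlap: int = 0,
) -> List[List[int]]:
    """Split token IDs into windows of at most chunk_size, sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + chunk_size]))
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(token_ids):
            break
    return chunks
```

For BERT-style models the 512 limit includes the special tokens, so if `[CLS]` and `[SEP]` are added to each chunk, a `chunk_size` of 510 would probably be the safer choice. Hugging Face tokenizers also accept `return_overflowing_tokens=True` together with `stride`, which may make a hand-written splitter unnecessary.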