Copy and try out Tokenizer code

jjm82 commented 12 months ago

New high scores are achieved by this code: https://www.kaggle.com/code/hubert101/0-960-phrases-are-keys/notebook (it still uses a TF-IDF vectorizer but also splits the text into tokens, which is how GPT models split up text).
Copy and paste it into main.
Download the datasets it uses and make sure the paths agree.
In the last main block of code, where everything runs, create a variable model = "model_name". Give the old model and the new model each a name, and put the corresponding blocks of code inside if statements: if model == "old_model" and if model == "new_model".
Run, update Kaggle, and test the submission.
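
A minimal sketch of the model switch described in the steps above. The function names `run_old_model` and `run_new_model` are hypothetical placeholders, and the constant scores they return are dummy values; the real bodies would be the existing pipeline and the code copied from the notebook.

```python
def run_old_model(train_texts, test_texts):
    """Hypothetical placeholder for the existing pipeline."""
    # Dummy scores only; replace with the current model's code.
    return [0.0 for _ in test_texts]


def run_new_model(train_texts, test_texts):
    """Hypothetical placeholder for the code copied from the notebook."""
    # Dummy scores only; replace with the tokenizer-based code.
    return [1.0 for _ in test_texts]


def run(model, train_texts, test_texts):
    """Dispatch on the model name so both pipelines stay runnable."""
    if model == "old_model":
        return run_old_model(train_texts, test_texts)
    if model == "new_model":
        return run_new_model(train_texts, test_texts)
    # Fail loudly on a typo instead of silently running nothing.
    raise ValueError(f"unknown model: {model!r}")


if __name__ == "__main__":
    model = "new_model"  # switch to "old_model" to rerun the previous pipeline
    predictions = run(model, ["train essay"], ["test essay 1", "test essay 2"])
    print(model, predictions)
```

Keeping both branches behind one variable makes it easy to compare the two submissions without deleting the old code.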

jjm82 commented 12 months ago

In general, the tokenizer approach seems to be doing well. The following discussion should be useful, and its author seems to be a good one to follow: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/458522

jjm82 commented 12 months ago

An important thing the author said: "In my experiments, I've found that a significant portion of score improvement comes from tweaking the vectorization part."
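
To make "tweaking the vectorization part" concrete, here is a rough, standard-library-only sketch of TF-IDF over token n-grams, where the n-gram range is one knob to experiment with. It is not the notebook's code: `toy_tokenize` is a hypothetical stand-in for a trained subword tokenizer, and the smoothed idf formula is just one common variant, chosen here as an assumption.

```python
import math
import re
from collections import Counter


def toy_tokenize(text):
    # Stand-in for a real subword tokenizer (assumption): lowercase words
    # and individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def token_ngrams(tokens, n_min, n_max):
    # Join every run of n consecutive tokens, for n in [n_min, n_max].
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(docs, n_min=1, n_max=1):
    # Raw n-gram counts per document.
    counts = [Counter(token_ngrams(toy_tokenize(d), n_min, n_max)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n_docs = len(docs)
    # Smoothed idf, one common variant (assumption).
    idf = {g: math.log((1 + n_docs) / (1 + k)) + 1.0 for g, k in df.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


docs = ["the cat sat", "the dog sat"]
for n_min, n_max in [(1, 1), (1, 2), (2, 2)]:
    vectors = tfidf(docs, n_min, n_max)
    vocab = set().union(*vectors)
    print((n_min, n_max), len(vocab), "features")
```

Changing `n_min` and `n_max` changes which phrases become features, which is the kind of adjustment the quoted remark seems to point at.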

hunterchewitt-usc / LLM---Detect-AI-Generated-Text

Copy and try out Tokenizer code #11