jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Create a ga_BERT model which does continued pre-training on Irish tweets or is trained from scratch with Twitter data #34

Open jbrry opened 3 years ago

jbrry commented 3 years ago

Lauren mentioned that she will be using parser-bootstrapping to annotate the Irish Twitter UD treebank.

The current data used in ga_BERT may not be well suited to parsing tweets. It might be a good idea to create a ga_BERT model tailored to Irish Twitter data. This could be either:

  1. ga_BERT (trained on all data) with continued pre-training on a corpus of Irish tweets.
  2. ga_tweeBERT (or some other name), trained on all the data used in ga_BERT plus Irish Twitter data. This model is initialised from scratch so that its vocabulary contains code-switched tokens, acronyms, slang, etc.

For reference, see: BERTweet
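
To illustrate why the vocabulary matters for option 2, here is a minimal sketch of greedy longest-match-first WordPiece splitting, the tokenisation scheme used by BERT. The two vocabularies and the example tweet are toy, hypothetical data and are not taken from ga_BERT; the helper names `wordpiece` and `pieces_per_word` are also made up for this illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def pieces_per_word(text, vocab):
    """Average number of subword pieces per whitespace-separated word."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(len(wordpiece(w, vocab)) for w in words) / len(words)


# Toy vocabularies (hypothetical): the second one also contains "lol".
generic_vocab = {"tá", "mé", "go", "maith", "l", "##o", "##l"}
tweet_vocab = generic_vocab | {"lol"}

tweet = "Tá mé go maith lol"
print(wordpiece("lol", generic_vocab))        # slang is split into pieces
print(pieces_per_word(tweet, generic_vocab))  # more pieces per word
print(pieces_per_word(tweet, tweet_vocab))    # fewer pieces per word
```

With continued pre-training (option 1) the vocabulary stays fixed, so Twitter-specific tokens would still be split as in the first case; a vocabulary built from scratch on data that includes tweets (option 2) could cover them directly.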

laurenCassidy commented 3 years ago

I have added a folder to the Irish Data folder on the drive. It has 2 text files of tweets (26493 tweets in total) and a README to describe how they were gathered.

jowagner commented 3 years ago

Created https://github.com/jbrry/Irish-UD-Parsing/issues/9 on supporting unsupervised domain adaptation in our parser.

jowagner commented 3 years ago

Observations:

jowagner commented 3 years ago

@jbrry Is the folder Lauren added to the Irish Data folder part of our current pipeline? I don't see it mentioned in Sec. 2 of our paper. It should not be hidden under IMT, especially if Dowling et al. (2018, 2020) do not describe it.

jbrry commented 3 years ago

I believe we decided to exclude it from the model's pre-training data but to include an experiment in which we use it as fine-tuning data. I'm not sure I have that in writing anywhere, but I remember the general consensus being that it wasn't urgent to add it to the pipeline, and all Twitter-related files were deliberately marked with a 0 in our gdrive_filelist.csv.

jowagner commented 3 years ago

Confirmed in a fresh copy of gdrive_filelist.csv and replaced the paper TODO with a green note.

$ grep -i tw gdrive_filelist.csv 
0,data/ga/gdrive/Tweets/Lauren_twitter_corpus.txt
0,data/ga/gdrive/Tweets/README.md
0,data/ga/gdrive/Tweets/Teresa_twitter_corpus.txt
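
The same check can be sketched in Python. This assumes that gdrive_filelist.csv has two comma-separated columns (a flag, then a path) and no header row, as the grep output above suggests, and that any flag other than 0 means the file is included in the pipeline; the second point is an assumption, since only the meaning of 0 is stated in this thread. The function name `split_by_flag` is hypothetical.

```python
import csv
import io

# Sample rows copied from the grep output above.
SAMPLE = """\
0,data/ga/gdrive/Tweets/Lauren_twitter_corpus.txt
0,data/ga/gdrive/Tweets/README.md
0,data/ga/gdrive/Tweets/Teresa_twitter_corpus.txt
"""


def split_by_flag(lines):
    """Split file paths into (included, excluded) lists by their flag.

    Assumption: flag "0" means excluded; any other flag means included.
    """
    included, excluded = [], []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        flag, path = row[0].strip(), row[1].strip()
        (excluded if flag == "0" else included).append(path)
    return included, excluded


included, excluded = split_by_flag(io.StringIO(SAMPLE))
twitter_included = [p for p in included if "tw" in p.lower()]
print("Twitter files included in the pipeline:", twitter_included)
print("Twitter files excluded:", [p for p in excluded if "tw" in p.lower()])
```

To run it against the real file, pass an open file handle instead of `io.StringIO(SAMPLE)`.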