jbrry opened this issue 4 years ago
I have added a folder to the Irish Data folder on the drive. It has 2 text files of tweets (26493 tweets in total) and a README to describe how they were gathered.
Created https://github.com/jbrry/Irish-UD-Parsing/issues/9 on supporting unsupervised domain adaptation in our parser.
Observations:
line.decode('utf-8') for line in binaryfile.readlines() reports no errors.
The files contain the HTML character entities for &, < and >, but also 37 occurrences of & on its own, and two numeric character references (').
.@jbrry Is the folder Lauren added to the Irish Folder part of our current pipeline? I don't see it mentioned in Sec 2 of our paper. It should not be hidden under IMT, especially if Dowling et al. (2018, 2020) do not describe it.
I believe we decided to exclude it from the model pretraining data but to incorporate an experiment where we use it as fine-tuning data. I'm not sure I have that in writing anywhere, but I remember the general consensus being that it wasn't urgent to add it to the pipeline, and all Twitter-related files were deliberately marked with a 0 in our gdrive_filelist.csv.
Confirmed in a fresh copy of gdrive_filelist.csv, and replaced the paper TODO with a green note.
$ grep -i tw gdrive_filelist.csv
0,data/ga/gdrive/Tweets/Lauren_twitter_corpus.txt
0,data/ga/gdrive/Tweets/README.md
0,data/ga/gdrive/Tweets/Teresa_twitter_corpus.txt
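A small sketch of how the pipeline's file selection would behave given rows in that flag,path format, assuming a flag of 1 means "include in pretraining" and 0 means "exclude" (the non-Twitter path below is a hypothetical example, not a real entry):

```python
import csv
from io import StringIO

def select_pretraining_files(csv_text):
    """Parse gdrive_filelist.csv-style rows ('flag,path') and return
    the paths whose flag is 1, i.e. files included in the pretraining
    pipeline. Rows flagged 0 (such as the Twitter files) are skipped.
    """
    selected = []
    for flag, path in csv.reader(StringIO(csv_text)):
        if flag.strip() == "1":
            selected.append(path.strip())
    return selected
```

With the Twitter rows above all flagged 0, none of them would reach the pretraining data.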
Lauren mentioned that she will be using parser bootstrapping to annotate the Irish Twitter UD treebank.
The current data used in ga_BERT might not be well suited to parsing tweets. It might be a good idea to create a ga_BERT model tailored to Irish Twitter data. This could either be:

1. ga_BERT (with all data) with continued pre-training on a corpus of Irish tweets.
2. ga_tweeBERT (or some other name) trained on all the data used in ga_BERT plus the ga Twitter data. This model is initialised from scratch so the vocabulary contains code-switched tokens, acronyms, slang, etc.

For reference, see: BERTweet