jbrry opened this issue 4 years ago
I have added a folder to the Irish Data folder on the drive. It has 2 text files of tweets (26493 tweets in total) and a README to describe how they were gathered.
Created https://github.com/jbrry/Irish-UD-Parsing/issues/9 on supporting unsupervised domain adaptation in our parser.
Observations:
line.decode('utf-8') for line in binaryfile.readlines() reports no errors.
The files contain the HTML character entities for &, < and >, but also 37 occurrences of & on its own, and two numeric character references (').
.@jbrry Is the folder Lauren added to the Irish Folder part of our current pipeline? I don't see it mentioned in Sec 2 of our paper. It should not be hidden under IMT, especially if Dowling et al. (2018, 2020) do not describe it.
I believe we decided to exclude it from the model pretraining data but to incorporate an experiment where we use it as fine-tuning data. I'm not sure I have that in writing anywhere, but I remember the general consensus being that it wasn't urgent to add it to the pipeline, and all Twitter-related files were deliberately marked with a 0 in our gdrive_filelist.csv.
Confirmed in a fresh copy of gdrive_filelist.csv, and replaced the paper TODO with a green note.
$ grep -i tw gdrive_filelist.csv
0,data/ga/gdrive/Tweets/Lauren_twitter_corpus.txt
0,data/ga/gdrive/Tweets/README.md
0,data/ga/gdrive/Tweets/Teresa_twitter_corpus.txt
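A small sketch of how the pipeline's file selection would behave given rows in that flag,path format, assuming a flag of 1 means "include in pretraining" and 0 means "exclude" (the non-Twitter path below is a hypothetical example, not a real entry):

```python
import csv
from io import StringIO

def select_pretraining_files(csv_text):
    """Parse gdrive_filelist.csv-style rows ('flag,path') and return
    the paths whose flag is 1, i.e. files included in the pretraining
    pipeline. Rows flagged 0 (such as the Twitter files) are skipped.
    """
    selected = []
    for flag, path in csv.reader(StringIO(csv_text)):
        if flag.strip() == "1":
            selected.append(path.strip())
    return selected
```

With the Twitter rows above all flagged 0, none of them would reach the pretraining data.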
Lauren mentioned that she will be using parser bootstrapping to annotate the Irish Twitter UD treebank.
The current data used in ga_BERT might not be well suited to parsing tweets. It might be a good idea to create a ga_BERT model tailored to Irish Twitter data. This could either be:

1. ga_BERT (with all data) with continued pre-training on a corpus of Irish tweets.
2. ga_tweeBERT (or some other name) trained on all the data used in ga_BERT plus the ga Twitter data. This model is initialised from scratch so the vocabulary contains code-switched tokens, acronyms, slang, etc.

For reference, see: BERTweet