jbrry/Irish-BERT
Repository to store helper scripts for creating an Irish BERT model.
Other
9 stars, 0 forks
issues
Computation budget plot
#79
jowagner
opened
3 years ago
0
Switch to UD v2.8?
#78
jowagner
closed
3 years ago
2
Upgrade to Paracrawl v9
#77
jowagner
opened
3 years ago
8
Use Electra for development instead of bert-128?
#76
jowagner
closed
3 years ago
11
Is XPOS prediction task in fine-tuning confusing the model?
#75
jowagner
opened
3 years ago
1
Effect of joining sentence piece and word piece vocabulary
#74
jowagner
closed
3 years ago
1
Effect of filtering (near) duplicates
#73
jowagner
opened
3 years ago
0
Models we trained for summer 2021 (was: New models to be run)
#72
jbrry
closed
3 years ago
9
paper: need list of all funding for the acknowledgements
#71
jowagner
closed
2 years ago
1
license text: expand list of grants to include all funding sources
#70
fosterjen
opened
3 years ago
1
split_tokenised_text_into_sentences.py outputs one token per line instead of one sentence per line
#69
jbrry
closed
3 years ago
2
paper: compare ga_bert against xml-r
#68
jowagner
closed
3 years ago
1
Pipeline reports null bytes and jwagner file metadata in oscar corpus
#67
jowagner
closed
3 years ago
3
Apply udpipe sentence splitter on appropriate units of text
#66
jbrry
closed
3 years ago
4
Is training data memorisation reduced if we train for fewest epochs with good task performance?
#65
jowagner
opened
3 years ago
3
All lowercased output from nlp.tokeniser with pipeline and our model
#64
jowagner
closed
3 years ago
5
Support multiple MASK tokens in LM inspector
#63
jowagner
opened
3 years ago
0
Rename unusable vocabulary entries
#62
jowagner
opened
3 years ago
8
Need to edit existing BERT code to save more than n checkpoints
#61
jbrry
opened
3 years ago
0
Check anonymised version of DCHG corpus for character issues
#60
jowagner
opened
3 years ago
0
options to support issue #58
#59
jowagner
closed
3 years ago
0
Check importance of sentence splitting
#58
jowagner
opened
3 years ago
10
Check for toxic content or ability to generate toxic output
#57
jowagner
opened
3 years ago
0
Teach BERT more about sentence boundaries
#56
jowagner
opened
3 years ago
1
Train an electra model
#55
jowagner
opened
3 years ago
3
Restrict BERT vocabulary building to clean corpora
#54
jowagner
opened
3 years ago
0
Increase weight of clean corpora such as NCI
#53
jowagner
opened
3 years ago
1
Enh sentsplit
#52
jowagner
closed
3 years ago
1
Inspect how BERT tokenization affects tokens which are composed of characters and punctuation
#51
jowagner
closed
3 years ago
10
Concatenate output of different tokenisers
#50
jowagner
opened
3 years ago
1
Is confidential training data sufficiently protected?
#49
jowagner
opened
3 years ago
2
feature request: download handlers to skip existing files
#48
jowagner
opened
3 years ago
0
readme: what bucketsize should be used?
#47
jowagner
closed
2 years ago
3
What happens with sentences greater than 128 tokens in length with BERT
#46
jbrry
opened
3 years ago
0
Improve sentence splitter for tokenised text
#45
jowagner
opened
3 years ago
6
Investigate sentpiece vocabulary conversion
#44
jowagner
opened
3 years ago
0
Can we move gdrive_filelist.csv to the repo?
#43
jowagner
opened
3 years ago
5
Populate unused vocabulary entries of our mBERT-based models
#42
jowagner
opened
3 years ago
1
Include unused entries in vocabulary of "from scratch" models
#41
jowagner
closed
3 years ago
6
Why are long sentences removed?
#40
jowagner
opened
3 years ago
3
What is filtered out?
#39
jowagner
opened
3 years ago
6
Switch to 2.7 of IDT
#38
fosterjen
closed
3 years ago
1
Experiment with latest version of paracrawl (7.1)
#37
fosterjen
closed
3 years ago
1
Handling of new emoji and other OOVs
#36
jowagner
opened
3 years ago
0
Provide up-to-date pre-processed text files
#35
jowagner
opened
3 years ago
8
Create a ga_BERT model which does continued pre-training on Irish Tweets or is trained from scratch with twitter data
#34
jbrry
opened
4 years ago
6
Merge subcorpus-specific wordpiece vocabularies
#33
jowagner
opened
4 years ago
5
Decoding in text files (character reference entities)
#32
alanagiasi
opened
4 years ago
6
NCI: inconsistent <s> and <p> tags
#31
jowagner
closed
4 years ago
1
Robustness to missing accents, all-caps text and other deviations from well-edited text
#30
jowagner
opened
4 years ago
2