jbrry Irish-BERT issues

Repository to store helper scripts for creating an Irish BERT model.

Other

9 stars, 0 forks

issues

Computation budget plot

#79 jowagner opened 3 years ago
0
Switch to UD v2.8?

#78 jowagner closed 3 years ago
2
Upgrade to Paracrawl v9

#77 jowagner opened 3 years ago
8
Use Electra for development instead of bert-128?

#76 jowagner closed 3 years ago
11
Is XPOS prediction task in fine-tuning confusing the model?

#75 jowagner opened 3 years ago
1
Effect of joining sentence piece and word piece vocabulary

#74 jowagner closed 3 years ago
1
Effect of filtering (near) duplicates

#73 jowagner opened 3 years ago
0
Models we trained for summer 2021 (was: New models to be run)

#72 jbrry closed 3 years ago
9
paper: need list of all funding for the acknowledgements

#71 jowagner closed 2 years ago
1
license text: expand list of grants to include all funding sources

#70 fosterjen opened 3 years ago
1
split_tokenised_text_into_sentences.py outputs one token per line instead of one sentence per line

#69 jbrry closed 3 years ago
2
paper: compare ga_bert against XLM-R

#68 jowagner closed 3 years ago
1
Pipeline reports null bytes and jwagner file metadata in oscar corpus

#67 jowagner closed 3 years ago
3
Apply udpipe sentence splitter on appropriate units of text

#66 jbrry closed 3 years ago
4
Is training data memorisation reduced if we train for fewest epochs with good task performance?

#65 jowagner opened 3 years ago
3
All lowercased output from nlp.tokeniser with pipeline and our model

#64 jowagner closed 3 years ago
5
Support multiple MASK tokens in LM inspector

#63 jowagner opened 3 years ago
0
Rename unusable vocabulary entries

#62 jowagner opened 3 years ago
8
Need to edit existing BERT code to save more than n checkpoints

#61 jbrry opened 3 years ago
0
Check anonymised version of DCHG corpus for character issues

#60 jowagner opened 3 years ago
0
options to support issue #58

#59 jowagner closed 3 years ago
0
Check importance of sentence splitting

#58 jowagner opened 3 years ago
10
Check for toxic content or ability to generate toxic output

#57 jowagner opened 3 years ago
0
Teach BERT more about sentence boundaries

#56 jowagner opened 3 years ago
1
Train an electra model

#55 jowagner opened 3 years ago
3
Restrict BERT vocabulary building to clean corpora

#54 jowagner opened 3 years ago
0
Increase weight of clean corpora such as NCI

#53 jowagner opened 3 years ago
1
Enh sentsplit

#52 jowagner closed 3 years ago
1
Inspect how BERT tokenization affects tokens which are composed of characters and punctuation

#51 jowagner closed 3 years ago
10
Concatenate output of different tokenisers

#50 jowagner opened 3 years ago
1
Is confidential training data sufficiently protected?

#49 jowagner opened 3 years ago
2
feature request: download handlers to skip existing files

#48 jowagner opened 3 years ago
0
readme: what bucketsize should be used?

#47 jowagner closed 2 years ago
3
What happens with sentences greater than 128 tokens in length with BERT

#46 jbrry opened 3 years ago
0
Improve sentence splitter for tokenised text

#45 jowagner opened 3 years ago
6
Investigate sentpiece vocabulary conversion

#44 jowagner opened 3 years ago
0
Can we move gdrive_filelist.csv to the repo?

#43 jowagner opened 3 years ago
5
Populate unused vocabulary entries of our mBERT-based models

#42 jowagner opened 3 years ago
1
Include unused entries in vocabulary of "from scratch" models

#41 jowagner closed 3 years ago
6
Why are long sentences removed?

#40 jowagner opened 3 years ago
3
What is filtered out?

#39 jowagner opened 3 years ago
6
Switch to 2.7 of IDT

#38 fosterjen closed 3 years ago
1
Experiment with latest version of paracrawl (7.1)

#37 fosterjen closed 3 years ago
1
Handling of new emoji and other OOVs

#36 jowagner opened 3 years ago
0
Provide up-to-date pre-processed text files

#35 jowagner opened 3 years ago
8
Create a ga_BERT model which does continued pre-training on Irish Tweets or is trained from scratch with twitter data

#34 jbrry opened 4 years ago
6
Merge subcorpus-specific wordpiece vocabularies

#33 jowagner opened 4 years ago
5
Decoding in text files (character reference entities)

#32 alanagiasi opened 4 years ago
6
NCI: inconsistent <s> and <p> tags

#31 jowagner closed 4 years ago
1
Robustness to missing accents, all-caps text and other deviations from well-edited text

#30 jowagner opened 4 years ago
2
