Open · kaansonmezoz opened this issue 2 years ago
Hi @kaansonmezoz,
thanks for your interest in our models :hugs:
```python
from nltk.tokenize import sent_tokenize

for sent in sent_tokenize(line, "turkish"):
    if len(sent.split()) > 5:
        print(sent)
```
So it is not only applied to the OSCAR subcorpus here.
I used sentences longer than 5 tokens (split on whitespace), see above :)
Sentence segmentation is not based on full stops alone; NLTK's Punkt tokenizer considers some more punctuation tokens as sentence boundaries.
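To illustrate that point, here is a minimal sketch (the Turkish sample sentences are made up for the example, not taken from the corpus) showing that the Punkt model also splits on question and exclamation marks:

```python
from nltk.tokenize import sent_tokenize

# nltk.download("punkt") may be needed once to fetch the Punkt sentence models.
# The sample text below is purely illustrative.
text = "Bugün hava çok güzel! Yarın yağmur yağacak mı? Bilmiyorum."
print(sent_tokenize(text, "turkish"))
# Expected (roughly): ['Bugün hava çok güzel!', 'Yarın yağmur yağacak mı?', 'Bilmiyorum.']
```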
I just looked it up in my "data lake": the trwiki-latest-pages-articles.xml.bz2 dump is 480M in size and has a timestamp of 2 Feb 2020.
I could find the following OPUS-related files:
bible-uedin.txt GNOME.txt JW300.txt OpenSubtitles.txt opus.all QED.txt SETIMES.txt Tanzil.txt Tatoeba.txt TED2013.txt Wikipedia.txt
All with a timestamp of 3 Feb 2020.
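For anyone trying to reproduce a similar setup, here is a minimal sketch of how the listed OPUS files could be concatenated and filtered with the snippet above; the file loop and the output path are assumptions for illustration, not the exact script that was used:

```python
from nltk.tokenize import sent_tokenize

# OPUS-related files as listed above; paths and output name are illustrative assumptions.
opus_files = [
    "bible-uedin.txt", "GNOME.txt", "JW300.txt", "OpenSubtitles.txt",
    "QED.txt", "SETIMES.txt", "Tanzil.txt", "Tatoeba.txt",
    "TED2013.txt", "Wikipedia.txt",
]

with open("opus_filtered.txt", "w", encoding="utf-8") as out:
    for path in opus_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Same filter as above: keep sentences with more than 5 whitespace-split tokens.
                for sent in sent_tokenize(line.strip(), "turkish"):
                    if len(sent.split()) > 5:
                        out.write(sent + "\n")
```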
Please just give me your mail address and I can immediately send you the link to the corpus used for pre-training :hugs:
Hi @stefan-it, can I get the links to the corpus used for pre-training? Thanks,
Hey @hazalturkmen, no problem, just give me an email address where I can contact you :hugs:
Thanks @stefan-it, here is my email address:
hazalturkmen91@gmail.com
@stefan-it Thank you for the detailed explanation. My email is sonmezozkaan@gmail.com
You are a life saver! ❤️
Mails are out :hugs:
Hello Stefan,
I'm going to train another BERT model with a different pre-training objective from scratch. Then I will use it to compare with BERTurk and other Turkish pre-trained language models. To evaluate the impact of the pre-training task properly, the model should be trained with similar data and parameters.
In the README file it was stated that:
I've already collected Kemal Oflazer's corpus and the OSCAR corpus. But there are things I'm curious about. If you can answer them, I will be happy.
WikiMatrix v1, Wikipedia and wikimedia v20210402. Did you use them too? Also, if you have the public datasets' corpora, do you mind sharing them? It would make things a lot easier for me and save me a lot of trouble.
Thanks in advance.