Open kpu opened 4 years ago
Thanks @kpu ! Will add it to our dataset list.
Have noticed it's largely religious - I'd imagine it boils down to being the JW300 + Quran - do you have any sense of what else might have ended up in there?
A quick wordcloud
By the way, if you want the really noisy stuff before cleaning: https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.classified.gz https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.classified.gz
Taking a skim over the top sites:
gospelgo.com is just Bible quotes
islamhouse.com is religious but not necessarily Quran quotes
builttobrag.com is religious but not necessarily Bible quotes
jw.org you hopefully already have
www.grace-and-truth.net and www.waters-of-life.net sound pretty religious.
We seem to have picked up a lot of MT output from the transposh.org plugin: newsrule.com, e-activo.org, transposh.org, and datemypet.com. I should feed back a detector for transposh and throw out all of that.
Domains in en-ha:
49062 gospelgo.com
34275 islamhouse.com
13150 builttobrag.com
4332 www.grace-and-truth.net
3428 www.waters-of-life.net
974 wol.jw.org
806 www.jw.org
562 www.datemypet.com
520 simonlawton.com
439 wap.divinerevelations.info
439 divinerevelations.info
372 www.mcreveil.org
353 www.gotquestions.org
348 www.bitbybitbook.com
328 therefugeecenter.org
240 ukraine.admission.center
204 www.wor.org
192 gotquestions.org
166 bellyfatlossreview.com
164 www.gsm-dinleme.com
129 jesusforafrica.net
116 newsandfeaturesonindonesia.blogspot.com
104 www.healthworksnewcastle.org.uk
94 gospel.net
93 macgateway.com
88 lightministries.com
84 newsrule.com
78 www.mystorywithgod.com
78 englishteacherfred.com
76 www.faithfulwordbaptist.org
60 shroudofturinnews.com
50 www.dw.com
43 alwilayahnews.net
42 www.sayadi-al-nas.com
42 www.sayadi-al-nas.ae
42 sayadi-al-nas.ae
34 dateandtime.info
33 pastorpaulvbsblog.blogspot.in
28 www.iroy.in
25 ldsecrets.com
25 kelleysview.com
25 jesusministriesexhortations.blogspot.de
22 nuriddeen.blogspot.it
21 sayadi-al-nas.com
19 www.juliettebaysham.co.uk
18 www.morehacks.net
17 davidstoutconsulting.com
16 www.xyxx.com.au
14 nuriddeen.blogspot.co.uk
14 islamland.com
14 advancedlawofattractiontraininginstitute.com
13 nuriddeen.blogspot.co.il
12 mt4indicators.com
12 harkarmusulunci.org
12 crushtheciaexam.com
10 www.alhassanain.com
10 www.agrosoftltd.com
10 tambayadaamsa.blogspot.com
10 icine.org
9 archbishop-cranmer.blogspot.ie
8 hinhdep.com.vn
7 mormanity.blogspot.co.il
7 kingswaybibleschool.co.za
5 www.4laws.com
5 sunnibook.blogspot.co.uk
5 rasululaazam.org
5 harunaabubakarshika.blogspot.com.ng
5 anhducblogs.blogspot.fr
5 abandonware.com
4 www.caseguru.com
4 www.bbc.com
4 languagesoftheworld.co.uk
4 kratomonline.org
4 inamafita.blogspot.de
4 global.bfsu.edu.cn
4 fasahar-intanet.blogspot.com.ng
4 alhassanain.com
3 www.sathyaananda.it
3 taskarkanywood.blogspot.com
3 nuriddeen.blogspot.com
3 ministryhouse.org
3 manessmorrison2.blogspot.com
3 jerusalemgraffiti.com
3 israelect.com
3 ismamedicalcareinitiatives.blogspot.com
3 fasahar-intanet.blogspot.in
3 fasahar-intanet.blogspot.co.uk
3 ericmansfield.blogspot.de
3 eastsidebaptistkm.org
3 del-lords.com
3 codingbytodesign.net
3 azrefs.org
2 www.marysrosaries.com
2 www.havenproject-hull.org.uk
2 www.equalityontrial.com
2 thelatterdays.blogspot.nl
2 sureofheaven.blogspot.com
2 standrews-saskatoon.net
2 sherilynshines.blogspot.ca
2 ramadan-1428.blogspot.com
2 ragmopandgoose.com
2 munbarin-musulunci.blogspot.it
2 meridianflights.com
2 knittedbygodsplan.blogspot.com.br
2 kimiyyah.blogspot.in
2 hdfree.se
2 hau.timegenie.com
2 hausa.irib.ir
2 hanaonline.co.uk
2 halofanon.wikia.com
2 freethoughtblogs.com
2 espvisuals.blogspot.hk
2 duniyarcomputer.com
2 dickinsonadventures.com
2 coastalresearch.org
2 clevelandpriest.blogspot.com
2 blbooks.blogspot.ch
2 bahaushensabonkarni.blogspot.com
2 agajingi1.blogspot.com
1 www.yardsalebloodbath.com
1 www.timegenie.com
1 www.pillartopost.org
1 www.hurog.com
1 www.ewtn.com
1 whitefieldsprayer.blogspot.jp
1 trustyourlife.blogspot.com
1 studentofmotherhood.blogspot.ca
1 streema.com
1 stmichael-delaware-oca.org
1 sexualobjectification.blogspot.com.es
1 royaparsay.blogspot.ca
1 raisethethunderbeam.blogspot.fr
1 quoradimonds.blogspot.in
1 perfumedkisses2.blogspot.it
1 news.bbc.co.uk
1 mystical-politics.blogspot.de
1 members.tripod.com
1 inamafita.blogspot.co.uk
1 huboutourvillegenealogy.com
1 harunaabubakarshika.blogspot.it
1 gidandabino.blogspot.de
1 forumresor.se
1 en.a9.com.tr
1 ctvc.se
1 cielodrive.com
1 carmanlicciardello.blogspot.ca
1 battleshippretension.com
1 basicchristian.info
1 atelim.com
1 amen.dk
1 aliyahbyaccident.blogspot.co.id
1 abuashar.blogspot.com
And in en-ig:
19338 gospelgo.com
17542 builttobrag.com
9178 www.e-activo.org
5530 newsrule.com
4536 www.datemypet.com
4388 transposh.org
3000 www.waters-of-life.net
2964 www.gfesport.com
2842 ig.usa-casino-online.com
2778 mt4indicators.com
2190 trenboloneresults.com
2114 www.bitbybitbook.com
1960 www.healthworksnewcastle.org.uk
1894 spyera.com
1796 jobdescriptionsample.org
1795 www.jw.org
1748 crushtheciaexam.com
1696 www.morehacks.net
1650 www.the-tailoress.com
1614 usa-casino-online.com
1366 mobhax.com
1346 ispyoo.com
1276 fr.glosbe.com
1151 wol.jw.org
1121 bellyfatlossreview.com
1104 newsandfeaturesonindonesia.blogspot.com
1062 meridianflights.com
1039 www.iroy.in
929 autocarandinsurance.com
914 tortlay.com
856 dogma.swiftspirit.co.za
852 www.parisdakar.it
793 glosbe.com
722 www.wayscan.com
718 kelleysview.com
623 thetopbestdeals365.co.uk
622 simonlawton.com
554 www.faithfulwordbaptist.org
470 www.neu-presse.de
464 it.glosbe.com
414 www.hkmelamine.com
393 3x247.com
310 ms.glosbe.com
306 hu.glosbe.com
271 nootropicsreview.org
262 www.gsm-dinleme.com
256 powershell-guru.com
250 www.chrysangifts.com
248 softplug.com
245 beththompsonmarketing.com
228 www.promoearte.it
227 datarecoverycompany.net
218 englishteacherfred.com
206 machannkay.com
156 golftipreview.com
156 es.glosbe.com
154 hinhdep.com.vn
151 nl.glosbe.com
146 examprepbooks.com
145 westbrookhousing.org
140 www.atlantacleaningexperts.com
138 de.glosbe.com
136 abacre.com
130 www.bestwaytowhitenteethguide.org
126 vitalizedwater.net
124 www.gotquestions.org
114 advancedlawofattractiontraininginstitute.com
100 cheers4health.com
99 ur.glosbe.com
96 www.mystorywithgod.com
95 vitalizerplus.net
94 gospel.net
90 graphicsecurity.com
88 ltool.net
86 gotquestions.org
84 www.abacre.com
80 educationbro.com
76 shroudofturinnews.com
74 vitalizerplusmineralbasket.com
73 en.glosbe.com
72 thegarrisoncenter.org
68 www.ltool.net
62 www.hanskottke.de
54 www.pennyauctionwizards.com
54 installmobilespy.com
54 celltechnutrition.com
52 www.zhitov.ru
52 www.realdevil.info
52 sw.glosbe.com
50 codingbytodesign.net
50 amara.org
42 landscapersgreenvillesc.com
42 dancinginmyheels.com
40 monsoonbiz.com
38 www.english-video.net
38 crushthecfpexam.com
34 www.unicode.org
34 www.lovemediasoft.com
34 ewenchiabook.com
33 www.kfflooring.com
33 independentflorida.com
33 growfunnel.com
32 michaelhidalgo.net
32 id.glosbe.com
32 icine.org
32 abacre.net
31 vitalizerplus.me
30 kairosplanet.web.tr
28 ilanguages.org
28 freshgamehacks.com
26 learn101.org
25 vitalizerplus.biz
25 emergencywaterdamagecleanup.com
24 www.lotteryextreme.com
24 tl.glosbe.com
24 ka.glosbe.com
22 sanjoserealestatelosgatoshomes.com
20 www.uk-business-plans.co.uk
20 www.jobwhip.com
20 vitalizerplus.info
20 telefonnummervon.com
20 kratompowders.org
20 hcspeech.com
19 hanaonline.co.uk
18 www.promolux.com
16 www.expertmortgage.biz
16 waelbadawy.com
16 tv-online.in
16 statenislandpaintexperts.com
16 rentonroofingcontractor.com
15 igbounionfreiburg.de
14 vitalizer-plus.vitalizerplus.net
14 kratomonline.org
13 yesmoneyyes.com
12 obianyanwu.com
10 www.districtcolumbia.com
10 www.caseguru.com
10 sde.tw
10 mobilespytrial.com
10 gayhub.ru
9 www.bluesummary.com
9 tv-online.im
9 timeinnkwalini.onestophoteldeals.com
9 timeinchillon.onestophoteldeals.com
9 spintaxplrarticles.com
8 support.mozilla.org
8 ldspianohymns.com
8 ketosisfatlossdiet.com
8 jesusforafrica.net
6 painafterrootcanalguide.net
6 horrorwits.com
6 hazzy.harrysoft.co.uk
6 genuinelyabsurd.com
6 avibase.bsc-eoc.org
5 avilestoro.de
5 alexandersalazarfineart.com
4 www.havenproject-hull.org.uk
4 www.euro2016-tickets.com
4 vanipedia.org
4 textclips.it
4 recetadecupcakes.com
4 painreliefpainpatches.com
4 mapcarta.com
4 is.usa-casino-online.com
4 ig.wikipedia.org
4 igbounionfreiburg.com
4 gloria.tv
4 en.wikipedia.org
4 busindia.com
4 7figureautomation.com
3 yanthor.com
3 www.timegenie.com
3 www.lds.org
3 www.javierartiles.com
3 www.illustrators-online.net
3 www.enterthehealingschool.org
3 www.dlsoftware.net
3 www.blackpeoplebusiness.com
3 ibo.timegenie.com
3 enterthehealingschool.org
3 dragoncityhackandcheats.xyz
2 www.womenpriests.org
2 www.glbtguide.com
2 www.econofrost.com
2 www.carpepotentia.com
2 www.bestdraincleaners.com
2 www.answershack.com
2 www.24faster.com
2 qlranks.com
2 nanoinformatica.com
2 mobile.cardiffcityrumours.co.uk
2 immolucky.com
2 fighterstalk.com
2 fighters-quest.com
2 fatcutters.com
2 dominicweb.eu
2 crushthegretest.com
2 animalcoloring.blogspot.com
1 www.repetitivestrain.org
1 www.auspisoft.com
1 www.albionchoir.org.uk
1 vitalizerplusmineralcube.com
1 psychologytomorrowmagazine.com
1 julie-compton.com
1 imonews24.com
1 guntherspaps.blogspot.ca
1 dotted-carrier-798.appspot.com
1 autolicenseplate.com
1 angrydr.blogspot.sg
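For anyone who wants to reproduce tallies like the two above on their own slice of the data, here is a minimal sketch. It assumes you have already pulled the source URLs out of the corpus files into an iterable of strings; how the URLs are stored in the released files is not described in this thread, so that extraction step is left out, and `count_domains` is just an illustrative name.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_domains(urls):
    """Count how many URLs fall under each hostname, most frequent first.

    Hostnames are kept exactly as they appear (lowercased by urlsplit), so
    www.jw.org and wol.jw.org stay separate, as in the lists above.
    Strings that do not parse as absolute URLs are skipped.
    """
    counts = Counter()
    for url in urls:
        host = urlsplit(url.strip()).hostname
        if host:
            counts[host] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Made-up URLs on domains from the lists above, for illustration only.
    sample = [
        "https://www.jw.org/ha/a",
        "https://www.jw.org/ha/b",
        "http://gospelgo.com/x",
    ]
    for host, n in count_domains(sample):
        print(n, host)
```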
That's super cool, thanks for sharing @kpu! What's the license for this data?
License is the usual one on paracrawl.eu.
So much of this is machine translated, though. The MT output is most likely watermarked by Google, but to my knowledge Google has not documented the hash function.
Some translation plugins leave calling cards in the HTML:
<link rel="alternate machine-translated-from" hreflang="en" href="/en/observing-behavior/observing-further/">
<link rel='stylesheet' id='gtranslate-style-css'
<meta name="translation-stats"
in these sites:
http://transposh.org/es/
https://www.datemypet.com/ha/the_dog_walker
https://www.e-activo.org/en/eres-puntual/
http://www.balkan-transporte.de/spedition/en/suedeuropa-suedosteuropa/speditionen-kosovo/
http://realtyalbania.com/is/contact-us/
https://www.outlookexpresstooutlook.com/is/download/
<script data-cfasync="false" src="https://tdns3.gtranslate.net/tdn-bin/queue.js">
<aside id="gtranslate-6" class="widget widget_gtranslate">
We're going to go back to the original HTML and throw out pages with machine translation indicia like these.
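A rough sketch of what such a check on the raw HTML might look like, using only the calling cards quoted above. This is not the actual ParaCrawl filter. The marker names and the `mt_markers` function are made up for illustration, and the patterns are guesses generalised from the handful of snippets in this thread. In particular, the bare `gtranslate` substring match is deliberately loose and would also flag a page that merely mentions gtranslate.

```python
import re

# Patterns derived from the snippets quoted in this thread (assumptions,
# not an exhaustive or official list of plugin signatures).
MT_MARKERS = {
    # <link rel="alternate machine-translated-from" ...>
    "machine-translated-from": re.compile(
        r"<link\b[^>]*\brel=[\"'][^\"']*machine-translated-from", re.I
    ),
    # <meta name="translation-stats" ...>
    "translation-stats": re.compile(
        r"<meta\b[^>]*\bname=[\"']translation-stats[\"']", re.I
    ),
    # gtranslate stylesheet ids, widget classes, and gtranslate.net scripts
    "gtranslate": re.compile(r"gtranslate", re.I),
}


def mt_markers(html):
    """Return the names of all machine-translation markers found in html."""
    return [name for name, pattern in MT_MARKERS.items() if pattern.search(html)]


def looks_machine_translated(html):
    """True if the page carries at least one known marker."""
    return bool(mt_markers(html))
```

Returning the marker names rather than a bare boolean makes it easier to see which signature fired when auditing what got thrown out.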
After that, I'd really appreciate help from the community in identifying domains with obvious MT output (which should be easier for low-resource languages!) so we can ban them and release a cleaner corpus.
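If you want to try a ban list locally before a cleaner release exists, something along these lines could work. It is only a sketch: it assumes each sentence pair comes with the URLs it was extracted from, represented here as a hypothetical `(source_url, target_url, source_text, target_text)` tuple, which is not necessarily how the released files are laid out. `is_banned` and `drop_banned` are illustrative names.

```python
from urllib.parse import urlsplit


def is_banned(url, banned_domains):
    """True if the URL's host is a banned domain or a subdomain of one."""
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in banned_domains)


def drop_banned(pairs, banned_domains):
    """Keep only pairs where neither side comes from a banned domain.

    pairs: iterable of (source_url, target_url, source_text, target_text)
    tuples (a hypothetical layout chosen for this example).
    """
    return [
        pair
        for pair in pairs
        if not (is_banned(pair[0], banned_domains) or is_banned(pair[1], banned_domains))
    ]


if __name__ == "__main__":
    banned = {"transposh.org", "newsrule.com"}
    pairs = [
        ("http://transposh.org/en/", "http://transposh.org/ha/", "a", "b"),
        ("https://www.bbc.com/news", "https://www.bbc.com/hausa", "e", "f"),
    ]
    print(drop_banned(pairs, banned))
```

Matching on the domain suffix means that banning glosbe.com would also cover fr.glosbe.com, it.glosbe.com and the other language subdomains that show up in the lists.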
Hi! In a collaboration between https://gourmet-project.eu/ and https://paracrawl.eu/, we have some parallel corpora. It's so new we haven't linked to it from the website yet.
The raw data comes from Internet Archive WIDE0006, Internet Archive WIDE00015, and our own crawl. Our own crawl targeted sites in CommonCrawl that had enough text in at least two EU languages, but we then crawled the whole domain for each of them.
Text: https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.txt.gz https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.txt.gz
The same in TMX: https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ha.tmx.gz https://s3.amazonaws.com/web-language-models/paracrawl/bonus/en-ig.tmx.gz