[x] 1. Write out sentence pairs with similarity less than or equal to k, where k = 0, 0.01, 0.02, ..., 0.45,
and store them in text files (./notebooks/logs/sentence_pairs.use_multilingual_1.sim_less_than_or_equal_to-{k}.txt)
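The threshold sweep in this item could be sketched roughly as below. This is only an illustration, not the notebook's actual code: the function names, the `(source, target, score)` tuple layout, the tab-separated line format, and the two-decimal formatting of `{k}` in the file name are all assumptions.

```python
from pathlib import Path

# File name pattern taken from the notes; how {k} is formatted is an assumption.
FILENAME = "sentence_pairs.use_multilingual_1.sim_less_than_or_equal_to-{k}.txt"


def thresholds(last_step=45):
    """Return k = 0.00, 0.01, ..., 0.45.

    Dividing integer steps by 100 avoids the drift that comes from
    repeatedly adding 0.01 to a float.
    """
    return [step / 100 for step in range(last_step + 1)]


def write_pairs_by_threshold(scored_pairs, out_dir):
    """Write one file per threshold k with every pair whose score is <= k.

    scored_pairs: iterable of (source, target, score) tuples (hypothetical layout).
    Returns a dict mapping each k to the number of pairs written for it.
    """
    scored_pairs = list(scored_pairs)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for k in thresholds():
        selected = [pair for pair in scored_pairs if pair[2] <= k]
        path = out_dir / FILENAME.format(k=f"{k:.2f}")
        with path.open("w", encoding="utf-8") as handle:
            for source, target, score in selected:
                handle.write(f"{score:.3f}\t{source}\t{target}\n")
        counts[k] = len(selected)
    return counts
```

Calling `write_pairs_by_threshold(scored_pairs, "./notebooks/logs")` would produce 46 files, and the returned counts correspond to a cumulative listing like the one in the results.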
Result:
Sentence pairs with similarity less than or equal to k:
Test the idea that similarity scores from the Universal Sentence Encoder can be used to detect sentence pairs that are not aligned with each other.
For example:
English: That's a lot of shit. Thai: [ผมสัญญาว่าจะทำอุปกรณ์กีฬาให้ดี จัดตั้งกรุ๊ปมิตติ้งกับหญิงโรงเรียนคังซาน; ] [ลดการบ้านให้น้อยลง...
English: Mom, I don't like it here. Thai: 45; 04 โอ พระเจ้า!
English: It seems a rather terrible gentleman on death row confessed to all the crimes you're accused of. Thai: ไอแซค?427 00; 29;
Criteria for filtering out a sentence pair:
where the similarity function is defined as
similarity(source, target) = np.inner(UniversalSentenceEncoder(source), UniversalSentenceEncoder(target))
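A minimal, dependency-free sketch of this definition follows. It assumes the encoder is passed in as a callable that returns a 1-D vector. For 1-D vectors `np.inner` is the plain dot product, which is what the sum below computes. `toy_encode` is a hypothetical stand-in so the example runs without loading the real model, and `is_misaligned` with its `threshold` argument is an illustrative name, not something taken from the notebook.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def similarity(source: str, target: str, encode: Callable[[str], Vector]) -> float:
    """Inner product of the two sentence embeddings."""
    u, v = encode(source), encode(target)
    return sum(a * b for a, b in zip(u, v))


def is_misaligned(source: str, target: str,
                  encode: Callable[[str], Vector], threshold: float) -> bool:
    """Flag a pair whose similarity is less than or equal to the threshold."""
    return similarity(source, target, encode) <= threshold


def toy_encode(text: str) -> Vector:
    """Hypothetical stand-in for the real encoder (NOT the Universal Sentence Encoder).

    Returns a unit-length vector of [ASCII letter count, digit count].
    """
    counts = [
        sum(ch.isascii() and ch.isalpha() for ch in text),
        sum(ch.isdigit() for ch in text),
    ]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


print(similarity("abc", "abd", toy_encode))            # letters only on both sides
print(is_misaligned("abc", "123", toy_encode, 0.45))   # letters vs. digits
```

With the real model, `encode` would wrap the loaded encoder and return the embedding of a single sentence. The rest of the logic would stay the same.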
Todos:
Result:
Similarity <= 0.000 , Count: 21815, (link)
Similarity <= 0.010 , Count: 26011, (link)
Similarity <= 0.020 , Count: 30563, (link)
Similarity <= 0.030 , Count: 35452, (link)
Similarity <= 0.040 , Count: 40653, (link)
Similarity <= 0.050 , Count: 46035, (link)
Similarity <= 0.060 , Count: 51836, (link)
Similarity <= 0.070 , Count: 57914, (link)
Similarity <= 0.080 , Count: 64113, (link)
Similarity <= 0.090 , Count: 70402, (link)
Similarity <= 0.100 , Count: 76724, (link)
Similarity <= 0.110 , Count: 83325, (link)
Similarity <= 0.120 , Count: 90037, (link)
Similarity <= 0.130 , Count: 96964, (link)
Similarity <= 0.140 , Count: 103949, (link)
Similarity <= 0.150 , Count: 111493, (link)
Similarity <= 0.160 , Count: 119210, (link)
Similarity <= 0.170 , Count: 127061, (link)
Similarity <= 0.180 , Count: 135320, (link)
Similarity <= 0.190 , Count: 143880, (link)
Similarity <= 0.200 , Count: 152980, (link)
Similarity <= 0.210 , Count: 162409, (link)
Similarity <= 0.220 , Count: 172421, (link)
Similarity <= 0.230 , Count: 183007, (link)
Similarity <= 0.240 , Count: 194556, (link)
Similarity <= 0.250 , Count: 206159, (link)
Similarity <= 0.260 , Count: 218271, (link)
Similarity <= 0.270 , Count: 231600, (link)
Similarity <= 0.280 , Count: 245396, (link)
Similarity <= 0.290 , Count: 260052, (link)
Similarity <= 0.300 , Count: 275305, (link)
Similarity <= 0.310 , Count: 292169, (link)
Similarity <= 0.320 , Count: 309941, (link)
Similarity <= 0.330 , Count: 328636, (link)
Similarity <= 0.340 , Count: 348632, (link)
Similarity <= 0.350 , Count: 369249, (link)
Similarity <= 0.360 , Count: 391631, (link)
Similarity <= 0.370 , Count: 414961, (link)
Similarity <= 0.380 , Count: 439945, (link)
Similarity <= 0.390 , Count: 466008, (link)
Similarity <= 0.400 , Count: 493923, (link)
Similarity <= 0.410 , Count: 523134, (link)
Similarity <= 0.420 , Count: 554502, (link)
Similarity <= 0.430 , Count: 587979, (link)
Similarity <= 0.440 , Count: 622612, (link)
Similarity <= 0.450 , Count: 659468, (link)
Sentence pairs with similarity within each range:
Similarity within a range of (-1.000, 0.000], Count: 21815 --- (link)
Similarity within a range of (0.000, 0.010], Count: 4196 --- (link)
Similarity within a range of (0.010, 0.020], Count: 4552 --- (link)
Similarity within a range of (0.020, 0.030], Count: 4889 --- (link)
Similarity within a range of (0.030, 0.040], Count: 5201 --- (link)
Similarity within a range of (0.040, 0.050], Count: 5382 --- (link)
Similarity within a range of (0.050, 0.060], Count: 5801 --- (link)
Similarity within a range of (0.060, 0.070], Count: 6078 --- (link)
Similarity within a range of (0.070, 0.080], Count: 6199 --- (link)
Similarity within a range of (0.080, 0.090], Count: 6289 --- (link)
Similarity within a range of (0.090, 0.100], Count: 6322 --- (link)
Similarity within a range of (0.100, 0.110], Count: 6601 --- (link)
Similarity within a range of (0.110, 0.120], Count: 6712 --- (link)
Similarity within a range of (0.120, 0.130], Count: 6927 --- (link)
Similarity within a range of (0.130, 0.140], Count: 6985 --- (link)
Similarity within a range of (0.140, 0.150], Count: 7544 --- (link)
Similarity within a range of (0.150, 0.160], Count: 7717 --- (link)
Similarity within a range of (0.160, 0.170], Count: 7851 --- (link)
Similarity within a range of (0.170, 0.180], Count: 8259 --- (link)
Similarity within a range of (0.180, 0.190], Count: 8560 --- (link)
Similarity within a range of (0.190, 0.200], Count: 9100 --- (link)
Similarity within a range of (0.200, 0.210], Count: 9429 --- (link)
Similarity within a range of (0.210, 0.220], Count: 10012 --- (link)
Similarity within a range of (0.220, 0.230], Count: 10586 --- (link)
Similarity within a range of (0.230, 0.240], Count: 11549 --- (link)
Similarity within a range of (0.240, 0.250], Count: 11603 --- (link)
Similarity within a range of (0.250, 0.260], Count: 12112 --- (link)
Similarity within a range of (0.260, 0.270], Count: 13329 --- (link)
Similarity within a range of (0.270, 0.280], Count: 13796 --- (link)
Similarity within a range of (0.280, 0.290], Count: 14656 --- (link)
Similarity within a range of (0.290, 0.300], Count: 15253 --- (link)
Similarity within a range of (0.300, 0.310], Count: 16864 --- (link)
Similarity within a range of (0.310, 0.320], Count: 17772 --- (link)
Similarity within a range of (0.320, 0.330], Count: 18695 --- (link)
Similarity within a range of (0.330, 0.340], Count: 19996 --- (link)
Similarity within a range of (0.340, 0.350], Count: 20617 --- (link)
Similarity within a range of (0.350, 0.360], Count: 22382 --- (link)
Similarity within a range of (0.360, 0.370], Count: 23330 --- (link)
Similarity within a range of (0.370, 0.380], Count: 24984 --- (link)
Similarity within a range of (0.380, 0.390], Count: 26063 --- (link)
Similarity within a range of (0.390, 0.400], Count: 27915 --- (link)
Similarity within a range of (0.400, 0.410], Count: 29211 --- (link)
Similarity within a range of (0.410, 0.420], Count: 31368 --- (link)
Similarity within a range of (0.420, 0.430], Count: 33477 --- (link)
Similarity within a range of (0.430, 0.440], Count: 34633 --- (link)
Similarity within a range of (0.440, 0.450], Count: 36856 --- (link)
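The per-range counts are consistent with the cumulative counts: each range count equals the difference between two consecutive "Similarity <= k" counts. The sketch below checks this on the first four thresholds from the results. The function name is an assumption, and the lower bound of -1.0 for the first range is taken from the listing.

```python
def bin_counts(cumulative, lowest=-1.0):
    """Turn cumulative counts into per-range counts.

    cumulative: list of (k, count) pairs sorted by k ascending, where count is
    the number of pairs with similarity <= k.
    Returns a list of ((low, high), count) for the ranges (low, high].
    """
    bins = []
    prev_k, prev_count = lowest, 0
    for k, count in cumulative:
        bins.append(((prev_k, k), count - prev_count))
        prev_k, prev_count = k, count
    return bins


# First four cumulative counts copied from the results.
cumulative = [(0.00, 21815), (0.01, 26011), (0.02, 30563), (0.03, 35452)]

for (low, high), count in bin_counts(cumulative):
    print(f"Similarity within a range of ({low:.3f}, {high:.3f}], Count: {count}")
# Similarity within a range of (-1.000, 0.000], Count: 21815
# Similarity within a range of (0.000, 0.010], Count: 4196
# Similarity within a range of (0.010, 0.020], Count: 4552
# Similarity within a range of (0.020, 0.030], Count: 4889
```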
Examples of sentence pairs, with their similarity scores, for every bin from 0.000 to 0.450 (bin size 0.01):