nathanshartmann / portuguese_word_embeddings

Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks
GNU General Public License v3.0

Invalid lines in embeddings #3

Closed pvcastro closed 6 years ago

pvcastro commented 6 years ago

Hi there! Great job on these embeddings!

I was using them to train a NER model and noticed that some lines are in an invalid format: some contain two tokens before the word vector, and others contain no token at all before the vector. Here are some examples I collected (I'm attaching all of them to this issue):

FastText:

skip_s100_invalid_lines.txt

linha 2618: 0,0 km² 1.3471 -0.89272 0.24963 -0.43362 1.4181 -1.1619 0.75171 0.20959 0.18636 0.65931 1.1528 -0.29139 1.3373 -0.11006 -0.72911 -0.79949 -0.76637 0.043631 -0.15634 0.60977 0.58492 1.057 -1.804 -0.34065 0.068916 -1.2928 0.76522 0.11291 -0.23761 -1.0092 0.52692 1.0372 -0.34624 -0.76549 -1.0159 0.41543 -1.128 0.32626 0.50076 0.73435 -0.92034 -0.42143 1.7997 0.72186 0.36473 1.139 -1.1148 1.7712 -0.12962 -1.0424 -0.27582 -0.045895 0.069847 0.013625 -0.55709 -0.16886 -0.022214 0.50913 0.7321 -0.57073 0.095712 -0.11376 0.18697 0.55433 0.072265 -0.080268 0.99282 0.16483 0.22592 0.53295 -1.3126 0.023292 0.88148 1.3048 -0.6718 -0.89945 0.58785 -0.97323 0.34167 1.8696 -0.68135 0.076409 0.62791 -0.40515 0.3237 0.72617 -0.41143 0.036656 0.24826 -0.81591 0.41292 0.537 0.20369 0.80803 -0.75 0.98709 -0.50751 -0.6739 -0.29898 -0.23988
linha 3906: ã s 0.31366 -0.46417 -0.2101 0.6402 1.0168 0.62714 0.94433 -0.79981 0.52926 -0.067065 0.99118 0.17664 -0.22477 -0.93925 -0.28902 -0.09314 -0.52883 0.13341 -0.25957 0.80455 0.84604 0.081927 0.087947 -0.40282 -1.2475 -0.90543 0.50472 -0.263 -0.69297 0.31247 -0.47517 0.46997 -0.12649 -0.58831 0.61016 0.53584 -0.83699 0.32526 0.081563 0.40073 -1.5383 0.43729 0.45408 -1.1574 0.35063 0.19737 0.70146 0.62286 0.10975 -0.47485 0.47548 0.25552 0.89394 0.54781 0.1901 0.30729 -0.63491 0.3869 0.39235 0.19513 0.33399 -0.52563 -0.04087 -0.23115 0.20164 0.28505 0.62356 0.046043 -0.32291 -0.75999 0.58555 0.56726 0.76853 0.21236 -0.79027 0.3673 0.71262 0.018268 -0.37527 0.002464 0.0016207 -0.20957 -0.082549 -0.2598 -0.61734 1.0905 0.38158 -0.050387 0.064086 -0.13071 -0.73415 -0.29921 0.97657 0.43263 -0.1172 0.22682 0.052972 -0.17632 -0.13901 -0.65764

Glove:

glove_s100_invalid_lines.txt

linha 4311: 00 km -1.096497 -0.115348 0.533725 -0.780431 0.811165 0.062700 0.358095 -0.824477 -0.150925 0.983429 0.943246 1.464778 0.209856 -0.249053 -1.210661 0.569873 0.791388 0.942387 0.548097 -1.028815 0.333276 0.095697 0.329763 -1.049485 0.201676 0.260916 -0.576804 0.575832 -0.641334 0.775280 1.265637 0.630137 1.094773 -0.172359 -0.183386 0.695177 0.185336 0.106616 0.611617 -0.290683 0.593486 -0.200387 -0.373662 0.566315 -0.257140 -0.448474 -0.523373 -1.374823 0.407349 1.321078 0.361554 -0.825517 0.677784 -0.156055 0.175524 -0.560491 0.011153 -0.284771 -0.653020 0.376460 0.052716 -0.398949 -0.523565 -0.430178 0.311725 0.773853 -0.093564 -0.140165 -0.746921 -0.350721 0.895749 -0.025408 0.265322 0.758847 0.895762 -0.341492 0.380499 0.127606 -0.437591 0.639467 0.029006 -0.166302 0.237734 0.484939 1.342966 -0.384358 -0.333631 -0.612093 0.747059 0.286568 0.118164 -0.534590 -0.258652 0.258338 0.038046 0.224927 0.346821 -0.026168 1.198875 0.945606
linha 5357: -0.291527 -0.207144 -0.194693 0.296796 0.291955 -0.057791 -0.345879 0.186292 0.172007 0.560378 -0.280491 -0.027786 0.275376 -0.028554 -0.811503 0.139032 0.309938 0.034639 0.220700 0.491905 -0.324429 0.379279 0.049063 0.139276 0.133582 0.069576 0.034449 -0.756578 0.040279 0.264051 -0.363349 0.542291 -0.076136 0.032225 -0.087384 -0.030874 0.693043 -0.068561 -0.045448 -0.084627 -0.121269 0.321651 -0.134468 -0.026799 0.143538 -0.638123 -0.279734 0.277754 -0.478550 -0.189353 0.435826 0.066824 -0.076424 0.411868 -0.221163 -0.035606 0.013376 -0.115047 0.012623 0.229935 0.253806 -0.206941 0.260670 0.143976 -0.074355 -0.009542 0.193227 0.147870 0.319481 -0.292567 -0.569469 -0.088245 0.121289 -0.168956 0.472040 0.140808 0.109296 -0.338164 -0.227513 0.313779 -0.117033 -0.416819 -0.511378 -0.216577 0.739741 -0.078641 -0.034593 0.021032 0.163312 0.190173 -0.521095 0.129220 -0.104790 0.265647 0.140339 -0.431014 0.510587 -0.005205 -0.969191 0.232102

Wang2Vec:

skip_s100_invalid_lines.txt

linha 7846: 00,00000000 km/s 0.240993 -0.942322 1.600320 2.077391 -0.437721 1.297975 -0.812284 -1.310465 -1.348463 0.607501 -1.628980 -3.357586 0.441017 -1.969249 -0.532445 -0.779277 1.145346 1.323124 0.036780 -0.931161 0.165157 1.136547 2.136952 -4.218977 -1.187829 0.170832 1.432486 0.803935 1.366924 0.308831 1.272064 -1.100312 -2.382427 0.523305 1.807923 -2.486763 -0.183749 0.611948 0.744579 -1.310416 -0.284388 3.821928 -0.909492 -1.045277 0.271173 -1.197620 -2.247871 -1.635551 -2.900197 -1.622078 0.620783 2.011719 0.018628 -0.637541 -0.697962 -2.199550 -0.254782 -2.114877 -1.237467 -1.303796 2.050109 -0.162425 1.135828 -0.057633 2.631514 1.300285 1.556403 -2.677551 0.374641 -0.242459 -0.457558 -1.584399 2.818794 0.216729 -1.010701 -1.776988 -0.405578 -0.661738 -1.394717 0.366224 -1.088951 1.938241 -0.616742 -0.179179 -0.179525 2.822714 0.170322 -0.649997 0.168230 -1.441733 0.483268 -1.246915 -2.006533 0.208874 -2.711892 -1.008742 3.018411 -0.081341 1.063304 2.420181
linha 9243: 00,0 km² 1.651085 -1.506577 1.520257 -0.205966 -1.928894 0.998100 2.195940 -0.961068 0.351457 0.482292 -0.151572 -1.338510 -0.463367 -0.693778 -0.650503 -0.406332 2.583741 0.993141 2.598100 -1.047690 -1.740937 0.072660 2.609116 -1.648594 -0.082490 -1.155029 2.213265 -1.073674 3.068192 -0.721511 -1.001577 -1.458915 -2.073348 0.981863 -0.553089 -0.040057 -1.389118 0.078736 2.475482 -1.140503 0.143878 -0.893640 -1.510324 -2.826583 0.114413 -1.218856 -0.990987 1.237282 -1.965299 -0.283821 0.757455 0.554541 -0.913564 -0.599824 2.179213 0.811821 -0.490595 -1.395128 -0.619808 -1.689423 0.831838 0.506574 0.977351 -0.003260 -0.375976 1.611319 0.006866 0.595251 1.554341 0.082847 -0.714754 -1.541869 0.210436 -0.917407 -0.883670 -1.164431 -1.184115 -0.660099 -1.314602 -0.042511 1.074625 0.165777 -1.497031 -0.666364 2.369030 -1.166566 0.089672 0.908849 1.599464 -0.711212 -1.640689 1.568981 -2.217598 0.966219 -0.091045 -0.697727 -0.870052 2.419719 -0.822139 0.927472

Word2Vec:

skip_s100_invalid_lines.txt

linha 18596: -0.028997 0.013575 0.062561 -0.201263 -0.583117 -0.217628 0.022837 -0.170598 0.099084 -0.651933 0.033815 -0.245818 0.435562 0.058953 -0.119798 0.205226 -0.429767 -0.250009 -0.015448 -0.083469 0.485238 -0.224591 0.545159 0.212638 -0.339153 -0.132941 0.010644 0.101276 0.096178 0.422906 -0.239319 -0.023387 -0.321550 -0.085336 -0.013101 0.046811 -0.502934 -0.369628 0.077045 -0.023889 0.454306 -0.282537 0.106553 0.081499 -0.335621 -0.191669 -0.016532 -0.067348 0.511277 0.104629 -0.328300 0.163664 0.811827 0.046643 0.122906 -0.051964 -0.240036 0.149031 -0.235158 0.615142 0.008031 0.149618 -0.531543 0.414220 -0.211937 -0.213367 0.070715 -0.171367 0.223777 0.147169 -0.412582 0.242935 0.454417 0.268225 -0.066188 -0.282406 0.455986 0.129857 -0.169192 -0.357411 -0.368528 -0.555435 0.152939 0.075229 -0.079797 -0.420233 0.153397 0.177482 0.295766 0.456925 -0.625738 -0.558847 0.046555 -0.111168 -0.629639 -0.048618 0.046491 0.378669 0.363404 0.432986
linha 19084: 00 000 0.155477 0.865527 -0.188853 0.765310 0.871128 0.516933 -0.470242 -0.320178 -0.308708 0.139145 0.258087 -0.124555 0.800621 0.327414 -0.105571 -0.645862 -1.363663 1.377966 0.763156 -0.074988 0.254229 -0.252865 0.510145 -0.462393 -0.405541 0.564130 0.197429 -0.015470 0.018540 -0.859302 0.207488 -0.903036 -0.877624 0.850438 -0.263961 0.003113 0.523423 -0.314306 0.051861 -0.332538 1.166323 -0.309925 0.077705 0.063017 0.586330 -0.287491 -0.029791 0.508337 -0.685614 -0.494796 0.514873 0.109300 -1.321291 0.180958 -0.994851 -0.276255 -0.819886 0.693954 0.596928 0.328566 -0.168327 -0.463682 -0.407658 0.546542 0.501838 0.160602 -0.119224 0.698923 0.336641 -0.057307 -0.040251 -0.070688 -0.320412 -0.133106 -0.513218 0.103101 0.351398 -0.990816 -1.052826 0.124401 -0.495737 -0.059665 -0.161377 -0.344754 0.061336 0.344841 -0.037125 -0.410835 -0.857732 -0.045314 0.241486 -0.485521 0.442395 -0.474608 -0.514554 0.141617 -0.168770 -0.015196 0.706256 0.522127
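
A check of this kind can be sketched in a few lines of Python. This is only an illustration, not the script used to produce the attached files: the function name `find_invalid_lines` and the tiny three-dimensional sample are hypothetical, and it assumes each valid line holds exactly one token followed by `dim` values.

```python
from typing import Iterable, Iterator, Tuple


def find_invalid_lines(lines: Iterable[str], dim: int) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, field_count) for lines that are not one token plus `dim` values."""
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # split on any run of whitespace
        if len(fields) != dim + 1:
            yield number, len(fields)


# Hypothetical sample with 3-dimensional vectors instead of 100.
sample = [
    "casa 0.1 0.2 0.3",     # valid: one token, three values
    "0,0 km² 0.1 0.2 0.3",  # two tokens before the vector
    " 0.1 0.2 0.3",         # no token before the vector
]
invalid = list(find_invalid_lines(sample, dim=3))
print(invalid)
```

If the file starts with a header line, as the word2vec text format does, that line would be reported too and can simply be skipped.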

Thanks!

nathanshartmann commented 6 years ago

Thank you very much for your analysis. I will update our models as soon as possible and make them available.

Best regards.

Nathan Siegle Hartmann

On May 9, 2018, at 06:49, Pedro Vitor Quinta de Castro (notifications@github.com) wrote the report above.

nathanshartmann commented 6 years ago

Hello again.

These errors occur when a \xa0 character (a non-breaking space) appears. It also represents a space, but it is not being handled properly.

Because the issue is so rare, I am going to remove the lines you pointed out.
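
The effect can be reproduced in Python, assuming the files are parsed by splitting on whitespace (the thread does not say which parser was used). The sample line is a shortened, hypothetical reconstruction of the FastText example, with \xa0 placed inside the token. `str.split()` with no argument treats \xa0 as whitespace, while `split(" ")` splits only on the ordinary space.

```python
# Assumed shape of the stored line: the token itself contains a non-breaking space.
line = "0,0\xa0km² 1.3471 -0.89272"

whitespace_fields = line.split()   # splits on any Unicode whitespace, including \xa0
space_fields = line.split(" ")     # splits on the ordinary space only

print(len(whitespace_fields))  # 4: the token was cut into "0,0" and "km²"
print(len(space_fields))       # 3: the token stays whole
```

A token that consists only of \xa0 would disappear entirely under the first kind of split, which would match the lines that show no token before the vector.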

pvcastro commented 6 years ago

Thanks! If we apply your preprocessing (current master) to another corpus, will this error happen again?

nathanshartmann commented 6 years ago

Only if these unusual characters occur. I will also update the preprocessing script to fix that.
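
A fix along these lines might look like the sketch below. It is not the repository's preprocessing script: the function name `normalize_spaces` is hypothetical, and it assumes that turning each \xa0 into an ordinary space before tokenization is the desired behavior.

```python
import re

# Runs of ordinary spaces and non-breaking spaces (\xa0).
_SPACE_RUN = re.compile(r"[ \xa0]+")


def normalize_spaces(text: str) -> str:
    """Replace non-breaking spaces with ordinary spaces and collapse runs of spaces."""
    return _SPACE_RUN.sub(" ", text).strip()


cleaned = normalize_spaces("área de 0,0\xa0km² \xa0 ")
tokens = cleaned.split(" ")
print(tokens)
```

With the corpus normalized this way before training, no token can contain \xa0, so every line of the resulting embedding file should split into one token plus the vector.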

pvcastro commented 6 years ago

Great, thanks!