Closed: pvcastro closed this issue 6 years ago
Thank you very much for your analysis. I will update our models as soon as possible and make them available.
Best regards.
Nathan Siegle Hartmann
On May 9, 2018, at 06:49, Pedro Vitor Quinta de Castro notifications@github.com wrote:
Hi there! Great job on these embeddings!
I was using them to train a NER model, and I realized that some of the lines are in an invalid format, such as containing two tokens before the word vector, or no token at all before the vector. Here are some examples I collected (I'm attaching all of them in this issue):
FastText:
skip_s100_invalid_lines.txt
line 2618: 0,0 km² 1.3471 -0.89272 0.24963 -0.43362 1.4181 -1.1619 0.75171 0.20959 0.18636 0.65931 1.1528 -0.29139 1.3373 -0.11006 -0.72911 -0.79949 -0.76637 0.043631 -0.15634 0.60977 0.58492 1.057 -1.804 -0.34065 0.068916 -1.2928 0.76522 0.11291 -0.23761 -1.0092 0.52692 1.0372 -0.34624 -0.76549 -1.0159 0.41543 -1.128 0.32626 0.50076 0.73435 -0.92034 -0.42143 1.7997 0.72186 0.36473 1.139 -1.1148 1.7712 -0.12962 -1.0424 -0.27582 -0.045895 0.069847 0.013625 -0.55709 -0.16886 -0.022214 0.50913 0.7321 -0.57073 0.095712 -0.11376 0.18697 0.55433 0.072265 -0.080268 0.99282 0.16483 0.22592 0.53295 -1.3126 0.023292 0.88148 1.3048 -0.6718 -0.89945 0.58785 -0.97323 0.34167 1.8696 -0.68135 0.076409 0.62791 -0.40515 0.3237 0.72617 -0.41143 0.036656 0.24826 -0.81591 0.41292 0.537 0.20369 0.80803 -0.75 0.98709 -0.50751 -0.6739 -0.29898 -0.23988
line 3906: ã s 0.31366 -0.46417 -0.2101 0.6402 1.0168 0.62714 0.94433 -0.79981 0.52926 -0.067065 0.99118 0.17664 -0.22477 -0.93925 -0.28902 -0.09314 -0.52883 0.13341 -0.25957 0.80455 0.84604 0.081927 0.087947 -0.40282 -1.2475 -0.90543 0.50472 -0.263 -0.69297 0.31247 -0.47517 0.46997 -0.12649 -0.58831 0.61016 0.53584 -0.83699 0.32526 0.081563 0.40073 -1.5383 0.43729 0.45408 -1.1574 0.35063 0.19737 0.70146 0.62286 0.10975 -0.47485 0.47548 0.25552 0.89394 0.54781 0.1901 0.30729 -0.63491 0.3869 0.39235 0.19513 0.33399 -0.52563 -0.04087 -0.23115 0.20164 0.28505 0.62356 0.046043 -0.32291 -0.75999 0.58555 0.56726 0.76853 0.21236 -0.79027 0.3673 0.71262 0.018268 -0.37527 0.002464 0.0016207 -0.20957 -0.082549 -0.2598 -0.61734 1.0905 0.38158 -0.050387 0.064086 -0.13071 -0.73415 -0.29921 0.97657 0.43263 -0.1172 0.22682 0.052972 -0.17632 -0.13901 -0.65764
Glove:
glove_s100_invalid_lines.txt
line 4311: 00 km -1.096497 -0.115348 0.533725 -0.780431 0.811165 0.062700 0.358095 -0.824477 -0.150925 0.983429 0.943246 1.464778 0.209856 -0.249053 -1.210661 0.569873 0.791388 0.942387 0.548097 -1.028815 0.333276 0.095697 0.329763 -1.049485 0.201676 0.260916 -0.576804 0.575832 -0.641334 0.775280 1.265637 0.630137 1.094773 -0.172359 -0.183386 0.695177 0.185336 0.106616 0.611617 -0.290683 0.593486 -0.200387 -0.373662 0.566315 -0.257140 -0.448474 -0.523373 -1.374823 0.407349 1.321078 0.361554 -0.825517 0.677784 -0.156055 0.175524 -0.560491 0.011153 -0.284771 -0.653020 0.376460 0.052716 -0.398949 -0.523565 -0.430178 0.311725 0.773853 -0.093564 -0.140165 -0.746921 -0.350721 0.895749 -0.025408 0.265322 0.758847 0.895762 -0.341492 0.380499 0.127606 -0.437591 0.639467 0.029006 -0.166302 0.237734 0.484939 1.342966 -0.384358 -0.333631 -0.612093 0.747059 0.286568 0.118164 -0.534590 -0.258652 0.258338 0.038046 0.224927 0.346821 -0.026168 1.198875 0.945606
line 5357: -0.291527 -0.207144 -0.194693 0.296796 0.291955 -0.057791 -0.345879 0.186292 0.172007 0.560378 -0.280491 -0.027786 0.275376 -0.028554 -0.811503 0.139032 0.309938 0.034639 0.220700 0.491905 -0.324429 0.379279 0.049063 0.139276 0.133582 0.069576 0.034449 -0.756578 0.040279 0.264051 -0.363349 0.542291 -0.076136 0.032225 -0.087384 -0.030874 0.693043 -0.068561 -0.045448 -0.084627 -0.121269 0.321651 -0.134468 -0.026799 0.143538 -0.638123 -0.279734 0.277754 -0.478550 -0.189353 0.435826 0.066824 -0.076424 0.411868 -0.221163 -0.035606 0.013376 -0.115047 0.012623 0.229935 0.253806 -0.206941 0.260670 0.143976 -0.074355 -0.009542 0.193227 0.147870 0.319481 -0.292567 -0.569469 -0.088245 0.121289 -0.168956 0.472040 0.140808 0.109296 -0.338164 -0.227513 0.313779 -0.117033 -0.416819 -0.511378 -0.216577 0.739741 -0.078641 -0.034593 0.021032 0.163312 0.190173 -0.521095 0.129220 -0.104790 0.265647 0.140339 -0.431014 0.510587 -0.005205 -0.969191 0.232102
Wang2Vec:
skip_s100_invalid_lines.txt
line 7846: 00,00000000 km/s 0.240993 -0.942322 1.600320 2.077391 -0.437721 1.297975 -0.812284 -1.310465 -1.348463 0.607501 -1.628980 -3.357586 0.441017 -1.969249 -0.532445 -0.779277 1.145346 1.323124 0.036780 -0.931161 0.165157 1.136547 2.136952 -4.218977 -1.187829 0.170832 1.432486 0.803935 1.366924 0.308831 1.272064 -1.100312 -2.382427 0.523305 1.807923 -2.486763 -0.183749 0.611948 0.744579 -1.310416 -0.284388 3.821928 -0.909492 -1.045277 0.271173 -1.197620 -2.247871 -1.635551 -2.900197 -1.622078 0.620783 2.011719 0.018628 -0.637541 -0.697962 -2.199550 -0.254782 -2.114877 -1.237467 -1.303796 2.050109 -0.162425 1.135828 -0.057633 2.631514 1.300285 1.556403 -2.677551 0.374641 -0.242459 -0.457558 -1.584399 2.818794 0.216729 -1.010701 -1.776988 -0.405578 -0.661738 -1.394717 0.366224 -1.088951 1.938241 -0.616742 -0.179179 -0.179525 2.822714 0.170322 -0.649997 0.168230 -1.441733 0.483268 -1.246915 -2.006533 0.208874 -2.711892 -1.008742 3.018411 -0.081341 1.063304 2.420181
line 9243: 00,0 km² 1.651085 -1.506577 1.520257 -0.205966 -1.928894 0.998100 2.195940 -0.961068 0.351457 0.482292 -0.151572 -1.338510 -0.463367 -0.693778 -0.650503 -0.406332 2.583741 0.993141 2.598100 -1.047690 -1.740937 0.072660 2.609116 -1.648594 -0.082490 -1.155029 2.213265 -1.073674 3.068192 -0.721511 -1.001577 -1.458915 -2.073348 0.981863 -0.553089 -0.040057 -1.389118 0.078736 2.475482 -1.140503 0.143878 -0.893640 -1.510324 -2.826583 0.114413 -1.218856 -0.990987 1.237282 -1.965299 -0.283821 0.757455 0.554541 -0.913564 -0.599824 2.179213 0.811821 -0.490595 -1.395128 -0.619808 -1.689423 0.831838 0.506574 0.977351 -0.003260 -0.375976 1.611319 0.006866 0.595251 1.554341 0.082847 -0.714754 -1.541869 0.210436 -0.917407 -0.883670 -1.164431 -1.184115 -0.660099 -1.314602 -0.042511 1.074625 0.165777 -1.497031 -0.666364 2.369030 -1.166566 0.089672 0.908849 1.599464 -0.711212 -1.640689 1.568981 -2.217598 0.966219 -0.091045 -0.697727 -0.870052 2.419719 -0.822139 0.927472
Word2Vec:
skip_s100_invalid_lines.txt
line 18596: -0.028997 0.013575 0.062561 -0.201263 -0.583117 -0.217628 0.022837 -0.170598 0.099084 -0.651933 0.033815 -0.245818 0.435562 0.058953 -0.119798 0.205226 -0.429767 -0.250009 -0.015448 -0.083469 0.485238 -0.224591 0.545159 0.212638 -0.339153 -0.132941 0.010644 0.101276 0.096178 0.422906 -0.239319 -0.023387 -0.321550 -0.085336 -0.013101 0.046811 -0.502934 -0.369628 0.077045 -0.023889 0.454306 -0.282537 0.106553 0.081499 -0.335621 -0.191669 -0.016532 -0.067348 0.511277 0.104629 -0.328300 0.163664 0.811827 0.046643 0.122906 -0.051964 -0.240036 0.149031 -0.235158 0.615142 0.008031 0.149618 -0.531543 0.414220 -0.211937 -0.213367 0.070715 -0.171367 0.223777 0.147169 -0.412582 0.242935 0.454417 0.268225 -0.066188 -0.282406 0.455986 0.129857 -0.169192 -0.357411 -0.368528 -0.555435 0.152939 0.075229 -0.079797 -0.420233 0.153397 0.177482 0.295766 0.456925 -0.625738 -0.558847 0.046555 -0.111168 -0.629639 -0.048618 0.046491 0.378669 0.363404 0.432986
line 19084: 00 000 0.155477 0.865527 -0.188853 0.765310 0.871128 0.516933 -0.470242 -0.320178 -0.308708 0.139145 0.258087 -0.124555 0.800621 0.327414 -0.105571 -0.645862 -1.363663 1.377966 0.763156 -0.074988 0.254229 -0.252865 0.510145 -0.462393 -0.405541 0.564130 0.197429 -0.015470 0.018540 -0.859302 0.207488 -0.903036 -0.877624 0.850438 -0.263961 0.003113 0.523423 -0.314306 0.051861 -0.332538 1.166323 -0.309925 0.077705 0.063017 0.586330 -0.287491 -0.029791 0.508337 -0.685614 -0.494796 0.514873 0.109300 -1.321291 0.180958 -0.994851 -0.276255 -0.819886 0.693954 0.596928 0.328566 -0.168327 -0.463682 -0.407658 0.546542 0.501838 0.160602 -0.119224 0.698923 0.336641 -0.057307 -0.040251 -0.070688 -0.320412 -0.133106 -0.513218 0.103101 0.351398 -0.990816 -1.052826 0.124401 -0.495737 -0.059665 -0.161377 -0.344754 0.061336 0.344841 -0.037125 -0.410835 -0.857732 -0.045314 0.241486 -0.485521 0.442395 -0.474608 -0.514554 0.141617 -0.168770 -0.015196 0.706256 0.522127
Thanks!
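For anyone who wants to locate lines like these in their own copy of the files, here is a minimal sketch. It assumes the text format is one token followed by a fixed number of floats per line, with an optional `vocab_size dim` header on the first line; the function name `find_invalid_lines` and its parameters are hypothetical and not part of the repository. It splits on any Unicode whitespace, which is what most loaders do and is why a token containing a non-breaking space shows up as two tokens (or as none).

```python
def find_invalid_lines(lines, dim, has_header=True):
    """Return (line_number, field_count) for lines that are not `token + dim floats`.

    Assumption: `lines` is an iterable of text lines from an embeddings file,
    and the first line is a `vocab_size dim` header when `has_header` is True.
    """
    invalid = []
    for number, line in enumerate(lines, start=1):
        if has_header and number == 1:
            continue  # skip the header line
        # str.split() with no argument splits on any Unicode whitespace,
        # including the non-breaking space "\xa0".
        fields = line.split()
        if len(fields) != dim + 1:
            invalid.append((number, len(fields)))
    return invalid


# Tiny made-up sample with 3-dimensional vectors (not real embedding data).
sample = [
    "3 3\n",
    "casa 0.1 0.2 0.3\n",
    "0,0\xa0km² 0.1 0.2 0.3\n",  # token contains a non-breaking space
    "\xa0 0.1 0.2 0.3\n",        # token is only a non-breaking space
]
print(find_invalid_lines(sample, dim=3))
```

With a real file, the same function could be called as `find_invalid_lines(open(path, encoding="utf-8"), dim=100)`, where `path` points to one of the s100 files.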
Hello again.
These errors occur when a \xa0 character (non-breaking space) appears. This character also represents a space, but it is not being handled properly.
Because the issue is so sparse, I am going to remove the lines you pointed out.
Thanks! If we apply your preprocessing (current master) to other corpora, is this error going to happen?
Only if these unusual characters appear. I will also update the preprocessing script to fix that.
Great, thanks!
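Until the preprocessing script is updated, one possible workaround on other corpora is to replace non-breaking spaces with regular spaces before tokenizing. The sketch below is only an illustration of that idea and not the authors' actual fix; `normalize_nbsp` and `tokenize` are hypothetical names, and the tokenizer is assumed to split on the ASCII space only.

```python
NBSP = "\xa0"  # non-breaking space


def normalize_nbsp(text):
    """Replace every non-breaking space with a regular space."""
    return text.replace(NBSP, " ")


def tokenize(text):
    """Hypothetical tokenizer that splits on the ASCII space only."""
    return [token for token in text.split(" ") if token]


raw = "área de 0,0\xa0km²"

# Without normalization the non-breaking space stays inside a single token,
# which later looks like two tokens to any reader that splits on whitespace.
print(tokenize(raw))

# After normalization the number and the unit become separate tokens.
print(tokenize(normalize_nbsp(raw)))
```

Applying the normalization before training means no vocabulary entry can contain a \xa0, so each line of the exported embeddings keeps exactly one token before the vector.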