medialab / hyphe


[Text indexation] Unicode errors sometimes when indexing into ES #473

Closed by boogheta 1 year ago

boogheta commented 1 year ago
2023-04-03 11:21:54,881 worker-3 ERROR prototype-top-80: error in index bulk, batch flag reset
Traceback (most recent call last):
  File "text_indexation.py", line 107, in indexation_task
    raise_on_error=False)
  File "/home/boogheta/.pyenv/versions/hyphe-elastic/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 304, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/home/boogheta/.pyenv/versions/hyphe-elastic/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 216, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/home/boogheta/.pyenv/versions/hyphe-elastic/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 75, in _chunk_actions
    cur_size += len(data.encode("utf-8")) + 1
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 5699-5700: surrogates not allowed
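
For context, the error means the text being indexed contains UTF-16 surrogate code points (U+D800 to U+DFFF), which Python `str` objects can hold but UTF-8 cannot encode. Below is a minimal sketch of one possible workaround, not necessarily the fix applied in hyphe: clean every string before the actions reach `elasticsearch.helpers.bulk`. The helper names `clean_text` and `clean_value` are hypothetical, and the sketch assumes each action is a plain dict, possibly with nested dicts and lists.

```python
import re

# A UTF-16 high surrogate immediately followed by a low surrogate.
_SURROGATE_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")
# Any surrogate code point still present once valid pairs have been merged.
_LONE_SURROGATE = re.compile("[\ud800-\udfff]")


def _merge_pair(match):
    # Round-trip through UTF-16 so the two surrogates become one real character.
    return match.group(0).encode("utf-16", "surrogatepass").decode("utf-16")


def clean_text(text):
    """Return a copy of `text` that can always be encoded as UTF-8."""
    text = _SURROGATE_PAIR.sub(_merge_pair, text)
    # Unpaired surrogates carry no recoverable character: use U+FFFD instead.
    return _LONE_SURROGATE.sub("\ufffd", text)


def clean_value(value):
    """Recursively clean every string found in a dict / list structure."""
    if isinstance(value, str):
        return clean_text(value)
    if isinstance(value, dict):
        return {key: clean_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [clean_value(item) for item in value]
    return value
```

Assuming the actions are built as an iterable of dicts, wrapping it as `(clean_value(action) for action in actions)` before the `bulk` call should keep `_chunk_actions` from hitting the `UnicodeEncodeError`. Valid surrogate pairs are restored to the character they represent, so only genuinely broken code points are replaced.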