koursaros-ai / nboost

NBoost is a scalable, search-api-boosting platform for deploying transformer models to improve the relevance of search results on different platforms (e.g. Elasticsearch).
Apache License 2.0

nboost-index in docker container fails with `'ascii' codec can't decode byte 0xe2` #93

Open marcinczeczko opened 3 years ago

marcinczeczko commented 3 years ago

Hi, I tried to run both Elasticsearch and nboost as Docker containers, as follows:

docker run -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.4.2
docker run -d -p 8000:8000 koursaros/nboost:latest-pt --uhost host.docker.internal --uport 9200

However, when I tried to index travel.csv from within the container, it failed with the error shown below. The command I ran:

docker exec -it <nboost-container-nameorid> nboost-index --host=host.docker.internal --file /opt/conda/lib/python3.6/site-packages/nboost/resources/travel.csv --index_name travel --delim ,

I:ESIndexer:[es.:ind: 29]:Setting up Elasticsearch index...
I:ESIndexer:[es.:ind: 32]:Creating index travel...
I:ESIndexer:[es.:ind: 37]:Indexing /opt/conda/lib/python3.6/site-packages/nboost/resources/travel.csv...
I:ESIndexer:[bas:csv: 59]:Estimating completion size...
Traceback (most recent call last):
  File "/opt/conda/bin/nboost-index", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.6/site-packages/nboost/indexers/cli.py", line 47, in main
    indexer(**args).index()
  File "/opt/conda/lib/python3.6/site-packages/nboost/indexers/es.py", line 39, in index
    bulk(elastic, actions=act)
  File "/opt/conda/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 310, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/opt/conda/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 222, in streaming_bulk
    actions, chunk_size, max_chunk_bytes, client.transport.serializer
  File "/opt/conda/lib/python3.6/site-packages/elasticsearch/helpers/actions.py", line 73, in _chunk_actions
    for action, data in actions:
  File "/opt/conda/lib/python3.6/site-packages/nboost/indexers/es.py", line 38, in <genexpr>
    act = (self.format(passage, cid=cid) for cid, passage in self.csv_generator())
  File "/opt/conda/lib/python3.6/site-packages/nboost/indexers/base.py", line 60, in csv_generator
    num_lines = count_lines(path)
  File "/opt/conda/lib/python3.6/site-packages/nboost/helpers.py", line 117, in count_lines
    count = sum(1 for _ in fileobj)
  File "/opt/conda/lib/python3.6/site-packages/nboost/helpers.py", line 117, in <genexpr>
    count = sum(1 for _ in fileobj)
  File "/opt/conda/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 839: ordinal not in range(128)

But if I run nboost directly on macOS, the indexer works just fine for the same file.

Any ideas what might have gone wrong?

nbroad1881 commented 3 years ago

Check what the default and preferred encodings are.

Start a Python REPL and run the following lines:

import sys
import locale

print(sys.getdefaultencoding())
print(locale.getpreferredencoding())

Compare the output on macOS with the output inside the Docker container. I still haven't figured out why, but I don't get the error when I set the encoding manually, like this:

with open(filename, 'r', encoding='encoding_name') as f:  # e.g. encoding='utf-8'
    count = sum(1 for _ in f)  # same counting logic as helpers.py line 117

You'll have to modify helpers.py and maybe one other file.
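
As a rough illustration of that change, here is a minimal sketch. It assumes travel.csv is UTF-8 encoded, which the thread does not confirm; byte 0xe2 is at least consistent with that, since it is the lead byte of three-byte UTF-8 sequences such as curly quotes and dashes. The `count_lines` function below is a hypothetical stand-in for the one in nboost's helpers.py. The traceback only shows its counting line, so the `encoding` parameter and the sample file are my own additions for the demo.

```python
import os
import tempfile


def count_lines(path, encoding="utf-8"):
    """Count lines in a text file, decoding with an explicit encoding.

    Hypothetical stand-in for nboost's helper; passing `encoding` means the
    result no longer depends on the locale of the machine or container.
    """
    with open(path, "r", encoding=encoding) as fileobj:
        return sum(1 for _ in fileobj)


# Build a small sample CSV containing a right single quotation mark (U+2019),
# which UTF-8 encodes as the bytes e2 80 99.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("id,text\n1,Where\u2019s the nearest hotel?\n".encode("utf-8"))
    sample_path = tmp.name

try:
    # Decoding as ASCII reproduces the kind of failure seen in the traceback.
    try:
        count_lines(sample_path, encoding="ascii")
        ascii_failed = False
    except UnicodeDecodeError as err:
        ascii_failed = True
        print("ascii:", err)

    # Decoding as UTF-8 reads the same bytes without error.
    utf8_count = count_lines(sample_path, encoding="utf-8")
    print("utf-8 line count:", utf8_count)
finally:
    os.remove(sample_path)
```

If the two environments report different preferred encodings, that would explain why the same file only fails in the container: when `open()` is called without an `encoding` argument, Python 3.6 falls back to `locale.getpreferredencoding(False)`.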

I'd try something else, though, because this repo looks abandoned.