app.py: indexing pngs when nlp selected

jina-ai / cookiecutter-jina

Cookiecutter template for a Jina project

Apache License 2.0

app.py: indexing pngs when nlp selected #2

Closed: alexcg1 closed this issue 3 years ago

alexcg1 commented 4 years ago

Describe your problem

With cookiecutter options:

task_type: nlp
index_type: files

the template creates app.py, where line 26 is:

f.index_files('data/**/*.png', batch_size=64, read_mode='rb', size=num_docs)

Since we're dealing with NLP, should the extension be changed to txt or csv?

Why do you think it's happening?

Misconfigured variable in cookiecutter?
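
To illustrate the kind of fix this question points at, here is a minimal sketch of choosing the glob pattern from the task type instead of hardcoding `*.png`. The function name `glob_pattern_for`, the `cv` key, and the `.txt` extension are assumptions made for illustration; none of them come from the template itself.

```python
# Hypothetical mapping from task type to the files that should be indexed.
# Only 'nlp' and the '*.png' pattern appear in the issue; the rest is illustrative.
PATTERNS = {
    "nlp": "data/**/*.txt",
    "cv": "data/**/*.png",
}


def glob_pattern_for(task_type: str) -> str:
    """Return the glob pattern to index for the given task type."""
    try:
        return PATTERNS[task_type]
    except KeyError:
        # Fail loudly rather than silently falling back to an image pattern.
        raise ValueError(f"unknown task_type: {task_type!r}") from None
```

With something like this, the generated app.py could pass `glob_pattern_for('nlp')` to `f.index_files(...)` rather than a fixed image pattern.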

Environment

Downloaded latest cookiecutter-jina

alexcg1 commented 3 years ago

@yuanbit I just tested again today and this bug still exists. Could you look into it pls?

alexcg1 commented 3 years ago

In app.py:

    with f:
        f.index_files('data/**/*.png', batch_size=8, read_mode='rb', size=num_docs)

yuanbit commented 3 years ago

@alexcg1 I think this happens because for nlp you are only supposed to select strings for index_type. In that case the extension doesn't matter, because f.index_lines(filepath=data_path, batch_size=16, read_mode='r', size=num_docs) in line 32 only needs the file path. I think the only requirement for using cookiecutter with an nlp task is that the data has one document per line.
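
As a rough sketch of what "one document per line" means in practice, the plain-Python helper below reads a text file and treats each non-empty line as one document. It does not use Jina; `read_documents` is a hypothetical name, and skipping blank lines is an assumption about the desired behavior, not something `index_lines` is confirmed to do.

```python
from itertools import islice
from typing import Iterator, Optional


def read_documents(filepath: str, size: Optional[int] = None) -> Iterator[str]:
    """Yield up to `size` documents from a file, one document per line."""
    with open(filepath, "r", encoding="utf-8") as fp:
        # Strip the trailing newline so each document is just its text.
        lines = (line.rstrip("\n") for line in fp)
        # Assumption: blank lines are not documents and are skipped.
        non_empty = (line for line in lines if line.strip())
        # islice with size=None yields every remaining line.
        yield from islice(non_empty, size)
```

Because only the path is opened and read as text, the file extension plays no role here: a `.txt`, `.csv`, or extensionless file works the same way as long as each line holds one document.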