jina-ai / serve

☁️ Build multimodal AI applications with cloud-native stack
https://jina.ai/serve
Apache License 2.0

Crafter does not send chunk but entire text to the Encoder #884

Closed ArturTanona closed 4 years ago

ArturTanona commented 4 years ago

Describe your problem

I am playing with jina 0.4.1, partially based on the urbandict-search configuration. My example is available in this repo: https://github.com/ArturTan/invalid_flow_of_jina

I included my own data (5 long texts), and it seems that the encoder does not receive chunks but the entire text.

I used two custom classes to debug this problem: CustomSenticizer and CustomEncoder, which inherit from Sentencizer and TransformerTorchEncoder respectively. They save the input and output data of the craft and encode methods respectively.
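A minimal sketch of this kind of recording subclass, written without jina so it runs on its own. `BaseSplitter` and `RecordingSplitter` are hypothetical stand-ins for Sentencizer and CustomSenticizer; they are not jina's API, and the regex-based sentence splitting is only an assumption for illustration.

```python
import re


class BaseSplitter:
    """Hypothetical stand-in for a sentence-splitting crafter (not jina's Sentencizer)."""

    def craft(self, text):
        # Split after sentence-ending punctuation that is followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [{"text": p, "offset": i} for i, p in enumerate(parts) if p]


class RecordingSplitter(BaseSplitter):
    """Records every input and output of craft() for later inspection."""

    def __init__(self):
        self.calls = []

    def craft(self, text):
        result = super().craft(text)
        # Keep both sides of the call so they can be compared afterwards.
        self.calls.append({"input": text, "output": result})
        return result


splitter = RecordingSplitter()
splitter.craft("First sentence. Second one! A third?")
print(len(splitter.calls[0]["output"]))
```

The same pattern applies to the encoder side: override the method, call the parent implementation, and store what went in and what came out.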

This shows:

[input for CustomSenticizer] - one string per invocation of the craft function.
[output for CustomSenticizer] - dict of sentences and meta info.
[input for CustomEncoder] - 5-element np.array of long strings.
[output for CustomEncoder] - array of shape (5, 28996).

That means the arrays represent entire texts, not the chunks. This contradicts the following information from the jina docs:

This way a single document contains N different Chunks that are later independently encoded by a downstream encoder. This lets Jina query the index using a short sentence as input, where similarity search can be applied to find the most common Chunks. This way the same Document can be retrieved based on searching different parts of it.
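To make the expected behaviour concrete, here is a small sketch that contrasts document-level and chunk-level encoding. It does not use jina: `split_into_chunks` and `toy_encode` are hypothetical helpers, and the "embedding" is a placeholder two-number vector rather than a real model output.

```python
import re


def split_into_chunks(document):
    # Hypothetical sentence splitter: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def toy_encode(texts):
    # Placeholder embedding: one 2-dim vector (characters, words) per input text.
    return [[float(len(t)), float(len(t.split()))] for t in texts]


documents = [
    "Jina splits text. Each sentence is a chunk.",
    "One short document.",
]

# Document-level encoding: one vector per document (what the issue observed).
doc_vectors = toy_encode(documents)

# Chunk-level encoding: one vector per sentence, keeping the parent document id
# so a matching chunk can be traced back to its document.
chunks = [
    (doc_id, chunk)
    for doc_id, document in enumerate(documents)
    for chunk in split_into_chunks(document)
]
chunk_vectors = toy_encode([chunk for _, chunk in chunks])

print(len(doc_vectors), len(chunk_vectors))
```

With chunk-level encoding the number of vectors follows the number of sentences (three here), not the number of documents (two), which is the difference the shape (5, 28996) above points to.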

What is your guess?

The Crafter does not send the chunks but the entire text as the input to the Encoder.

Environment

jina                          0.4.1
jina-proto                    0.0.55
jina-vcs-tag                  (unset)
libzmq                        4.3.2
pyzmq                         1.19.1
protobuf                      3.13.0
proto-backend                 cpp
grpcio                        1.31.0
ruamel.yaml                   0.16.10
python                        3.8.2
platform                      Linux
platform-release              4.19.76-linuxkit
platform-version              #1 SMP Tue May 26 11:42:35 UTC 2020
architecture                  x86_64
processor                     x86_64
jina-resources                /usr/local/lib/python3.8/dist-packages/jina/resources
JINA_ARRAY_QUANT              (unset)
JINA_CONTRIB_MODULE           (unset)
JINA_CONTRIB_MODULE_IS_LOADING(unset)
JINA_CONTROL_PORT             (unset)
JINA_DEFAULT_HOST             (unset)
JINA_EXECUTOR_WORKDIR         (unset)
JINA_FULL_CLI                 (unset)
JINA_IPC_SOCK_TMP             (unset)
JINA_LOG_FILE                 (unset)
JINA_LOG_LONG                 (unset)
JINA_LOG_NO_COLOR             (unset)
JINA_LOG_PROFILING            (unset)
JINA_LOG_SSE                  (unset)
JINA_LOG_VERBOSITY            (unset)
JINA_POD_NAME                 (unset)
JINA_PROFILING                (unset)
JINA_SOCKET_HWM               (unset)
JINA_STACK_CONFIG             (unset)
JINA_TEST_CONTAINER           (unset)
JINA_TEST_GPU                 (unset)
JINA_TEST_PRETRAINED          (unset)
JINA_VCS_VERSION              (unset)
JINA_VERSION                  (unset)
JINA_WARN_UNNAMED             (unset)
JINA_BINARY_DELIMITER         (unset)
JINA_DISABLE_UVLOOP           (unset)


ArturTanona commented 4 years ago

After a small refactor to put it in line with jina 0.5.0 it works like a charm (almost).

nan-wang commented 4 years ago

@ArturTan Thanks for trying out jina! What's the status quo on this issue? With the recursive Document structure, the concept of Chunk is deprecated in v0.5.0. Here is a guide for migration https://github.com/jina-ai/jina/issues/702.

ArturTanona commented 4 years ago

Yep, it works fine. PS. v0.5.0 is a great release! I really appreciate it.