jina-ai / jina

☁️ Build multimodal AI applications with cloud-native stack
https://docs.jina.ai
Apache License 2.0

Cannot serialize subfield objects properly #6137

Closed: oytuntez closed this issue 7 months ago

oytuntez commented 7 months ago

I have two docarray document classes, one of which nests the other:

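from docarray import BaseDoc, DocList
from docarray.documents import ImageDoc, TextDoc

class QuoteFile(BaseDoc):
    quote_file_id: int = None
    texts: DocList[TextDoc] = None
    images: DocList[ImageDoc] = None

# SearchResult nests QuoteFile, which itself nests TextDoc and ImageDoc
class SearchResult(BaseDoc):
    results: DocList[QuoteFile] = None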

There are several layers of issues with this schema. The main one that comes to mind:

When the QuoteFile schema is handled on its own, the definition of QuoteFile.texts is correct. Then SearchResult is processed, and the first QuoteFile definition is overwritten by the one derived from SearchResult.results (DocList[QuoteFile]). This second QuoteFile definition is missing the item type for the QuoteFile.texts field: the JSON schema still declares it as an array, but the $ref to the item type is gone, unlike the first time.

If I change SearchResult to use something else, everything is fine again.

SearchResult.results schema looks like this:

"results": {
            "title": "Results",
            "type": "array",
            "items": {
              "$ref": "#/definitions/QuoteFile"
            }
          },

while QuoteFile.texts looks like this:

"texts": {
                "title": "Texts",
                "type": "array",
                "items": {}
              },
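
A quick way to spot this in a larger schema is to scan it for array properties whose "items" entry is empty. The following is only a minimal sketch using plain dictionaries: find_untyped_arrays is a hypothetical helper (not part of jina or docarray), and the schema literal is hand-written to mirror the two fragments above rather than dumped from a running Flow.

```python
from typing import Any, Dict, List


def find_untyped_arrays(schema: Dict[str, Any]) -> List[str]:
    """Return 'Model.field' names of array properties whose `items` is empty."""
    problems: List[str] = []

    def check_properties(owner: str, model_schema: Dict[str, Any]) -> None:
        for field, spec in model_schema.get("properties", {}).items():
            # An array with no `items` (or `items: {}`) has lost its element type.
            if spec.get("type") == "array" and not spec.get("items"):
                problems.append(f"{owner}.{field}")

    # Check the top-level model, then every model listed under `definitions`.
    check_properties(schema.get("title", "<root>"), schema)
    for name, definition in schema.get("definitions", {}).items():
        check_properties(name, definition)
    return problems


# Hand-written schema (assumption) combining the two fragments quoted above.
search_result_schema = {
    "title": "SearchResult",
    "type": "object",
    "properties": {
        "results": {
            "title": "Results",
            "type": "array",
            "items": {"$ref": "#/definitions/QuoteFile"},
        }
    },
    "definitions": {
        "QuoteFile": {
            "title": "QuoteFile",
            "type": "object",
            "properties": {
                "quote_file_id": {"title": "Quote File Id", "type": "integer"},
                "texts": {"title": "Texts", "type": "array", "items": {}},
            },
        }
    },
}

print(find_untyped_arrays(search_result_schema))  # ['QuoteFile.texts']
```

Running the same check against the schema that each Executor actually exposes should show whether the item type is already missing there or only goes missing later.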

This all works fine with a single Deployment, but fails with a Flow with multiple Executors (apparently because the cached schema is somehow overwritten).

In general, DocList fields have been quite problematic when they are Optional/None (I already have a fork for that and will open a PR upstream later). What do you think? I've been dealing with these type issues all week.

JoanFM commented 7 months ago

It would be helpful to have the topology of the Flow having the issue.

oytuntez commented 7 months ago

Test case, files:

flow.py
shapes.py
executors/initial-executor/config.yml
executors/initial-executor/executor.py
executors/debug-executor/config.yml
executors/debug-executor/executor.py

Object definitions in ./shapes.py:

from docarray import BaseDoc, DocList
from docarray.documents import ImageDoc, TextDoc

class QuoteFile(BaseDoc):
    quote_file_id: int = None
    texts: DocList[TextDoc] = None
    images: DocList[ImageDoc] = None

class SearchResult(BaseDoc):
    results: DocList[QuoteFile] = None

Executor 1 in ./executors/initial-executor/executor.py:

# YAML CONFIG
# jtype: InitialExecutor
# py_modules:
#   - executor.py
from docarray import DocList
from jina import Executor, requests
from shapes import QuoteFile, SearchResult

class InitialExecutor(Executor):
    @requests(on='/index')
    async def index(self, docs: DocList[QuoteFile], **_) -> DocList[QuoteFile]:
        return docs

    @requests(on='/search')
    async def search(self, docs: DocList[SearchResult], **_) -> DocList[SearchResult]:
        return docs

Executor 2 in ./executors/debug-executor/executor.py:

# YAML CONFIG
# jtype: DebugExecutor
# py_modules:
#   - executor.py

from docarray import DocList
from jina import Executor, requests

from shapes import QuoteFile, SearchResult

class DebugExecutor(Executor):
    @requests(on='/index')
    def index(self, docs: DocList[QuoteFile], **_) -> DocList[QuoteFile]:
        docs.summary()
        return docs

    @requests(on='/search')
    def search(self, docs: DocList[SearchResult], **_) -> DocList[SearchResult]:
        docs.summary()
        return docs

Flow in ./flow.py:

import os
from jina import Flow

os.environ['JINA_LOG_LEVEL'] = 'DEBUG'
f = (
    Flow(protocol='http')
    .config_gateway(protocol='HTTP', port=54635, title='Document Intelligence')
    .add(name='initial', uses='executors/initial-executor/config.yml')
    .add(name='debug', uses='executors/debug-executor/config.yml')
)

with f:
    f.block()

Example HTTP JSON request to /index:

POST /index HTTP/1.1
Host: 0.0.0.0:54635
Content-Type: application/json
Content-Length: 531

{
    "data": [
        {
            "id": "999",
            "quote_file_id": "999",
            "process_method": "tms",
            "images": [
                {
                    "url": "https://picsum.photos/536/999"

                }
            ],
            "texts": [
                {
                    "text": "Hello world"
                }
            ],
            "extracted_data": {
                "year": "2025"
            }
        }
    ]
}

Current response:

{
    "data": [
        {
            "id": "999",
            "quote_file_id": 999,
            "images": [
                {
                    "id": "a6143166cdd0190732dafb25c7e47c83"
                }
            ],
            "texts": [
                {
                    "id": "f684a9eaa6bee5eecfe0c72a76d94a34"
                }
            ]
        }
    ],
    "parameters": {},
    "header": {
        "requestId": "50be40c4b70b4ed9b47b5d230e6d98f4",
        "targetExecutor": ""
    }
}

Issue: QuoteFile.texts and QuoteFile.images are not accessible. This is also visible in the response, which returns only the id field for each nested document. The incoming request is recognized as containing the texts and images fields, but every field inside them is None; they are not parsed into DocList[TextDoc] etc.

Note that SearchResult is not even used in the /index call. If I change the type of SearchResult.results from DocList[QuoteFile] to something else, such as DocList[TextDoc], my /index executors can access the QuoteFile.texts data from the incoming request, and the response contains it, like this:
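
To make the symptom checkable without reading the JSON by eye, the request and response bodies can be compared field by field. This is a rough sketch, not something jina provides: find_stripped_docs is a hypothetical helper, and the two dictionaries are trimmed copies of the request and the current response shown above.

```python
from typing import Any, Dict, List


def find_stripped_docs(request_doc: Dict[str, Any], response_doc: Dict[str, Any]) -> List[str]:
    """Report nested list entries whose request fields are absent from the response."""
    stripped: List[str] = []
    for field, sent_items in request_doc.items():
        if not isinstance(sent_items, list):
            continue  # only nested document lists such as `texts` and `images`
        returned_items = response_doc.get(field) or []
        for index, sent in enumerate(sent_items):
            if not isinstance(sent, dict):
                continue
            returned = returned_items[index] if index < len(returned_items) else {}
            missing = [key for key in sent if key not in returned]
            if missing:
                stripped.append(f"{field}[{index}] lost {missing}")
    return stripped


# Trimmed copies of the /index request and the current response from this issue.
request_doc = {
    "id": "999",
    "quote_file_id": "999",
    "images": [{"url": "https://picsum.photos/536/999"}],
    "texts": [{"text": "Hello world"}],
}
response_doc = {
    "id": "999",
    "quote_file_id": 999,
    "images": [{"id": "a6143166cdd0190732dafb25c7e47c83"}],
    "texts": [{"id": "f684a9eaa6bee5eecfe0c72a76d94a34"}],
}

for line in find_stripped_docs(request_doc, response_doc):
    print(line)
# images[0] lost ['url']
# texts[0] lost ['text']
```

With the expected response (where url and text come back), the same call should return an empty list.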

Updated SearchResult object:

class SearchResult(BaseDoc):
    results: DocList[TextDoc] = None # changed from DocList[QuoteFile]

Expected response (or response when SearchResult.results is not DocList[QuoteFile]):

{
    "data": [
        {
            "id": "999",
            "quote_file_id": 999,
            "images": [
                {
                    "id": null,
                    "tensor": null,
                    "bytes_": null,
                    "embedding": null,
                    "url": "https://picsum.photos/536/999"
                }
            ],
            "texts": [
                {
                    "id": null,
                    "text": "Hello world",
                    "bytes_": null,
                    "embedding": null,
                    "url": null
                }
            ]
        }
    ],
    "parameters": {},
    "header": {
        "requestId": "0827b5a26262400eb271b4fc947ccb2f",
        "targetExecutor": ""
    }
}

As you can see, I can now access the QuoteFile.texts and QuoteFile.images objects and see them in the response.

JoanFM commented 7 months ago

Minimal reproducible example:

import os

os.environ['JINA_LOG_LEVEL'] = 'DEBUG'
from docarray import DocList, BaseDoc
from docarray.documents.text import TextDoc
from docarray.documents.image import ImageDoc
from jina import Executor, requests, Flow, Deployment

class QuoteFile(BaseDoc):
    quote_file_id: int = None
    texts: DocList[TextDoc] = None
    images: DocList[ImageDoc] = None

class SearchResult(BaseDoc):
    results: DocList[QuoteFile] = None

class InitialExecutor(Executor):

    @requests(on='/search')
    async def search(self, docs: DocList[SearchResult], **kwargs) -> DocList[SearchResult]:
        return docs

f = (
    Flow(protocol='http', port=54635)
        .add(name='initial', uses=InitialExecutor)
)

with f:
    f.block()

curl -X 'POST' \
  'http://0.0.0.0:54635/search' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "data": [
    {
      "id": "8dcfb23eead3b328df927f92a40d4e73",
      "results": [
        {
            "id": "999",
            "quote_file_id": "999",
            "process_method": "tms",
            "images": [
                {
                    "url": "https://picsum.photos/536/999"

                }
            ],
            "texts": [
                {
                    "text": "Hello world"
                }
            ],
            "extracted_data": {
                "year": "2025"
            }
        }
      ]
    }
  ],
  "parameters": {},
  "header": {
    "requestId": "031ac4eb0cb30719c4a85d17c66ad861",
    "targetExecutor": ""
  }
}'

{"data":[{"id":"8dcfb23eead3b328df927f92a40d4e73","results":[{"id":"999","quote_file_id":999,"images":[{"id":"94c6ab885c14f814d1236b565b99d53f"}],"texts":[{"id":"2ee68947bff740ee2da98072e8b10d46"}]}]}],"parameters":{},"header":{"requestId":"031ac4eb0cb30719c4a85d17c66ad861","targetExecutor":""}}

JoanFM commented 7 months ago

May I ask you for the jina, docarray and pydantic versions you are using?
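
One way to collect these is with the standard library's importlib.metadata, which reads the versions of installed distributions. The helper name installed_versions is just an illustration, and the sketch assumes the packages are installed under the distribution names jina, docarray and pydantic.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names are assumptions; a fork may be installed under another name.
for name, ver in installed_versions(["jina", "docarray", "pydantic"]).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```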

JoanFM commented 7 months ago

An even more simplified version of the issue:

from docarray import DocList, BaseDoc
from docarray.documents.text import TextDoc
from docarray.documents.image import ImageDoc

class QuoteFile(BaseDoc):
    quote_file_id: int = None
    texts: DocList[TextDoc]
    images: DocList[ImageDoc] = None

class SearchResult(BaseDoc):
    results: DocList[QuoteFile] = None

from jina.serve.runtimes.helper import _create_aux_model_doc_list_to_list
from jina.serve.runtimes.helper import _create_pydantic_model_from_schema

models_created_by_name = {}
SearchResult_exec = _create_aux_model_doc_list_to_list(SearchResult)
SearchResult_gateway = _create_pydantic_model_from_schema(SearchResult_exec.schema(), 'SearchResult',
                                                          models_created_by_name)

QuoteFile_exec_exposed = _create_aux_model_doc_list_to_list(QuoteFile)
QuoteFile_gateway_reconstructed_if_alone = _create_pydantic_model_from_schema(
    QuoteFile_exec_exposed.schema(),
    'QuoteFile',
    {})
QuoteFile_reconstructed_in_gateway_from_Search_results = models_created_by_name['QuoteFile']
textlist = DocList[TextDoc]([TextDoc(text='hey')])
simple_object = QuoteFile(texts=textlist)
print(f'simple_object {simple_object} => {simple_object.to_json()}')
Executor_exposed_object = QuoteFile_exec_exposed(texts=textlist)
print(f'Executor_exposed_object {Executor_exposed_object} => {Executor_exposed_object.to_json()}')
Gateway_reconstructed_if_alone_object = QuoteFile_gateway_reconstructed_if_alone(texts=textlist)
print(f'Gateway_reconstructed_if_alone_object {Gateway_reconstructed_if_alone_object} => {Gateway_reconstructed_if_alone_object.to_json()}')
reconstructed_in_gateway_from_Search_results = QuoteFile_reconstructed_in_gateway_from_Search_results(texts=textlist)
print(f'Gateway_reconstructed_with_search_result {reconstructed_in_gateway_from_Search_results} => {reconstructed_in_gateway_from_Search_results.to_json()}')

This shows that the algorithm that reconstructs a Doc class from its schema seems to work when applied to QuoteFile directly, but not when QuoteFile is reconstructed as a child of SearchResult.

oytuntez commented 7 months ago

pydantic==1.10.14, jina==3.23.2, docarray==0.40, plus our minor fork at motaword/docarray