Closed oytuntez closed 7 months ago
It would be helpful to have the topology of the Flow having the issue.
Test case, files:
flow.py
shapes.py
executors/initial-executor/config.yml
executors/initial-executor/executor.py
executors/debug-executor/config.yml
executors/debug-executor/executor.py
Object definitions in ./shapes.py:
class QuoteFile(BaseDoc):
    quote_file_id: int = None
    texts: DocList[TextDoc] = None
    images: DocList[ImageDoc] = None

class SearchResult(BaseDoc):
    results: DocList[QuoteFile] = None
Executor 1 in ./executors/initial-executor/executor.py:
# YAML CONFIG
# jtype: InitialExecutor
# py_modules:
# - executor.py
from docarray import DocList
from jina import Executor, requests
from shapes import QuoteFile, SearchResult

class InitialExecutor(Executor):
    @requests(on='/index')
    async def index(self, docs: DocList[QuoteFile], **_) -> DocList[QuoteFile]:
        return docs

    @requests(on='/search')
    async def search(self, docs: DocList[SearchResult], **_) -> DocList[SearchResult]:
        return docs
Executor 2 in ./executors/debug-executor/executor.py:
# YAML CONFIG
# jtype: DebugExecutor
# py_modules:
# - executor.py
from docarray import DocList
from jina import Executor, requests
from shapes import QuoteFile, SearchResult

class DebugExecutor(Executor):
    @requests(on='/index')
    def index(self, docs: DocList[QuoteFile], **_) -> DocList[QuoteFile]:
        docs.summary()
        return docs

    @requests(on='/search')
    def search(self, docs: DocList[SearchResult], **_) -> DocList[SearchResult]:
        docs.summary()
        return docs
Flow in ./flow.py:
import os
from jina import Flow

os.environ['JINA_LOG_LEVEL'] = 'DEBUG'

f = (
    Flow(protocol='http')
    .config_gateway(protocol='HTTP', port=54635, title='Document Intelligence')
    .add(name='initial', uses='executors/initial-executor/config.yml')
    .add(name='debug', uses='executors/debug-executor/config.yml')
)

with f:
    f.block()
Example HTTP JSON request to /index:
POST /index HTTP/1.1
Host: 0.0.0.0:54635
Content-Type: application/json
Content-Length: 531
{
"data": [
{
"id": "999",
"quote_file_id": "999",
"process_method": "tms",
"images": [
{
"url": "https://picsum.photos/536/999"
}
],
"texts": [
{
"text": "Hello world"
}
],
"extracted_data": {
"year": "2025"
}
}
]
}
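For anyone who prefers to reproduce the request from Python rather than a raw HTTP client, here is a minimal sketch using only the standard library. The helper name build_index_request is hypothetical; the URL, port and payload are copied from the request above. It only builds the request object, and the commented-out lines show how it could be sent once the Flow is running.

```python
import json
import urllib.request


def build_index_request(base_url: str = 'http://0.0.0.0:54635') -> urllib.request.Request:
    """Build (but do not send) the POST /index request shown above.

    build_index_request is a hypothetical helper, not part of jina.
    """
    payload = {
        'data': [
            {
                'id': '999',
                'quote_file_id': '999',
                'process_method': 'tms',
                'images': [{'url': 'https://picsum.photos/536/999'}],
                'texts': [{'text': 'Hello world'}],
                'extracted_data': {'year': '2025'},
            }
        ]
    }
    return urllib.request.Request(
        url=f'{base_url}/index',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


request = build_index_request()

# To send it against a running Flow:
# with urllib.request.urlopen(request) as resp:
#     print(json.loads(resp.read().decode('utf-8')))
```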
Current response:
{
"data": [
{
"id": "999",
"quote_file_id": 999,
"images": [
{
"id": "a6143166cdd0190732dafb25c7e47c83"
}
],
"texts": [
{
"id": "f684a9eaa6bee5eecfe0c72a76d94a34"
}
]
}
],
"parameters": {},
"header": {
"requestId": "50be40c4b70b4ed9b47b5d230e6d98f4",
"targetExecutor": ""
}
}
Issue: QuoteFile.texts and QuoteFile.images are not accessible. This can also be seen in the response, which returns only the id field for each nested document. The request is recognized as containing the fields texts and images, but every field inside them is None; they are not deserialized into DocList[TextDoc] etc.
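The symptom can be checked mechanically. The sketch below is a hypothetical helper (id_only_docs is not part of jina or docarray) that walks a response and lists every nested object whose only key is id. The sample data is the "Current response" from above with the header omitted.

```python
import json
from typing import Any, List


def id_only_docs(node: Any, path: str = '') -> List[str]:
    """Return the paths of all dicts that contain nothing but an 'id' key."""
    found: List[str] = []
    if isinstance(node, dict):
        if set(node) == {'id'}:
            found.append(path)
        for key, value in node.items():
            found.extend(id_only_docs(value, f'{path}.{key}' if path else key))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(id_only_docs(item, f'{path}[{index}]'))
    return found


# "Current response" from the issue, header omitted.
current_response = json.loads('''
{
  "data": [
    {
      "id": "999",
      "quote_file_id": 999,
      "images": [{"id": "a6143166cdd0190732dafb25c7e47c83"}],
      "texts": [{"id": "f684a9eaa6bee5eecfe0c72a76d94a34"}]
    }
  ],
  "parameters": {}
}
''')

stripped = id_only_docs(current_response['data'], 'data')
print(stripped)  # ['data[0].images[0]', 'data[0].texts[0]']
```

An empty list would mean that no nested document lost its fields.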
Note that SearchResult is not even used in the /index call. And if I change the type of SearchResult.results from DocList[QuoteFile] to something else, such as DocList[TextDoc], then my /index executors can access the QuoteFile.texts data from the incoming request, and the response will contain it, like this:
Updated SearchResult object:
class SearchResult(BaseDoc):
    results: DocList[TextDoc] = None  # changed from DocList[QuoteFile]
Expected response (or response when SearchResult.results is not DocList[QuoteFile]):
{
"data": [
{
"id": "999",
"quote_file_id": 999,
"images": [
{
"id": null,
"tensor": null,
"bytes_": null,
"embedding": null,
"url": "https://picsum.photos/536/999"
}
],
"texts": [
{
"id": null,
"text": "Hello world",
"bytes_": null,
"embedding": null,
"url": null
}
]
}
],
"parameters": {},
"header": {
"requestId": "0827b5a26262400eb271b4fc947ccb2f",
"targetExecutor": ""
}
}
As you can see, I can now access the QuoteFile.texts and QuoteFile.images objects and see them in the response.
Minimal reproducible example:
import os
os.environ['JINA_LOG_LEVEL'] = 'DEBUG'

from docarray import DocList, BaseDoc
from docarray.documents.text import TextDoc
from docarray.documents.image import ImageDoc
from jina import Executor, requests, Flow, Deployment

class QuoteFile(BaseDoc):
    quote_file_id: int = None
    texts: DocList[TextDoc] = None
    images: DocList[ImageDoc] = None

class SearchResult(BaseDoc):
    results: DocList[QuoteFile] = None

class InitialExecutor(Executor):
    @requests(on='/search')
    async def search(self, docs: DocList[SearchResult], **kwargs) -> DocList[SearchResult]:
        return docs

f = (
    Flow(protocol='http', port=54635)
    .add(name='initial', uses=InitialExecutor)
)

with f:
    f.block()
curl -X 'POST' \
'http://0.0.0.0:54635/search' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"data": [
{
"id": "8dcfb23eead3b328df927f92a40d4e73",
"results": [
{
"id": "999",
"quote_file_id": "999",
"process_method": "tms",
"images": [
{
"url": "https://picsum.photos/536/999"
}
],
"texts": [
{
"text": "Hello world"
}
],
"extracted_data": {
"year": "2025"
}
} ]
}
],
"parameters": {},
"header": {
"requestId": "031ac4eb0cb30719c4a85d17c66ad861",
"targetExecutor": ""
}
}'
{"data":[{"id":"8dcfb23eead3b328df927f92a40d4e73","results":[{"id":"999","quote_file_id":999,"images":[{"id":"94c6ab885c14f814d1236b565b99d53f"}],"texts":[{"id":"2ee68947bff740ee2da98072e8b10d46"}]}]}],"parameters":{},"header":{"requestId":"031ac4eb0cb30719c4a85d17c66ad861","targetExecutor":""}}
May I ask which jina, docarray and pydantic versions you are using?
An even more simplified version of the issue:
from docarray import DocList, BaseDoc
from docarray.documents.text import TextDoc
from docarray.documents.image import ImageDoc
class QuoteFile(BaseDoc):
    quote_file_id: int = None
    texts: DocList[TextDoc]
    images: DocList[ImageDoc] = None

class SearchResult(BaseDoc):
    results: DocList[QuoteFile] = None

from jina.serve.runtimes.helper import _create_aux_model_doc_list_to_list
from jina.serve.runtimes.helper import _create_pydantic_model_from_schema

models_created_by_name = {}
SearchResult_exec = _create_aux_model_doc_list_to_list(SearchResult)
SearchResult_gateway = _create_pydantic_model_from_schema(
    SearchResult_exec.schema(),
    'SearchResult',
    models_created_by_name,
)
QuoteFile_exec_exposed = _create_aux_model_doc_list_to_list(QuoteFile)
QuoteFile_gateway_reconstructed_if_alone = _create_pydantic_model_from_schema(
    QuoteFile_exec_exposed.schema(),
    'QuoteFile',
    {},
)
QuoteFile_reconstructed_in_gateway_from_Search_results = models_created_by_name['QuoteFile']
textlist = DocList[TextDoc]([TextDoc(text='hey')])
simple_object = QuoteFile(texts=textlist)
print(f'simple_object {simple_object} => {simple_object.to_json()}')
Executor_exposed_object = QuoteFile_exec_exposed(texts=textlist)
print(f'Executor_exposed_object {Executor_exposed_object} => {Executor_exposed_object.to_json()}')
Gateway_reconstructed_if_alone_object = QuoteFile_gateway_reconstructed_if_alone(texts=textlist)
print(f'Gateway_reconstructed_if_alone_object {Gateway_reconstructed_if_alone_object} => {Gateway_reconstructed_if_alone_object.to_json()}')
reconstructed_in_gateway_from_Search_results = QuoteFile_reconstructed_in_gateway_from_Search_results(texts=textlist)
print(f'Gateway_reconstructed_with_search_result {reconstructed_in_gateway_from_Search_results} => {reconstructed_in_gateway_from_Search_results.to_json()}')
This shows that the algorithm that reconstructs a Doc object from its schema seems to work when applied to the model directly, but not when the model is built as a child of another model.
pydantic==1.10.14, jina==3.23.2, docarray==0.40, plus our minor fork at motaword/docarray
I have 2 docarray objects, one using the other one:
There are layers of issues with this schema, but here are some off the top of my head:
When the QuoteFile schema is handled the first time, the QuoteFile.texts definition is correct. Then comes SearchResult, and while it is being processed, the first QuoteFile definition is overwritten by the definition of SearchResult.results (QuoteFile), but this second QuoteFile is missing the type for the QuoteFile.texts field. The JSON schema does show the field as an array, but its type $ref is missing, unlike the first time. If I change SearchResult to use something else, everything is fine again.
The SearchResult.results schema looks like this:
while QuoteFile.texts looks like this:
This all works fine with a single Deployment, but fails (somehow because of the schema cache overwriting behavior) with a Flow with multiple executors.
Generally, using DocList fields has been quite problematic with Optional/None fields (for which I already have a fork and will PR upstream later). What do you think? I've been dealing with these type issues all week.
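To make the suspected overwrite concrete, here is a small standard-library sketch of the failure mode as I understand it. It is not jina's actual code: register and arrays_missing_item_ref are hypothetical helpers, and both schema fragments are hand-written for illustration rather than copied from real output. The assumption is a cache keyed only by model name, where a later, less complete definition replaces the earlier one.

```python
from typing import Any, Dict, List


def arrays_missing_item_ref(schema: Dict[str, Any]) -> List[str]:
    """Return the names of array properties whose items carry no '$ref'."""
    missing = []
    for name, prop in schema.get('properties', {}).items():
        if prop.get('type') == 'array' and '$ref' not in prop.get('items', {}):
            missing.append(name)
    return missing


def register(cache: Dict[str, Dict[str, Any]], name: str, schema: Dict[str, Any]) -> None:
    """Naive name-keyed cache: a later schema with the same name wins."""
    cache[name] = schema


# Illustrative fragment: QuoteFile as seen on its own, item type intact.
quote_file_standalone = {
    'title': 'QuoteFile',
    'properties': {
        'quote_file_id': {'type': 'integer'},
        'texts': {'type': 'array', 'items': {'$ref': '#/definitions/TextDoc'}},
    },
}

# Illustrative fragment: QuoteFile as reached through SearchResult.results,
# where the item type of 'texts' has been lost.
quote_file_nested = {
    'title': 'QuoteFile',
    'properties': {
        'quote_file_id': {'type': 'integer'},
        'texts': {'type': 'array'},
    },
}

cache: Dict[str, Dict[str, Any]] = {}
register(cache, 'QuoteFile', quote_file_standalone)
before = arrays_missing_item_ref(cache['QuoteFile'])

register(cache, 'QuoteFile', quote_file_nested)  # overwrites the good definition
after = arrays_missing_item_ref(cache['QuoteFile'])

print(before, after)  # [] ['texts']
```

Running a check like arrays_missing_item_ref over the schemas the gateway actually holds could help confirm whether this is what happens in the multi-executor Flow.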