nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

Add status check before parsing the json string #74

Closed (oreh closed this 2 months ago)

oreh commented 2 months ago

This is related to the issue https://github.com/nlmatics/llmsherpa/issues/64.

The root cause of this issue is that the client does not check the response status and tries to parse the non-JSON content returned by the server. Because the status error is never surfaced, users only see the JSON parsing exception.

With this patch, users will see the raw content returned by the server:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <timed exec>:5

File ~/miniconda3/envs/py11/lib/python3.11/site-packages/llmsherpa/readers/file_reader.py:73, in LayoutPDFReader.read_pdf(self, path_or_url, contents)
     71 parser_response = self._parse_pdf(pdf_file)
     72 if parser_response.status > 200:
---> 73     raise ValueError(f"{parser_response.data}")
     74 response_json = json.loads(parser_response.data.decode("utf-8"))
     75 blocks = response_json['return_dict']['result']['blocks']

ValueError: b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

Instead of the less informative JSON parsing error:

--------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File <timed exec>:5

File ~/miniconda3/envs/py11/lib/python3.11/site-packages/llmsherpa/readers/file_reader.py:72, in LayoutPDFReader.read_pdf(self, path_or_url, contents)
     70             pdf_file = (file_name, file_data, 'application/pdf')
     71 parser_response = self._parse_pdf(pdf_file)
---> 72 response_json = json.loads(parser_response.data.decode("utf-8"))
     73 blocks = response_json['return_dict']['result']['blocks']
     74 return Document(blocks)

File ~/miniconda3/envs/py11/lib/python3.11/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    341     s = s.decode(detect_encoding(s), 'surrogatepass')
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:
    348     cls = JSONDecoder

File ~/miniconda3/envs/py11/lib/python3.11/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    332 def decode(self, s, _w=WHITESPACE.match):
    333     """Return the Python representation of ``s`` (a ``str`` instance
    334     containing a JSON document).
    335 
    336     """
--> 337     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338     end = _w(s, end).end()
    339     if end != len(s):

File ~/miniconda3/envs/py11/lib/python3.11/json/decoder.py:355, in JSONDecoder.raw_decode(self, s, idx)
    353     obj, end = self.scan_once(s, idx)
    354 except StopIteration as err:
--> 355     raise JSONDecodeError("Expecting value", s, err.value) from None
    356 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
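
The check described above can be sketched in isolation roughly as follows. This is a minimal illustration, not the actual llmsherpa code: `FakeResponse` and `parse_json_response` are hypothetical names, and the only assumption taken from the traceback is that the response object exposes `status` and `data` attributes.

```python
import json
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response exposing `status` and `data`."""
    status: int
    data: bytes


def parse_json_response(response):
    """Return the decoded JSON body, or raise ValueError carrying the raw body."""
    # Same condition as in the patch: any status above 200 is treated as a failure,
    # so the caller sees what the server actually sent instead of a JSONDecodeError.
    if response.status > 200:
        raise ValueError(f"{response.data}")
    return json.loads(response.data.decode("utf-8"))


# A successful response with the nested structure the reader expects.
ok = FakeResponse(200, b'{"return_dict": {"result": {"blocks": []}}}')
# An HTML error page, similar to the 403 response shown in the traceback.
forbidden = FakeResponse(403, b"<html><head><title>403 Forbidden</title></head></html>")

blocks = parse_json_response(ok)["return_dict"]["result"]["blocks"]

try:
    parse_json_response(forbidden)
    error_message = None
except ValueError as err:
    # The message now contains the raw server body, including the status text.
    error_message = str(err)
```

Note that `status > 200` also rejects other 2xx codes such as 201; whether that matters depends on what the parsing server can return, which this thread does not specify.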
akashsonowal commented 2 weeks ago

Hi @ansukla, @oreh

It is good to have the visibility, but the server-side parser is still an issue. Any ideas on that? From a user's standpoint, it is hard to correct the PDFs, as there can be a large volume of documents.