nlmatics / llmsherpa

Developer APIs to Accelerate LLM Projects
https://www.nlmatics.com
MIT License

pdf_reader.read_pdf(pdf_url) cannot read local pdf path #102

Open Tizzzzy opened 3 months ago

Tizzzzy commented 3 months ago

Hi, I am trying to run this code:

from llmsherpa.readers import LayoutPDFReader

# llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
llmsherpa_api_url = "http://localhost:5001/api/parseDocument?renderFormat=all&useNewIndentParser=true"
pdf_url = "C:/Users/super/OneDrive/Desktop/....pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

When I set pdf_url to a local path, I get this error:

---------------------------------------------------------------------------
LocationValueError                        Traceback (most recent call last)
Cell In[35], line 7
      5 pdf_url = "C:/Users/super/OneDrive/Desktop/vertisim_ai/BidSmart/code/context_aware_parse/nlm-ingestor/qt625910hd_noSplash_7a7f8e7e4ab806cd0a32fe4adde0cf28.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
      6 pdf_reader = LayoutPDFReader(llmsherpa_api_url)
----> 7 doc = pdf_reader.read_pdf(pdf_url)

File c:\Users\super\anaconda3\envs\nlm-ingestor\lib\site-packages\llmsherpa\readers\file_reader.py:65, in LayoutPDFReader.read_pdf(self, path_or_url, contents)
     63 is_url = urlparse(path_or_url).scheme != ""
     64 if is_url:
---> 65     pdf_file = self._download_pdf(path_or_url)
     66 else:
     67     file_name = os.path.basename(path_or_url)

File c:\Users\super\anaconda3\envs\nlm-ingestor\lib\site-packages\llmsherpa\readers\file_reader.py:36, in LayoutPDFReader._download_pdf(self, pdf_url)
     34 # add authorization headers if using external API (see upload_pdf for an example)
     35 download_headers = {"User-Agent": user_agent}
---> 36 download_response = self.download_connection.request("GET", pdf_url, headers=download_headers)
     37 file_name = os.path.basename(urlparse(pdf_url).path)
     38 # note you can change the file name here if you'd like to something else

File c:\Users\super\anaconda3\envs\nlm-ingestor\lib\site-packages\urllib3\request.py:74, in RequestMethods.request(self, method, url, fields, headers, **urlopen_kw)
     71 urlopen_kw["request_url"] = url
     73 if method in self._encode_url_methods:
---> 74     return self.request_encode_url(
     75         method, url, fields=fields, headers=headers, **urlopen_kw
     76     )
     77 else:
     78     return self.request_encode_body(
     79         method, url, fields=fields, headers=headers, **urlopen_kw
     80     )

File c:\Users\super\anaconda3\envs\nlm-ingestor\lib\site-packages\urllib3\request.py:96, in RequestMethods.request_encode_url(self, method, url, fields, headers, **urlopen_kw)
     93 if fields:
     94     url += "?" + urlencode(fields)
---> 96 return self.urlopen(method, url, **extra_kw)

File c:\Users\super\anaconda3\envs\nlm-ingestor\lib\site-packages\urllib3\poolmanager.py:364, in PoolManager.urlopen(self, method, url, redirect, **kw)
    361 u = parse_url(url)
    362 self._validate_proxy_scheme_url_selection(u.scheme)
--> 364 conn = self.connection_from_host(u.host, port=u.port, scheme=u.scheme)
    366 kw["assert_same_host"] = False
    367 kw["redirect"] = False

File c:\Users\super\anaconda3\envs\nlm-ingestor\lib\site-packages\urllib3\poolmanager.py:236, in PoolManager.connection_from_host(self, host, port, scheme, pool_kwargs)
    225 """
    226 Get a :class:`urllib3.connectionpool.ConnectionPool` based on the host, port, and scheme.
    227 
   (...)
    232 needed.
    233 """
    235 if not host:
--> 236     raise LocationValueError("No host specified.")
    238 request_context = self._merge_pool_kwargs(pool_kwargs)
    239 request_context["scheme"] = scheme or "http"

LocationValueError: No host specified.

shanshanRT commented 2 months ago

Having the same issue

muhammad-ammar12 commented 2 months ago

Same issue! Does anyone have a solution?

amitsaini8445 commented 2 months ago

Use a relative path to the PDF.
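
For anyone wondering why a relative path helps: the traceback shows that read_pdf decides between "download" and "open from disk" with `urlparse(path_or_url).scheme != ""`. A Windows path that starts with a drive letter parses with a one-letter scheme, so it is routed to the download branch and urllib3 fails with "No host specified." Here is a minimal sketch of that behaviour using only the standard library. The paths are made up, and `is_remote_url` is a hypothetical stricter check for illustration, not something llmsherpa provides.

```python
from urllib.parse import urlparse


def scheme_of(path_or_url: str) -> str:
    """Return the scheme that urlparse detects ('' if there is none)."""
    return urlparse(path_or_url).scheme


def is_remote_url(path_or_url: str) -> bool:
    """Hypothetical stricter check: only http(s) counts as a URL."""
    return scheme_of(path_or_url) in ("http", "https")


# Made-up example inputs, not taken from the issue.
windows_path = "C:/Users/me/docs/report.pdf"
relative_path = "docs/report.pdf"
web_url = "https://example.com/report.pdf"

for candidate in (windows_path, relative_path, web_url):
    # The drive letter is reported as a scheme; the relative path has none.
    print(
        f"{candidate!r}: scheme={scheme_of(candidate)!r}, "
        f"remote={is_remote_url(candidate)}"
    )
```

The signature in the traceback, `read_pdf(self, path_or_url, contents)`, suggests that the file bytes can also be passed in directly. Reading the PDF with `open(path, "rb")` and passing the result as `contents` may therefore avoid the URL check entirely, but that is an assumption based on the signature alone and should be checked against the installed version.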