oughtinc / ice

Interactive Composition Explorer: a debugger for compositional language model programs
https://ice.ought.org
MIT License
Unable to parse PDFs, "Failed to resolve 'test.elicit.org'" #327

Open mathcass opened 3 months ago

mathcass commented 3 months ago

When trying to run the "Loading paper text" chapter from the Primer, I run into an error indicating that it can't find "test.elicit.org". Since paper.parse_pdf depends on this remote resource to parse the PDF, it can't proceed at all.

Here's a full trace of what I see:

Full trace

python recipes/paper_hello.py --paper papers/keenan-2018.pdf
/home/cass/src/ice/venv/lib/python3.11/site-packages/pydantic/_migration.py:283: UserWarning: `pydantic.generics:GenericModel` has been moved to `pydantic.BaseModel`.
  warnings.warn(f'`{import_path}` has been moved to `{new_location}`.')
/home/cass/src/ice/venv/lib/python3.11/site-packages/pydantic/_internal/_config.py:334: UserWarning: Valid config keys have changed in V2:
* 'keep_untouched' has been renamed to 'ignored_types'
* 'fields' has been removed
  warnings.warn(message, UserWarning)
Traceback (most recent call last):
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/socket.py", line 961, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -2] Name or service not known

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 491, in _make_request
    raise new_e
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
    conn.connect()
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connection.py", line 616, in connect
    self.sock = sock = self._new_conn()
                       ^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connection.py", line 205, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: : Failed to resolve 'test.elicit.org' ([Errno -2] Name or service not known)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/adapters.py", line 589, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='test.elicit.org', port=443): Max retries exceeded with url: /elicit-previews/james/oug-3083-support-parsing-arbitrary-pdfs-using/parse_pdf (Caused by NameResolutionError(": Failed to resolve 'test.elicit.org' ([Errno -2] Name or service not known)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cass/src/ice/recipes/paper_hello.py", line 10, in
    recipe.main(answer_for_paper)
  File "/home/cass/src/ice/ice/recipe.py", line 176, in main
    defopt.run(
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/defopt.py", line 348, in run
    call = bind(
           ^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/defopt.py", line 255, in bind
    call, rest = _bind_or_bind_known(*args, _known=False, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/defopt.py", line 203, in _bind_or_bind_known
    args, rest = parser.parse_args(argv), []
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 1862, in parse_args
    args, argv = self.parse_known_args(args, namespace)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 1895, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 2103, in _parse_known_args
    start_index = consume_optional(start_index)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 2043, in consume_optional
    take_action(action, args, option_string)
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 1955, in take_action
    argument_values = self._get_values(action, argument_strings)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 2485, in _get_values
    value = self._get_value(action, arg_string)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/.pyenv/versions/3.11.0/lib/python3.11/argparse.py", line 2518, in _get_value
    result = type_func(arg_string)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/ice/recipe.py", line 181, in
    Paper: lambda path: Paper.load(Path(path)),
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/ice/paper.py", line 158, in load
    paragraph_dicts = parse_pdf(file)
                      ^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/ice/cache.py", line 28, in sync_wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/ice/paper.py", line 119, in parse_pdf
    r = requests.post(PDF_PARSER_URL, files=files)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cass/src/ice/venv/lib/python3.11/site-packages/requests/adapters.py", line 622, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='test.elicit.org', port=443): Max retries exceeded with url: /elicit-previews/james/oug-3083-support-parsing-arbitrary-pdfs-using/parse_pdf (Caused by NameResolutionError(": Failed to resolve 'test.elicit.org' ([Errno -2] Name or service not known)"))
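The innermost error in the trace above is socket.gaierror raised by getaddrinfo, so the request fails at DNS lookup before any connection is attempted. For anyone who wants to confirm that on their own machine, here is a minimal sketch using only the standard library. The helper name can_resolve is made up for this example and is not part of ICE.

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the system resolver can map `host` to an address.

    Hypothetical helper for illustration only; it is not part of ICE.
    """
    try:
        # Same lookup that urllib3 performs before opening a connection.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Raised for failures such as "[Errno -2] Name or service not known".
        return False
    return True
```

Calling can_resolve("test.elicit.org") and getting False would suggest the hostname itself no longer resolves, rather than a problem with the PDF or the recipe.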

❓ Is there an alternative that folks recommend for PDF parsing here?

TommyBark commented 2 months ago

I quick-fixed this in https://github.com/TommyBark/ice/tree/fix-parse_pdf by using the pdfminer.six package. The semantic chunking is not very reliable, since it is based on HTML parsing and not all PDFs work nicely with it, but it works as a proof of concept for the Factored Cognition Primer examples.
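For readers who only want a rough idea of what local parsing with pdfminer.six can look like, here is a minimal sketch. It is not the code from the branch above: it uses plain-text extraction via pdfminer.high_level.extract_text and splits on blank lines instead of parsing HTML. The names split_paragraphs and parse_pdf_locally are made up for this example, and the result is a plain list of strings, which may not match the structure that ICE's parse_pdf is expected to return.

```python
import re
from pathlib import Path


def split_paragraphs(text: str) -> list[str]:
    """Split text on blank lines and collapse whitespace inside each chunk.

    A naive heuristic (assumption): blank lines are treated as paragraph breaks.
    """
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def parse_pdf_locally(path: Path) -> list[str]:
    """Extract text from a PDF on disk and return a list of paragraph strings.

    Hypothetical stand-in for illustration; not ICE's parse_pdf.
    """
    # Imported lazily so split_paragraphs can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return split_paragraphs(extract_text(str(path)))
```

With pdfminer.six installed, parse_pdf_locally(Path("papers/keenan-2018.pdf")) would return the extracted paragraphs for the paper used in the report above. Multi-column layouts, headers, and footnotes will likely need more careful handling than this.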