iterative / datachain

AI-data warehouse to enrich, transform and analyze data from cloud storages
https://docs.datachain.ai
Apache License 2.0

UDFs are hard to debug #106

Closed shcheklein closed 1 month ago

shcheklein commented 3 months ago

When you run code like this:

def pdf_chunks(file: File) -> Iterator[Chunk]:

    chunks = []
    if len(chunks) > 3:
        # Mind this line, it is causing an obvious IndexError
        print(chunks[100000])

dc = (
    DataChain.from_storage(source)
    .filter(C.name.glob("*.pdf"))
    .gen(document=pdf_chunks)
)

dc

It leads to a traceback like this, with the UDF frame buried at the very bottom:

Traceback (most recent call last):
  File "<string>", line 48, in <module>
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/query/dataset.py", line 1771, in query_wrapper
    _send_result(dataset_query)
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/query/dataset.py", line 1720, in _send_result
    preview = preview_query.to_records()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/query/dataset.py", line 1293, in to_records
    return self.results(lambda cols, row: dict(zip(cols, row)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/lib/dc.py", line 564, in results
    return list(rows)
           ^^^^^^^^^^
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/lib/dc.py", line 563, in <genexpr>
    rows = (row_factory(db_signals, r) for r in rows)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/lib/dc.py", line 554, in iterate_flatten
    with super().select(*db_signals).as_iterable() as rows:
  File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/query/dataset.py", line 1236, in as_iterable
    query = self.apply_steps().select()
            ^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/query/dataset.py", line 1179, in apply_steps
    result = step.apply(
             ^^^^^^^^^^^
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/query/dataset.py", line 631, in apply
    self.populate_udf_table(udf_table, query)
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/query/dataset.py", line 549, in populate_udf_table
    process_udf_outputs(
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/query/dataset.py", line 399, in process_udf_outputs
    for row in udf_output:
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/query/udf.py", line 147, in <genexpr>
    return (dict(zip(self.signal_names, row)) for row in results)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ivan/Projects/pdf-datachain-demo/.venv/lib/python3.12/site-packages/datachain/lib/udf.py", line 204, in <genexpr>
    res = (
          ^
  File "<string>", line 34, in pdf_chunks
IndexError: list index out of range
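
Only the last frame of the traceback above points at user code; everything else is under site-packages. As a rough illustration of how such a trace could be trimmed, here is a minimal sketch using only the standard library. The helper `user_frames` and the stand-in `pdf_chunks_stub` are hypothetical names made up for this example and are not part of datachain.

```python
import traceback


def user_frames(exc: BaseException, skip_markers=("site-packages",)) -> list[str]:
    """Return 'file:line in func' strings for frames outside library code."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [
        f"{f.filename}:{f.lineno} in {f.name}"
        for f in frames
        # Drop any frame whose file path contains one of the markers.
        if not any(marker in f.filename for marker in skip_markers)
    ]


def pdf_chunks_stub(chunks):
    # Hypothetical stand-in for the UDF in the report: indexes past the end.
    return chunks[100000]


try:
    pdf_chunks_stub([1, 2, 3, 4])
except IndexError as exc:
    summary = user_frames(exc)
    print(f"{type(exc).__name__}: {exc}")
    for line in summary:
        print("  " + line)
```

Filtering on a path substring is a crude heuristic; it assumes library code lives under a directory named site-packages, which holds for the virtualenv paths shown in the traceback but not for every installation.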

dmpetrov commented 3 months ago

There is also no easy way to set a breakpoint inside a UDF (?)

It does work when running in a single thread.
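
One way to get the single-thread behavior without involving the pipeline at all is to call the UDF directly on a few sample inputs. The sketch below assumes a generator-style UDF; `run_udf_inline` and `split_words` are hypothetical names for illustration and are not datachain APIs.

```python
from collections.abc import Callable, Iterable, Iterator
from typing import Any


def run_udf_inline(
    udf: Callable[[Any], Iterable[Any]], items: Iterable[Any]
) -> Iterator[Any]:
    """Call a generator-style UDF on each item in the current thread."""
    for item in items:
        # A breakpoint() placed inside `udf` stops in this thread,
        # with a short stack: this loop and the UDF itself.
        yield from udf(item)


def split_words(text: str) -> Iterator[str]:
    # Hypothetical stand-in for a UDF such as pdf_chunks.
    # Uncomment the next line to step through one input in pdb.
    # breakpoint()
    for word in text.split():
        yield word


results = list(run_udf_inline(split_words, ["a b", "c"]))
print(results)
```

This bypasses the library entirely, so it only checks the UDF's own logic on hand-picked inputs; it does not reproduce how the chain feeds rows to the function.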

shcheklein commented 1 month ago

Closing in favor of https://github.com/iterative/datachain/issues/360, which should resolve most of the issues here. We can come back to this afterwards.