instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

resolve_ocr_options() causes RHEL AI sdg with PDF to hang indefinitely #410

Closed relyt0925 closed 33 minutes ago

relyt0925 commented 12 hours ago

I have not been able to get a lower-level debug log, but when I run sdg on a sample PDF document on RHEL AI, resolve_ocr_options() hangs indefinitely.

Steps to reproduce:

1) On RHEL AI, run ilab data generate against a taxonomy that references a PDF. The example I used is here: https://github.com/relyt0925/taxonomy-doclingpoc/tree/main

2) Look at the logs: when resolve_ocr_options runs, the process hangs indefinitely after the last line shown here:

INFO 2024-11-26 03:22:51,619 instructlab.sdg.utils.taxonomy:147: Processing files...
INFO 2024-11-26 03:22:51,620 instructlab.sdg.utils.taxonomy:153: Pattern 'phoenix.pdf' matched 1 files.
INFO 2024-11-26 03:22:51,620 instructlab.sdg.utils.taxonomy:157: Processing file: /root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf
INFO 2024-11-26 03:22:51,620 instructlab.sdg.utils.taxonomy:172: Loading PDF document from /root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf
INFO 2024-11-26 03:22:51,622 instructlab.sdg.utils.taxonomy:182: PDF '/root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf' has 6 pages.
INFO 2024-11-26 03:22:56,486 instructlab.sdg.utils.taxonomy:218: Unloaded PDF document: /root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf
INFO 2024-11-26 03:22:59,815 instructlab.sdg.generate_data:408: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-11-26 03:23:01,883 instructlab.sdg.utils.chunkers:393: Successfully loaded tokenizer from: /instructlab/models/mixtral-8x7b-instruct-v0-1
INFO 2024-11-26 03:23:05,050 instructlab.sdg.utils.chunkers:255: Found the docling models

I built a custom image with an SDG patch that comments out that section (https://github.com/relyt0925/sdg/commit/08343204e6fda0ae5473f9e99a8b77271ca77bde), then reran it, and the run now gets to the point of processing documents:

time="2024-11-26T04:09:27Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
INFO 2024-11-26 04:09:29,412 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO 2024-11-26 04:09:29,413 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2024-11-26 04:09:29,413 numexpr.utils:161: NumExpr defaulting to 16 threads.
INFO 2024-11-26 04:09:30,512 datasets:59: PyTorch version 2.4.1 available.
INFO 2024-11-26 04:09:32,013 instructlab.data.generate_data:87: Generating synthetic data using '/usr/share/instructlab/sdg/pipelines/agentic' pipeline, '/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/root/taxonomy-doclingpoc/' taxonomy, against https://781d2e7c-us-east.lb.appdomain.cloud/v1 server
INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:147: Processing files...
INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:153: Pattern 'phoenix.pdf' matched 1 files.
INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:157: Processing file: /root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf
INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:172: Loading PDF document from /root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf
INFO 2024-11-26 04:09:32,404 instructlab.sdg.utils.taxonomy:182: PDF '/root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf' has 6 pages.
INFO 2024-11-26 04:09:37,265 instructlab.sdg.utils.taxonomy:218: Unloaded PDF document: /root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf
INFO 2024-11-26 04:09:40,545 instructlab.sdg.generate_data:408: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-11-26 04:09:42,690 instructlab.sdg.utils.chunkers:393: Successfully loaded tokenizer from: /instructlab/models/mixtral-8x7b-instruct-v0-1
INFO 2024-11-26 04:09:45,790 instructlab.sdg.utils.chunkers:255: Found the docling models
INFO 2024-11-26 04:09:46,050 docling.document_converter:202: Going to convert document batch...

My custom test image is quay.io/relyt09250/testinstructlabbuilds:121withsdgpatch

bbrowning commented 4 hours ago

Hmm - a hang is interesting. I suspect you're on a machine that doesn't have a working tesseract install (or at least the command environment ilab is running in doesn't have a working one) and it's falling back to EasyOCR in resolve_ocr_options. EasyOCR will attempt to download model files from their GitHub releases at this point. Do these machines have limited network connectivity? Perhaps a firewall or something else is stalling EasyOCR's attempts to download its model weights?
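One way to test the connectivity theory is a bounded TCP probe, so that a silently dropped connection shows up as a failure instead of an open-ended wait. This is only a sketch: `can_connect` is a hypothetical helper, not part of sdg or EasyOCR, and the exact host EasyOCR downloads from is not stated in this thread beyond "GitHub releases".

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical diagnostic helper; it only checks that the TCP handshake
    completes, not that a full HTTPS download would succeed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, DNS failures, and timeouts are all OSError subclasses.
        return False
```

Calling `can_connect("github.com")` from the same environment that runs ilab would show whether port 443 is reachable within five seconds. A success here does not rule out a stall later in the download, for example behind a proxy that accepts the connection and then never forwards data.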

relyt0925 commented 1 hour ago

The machines do have limited outbound network connectivity, although access to GitHub over HTTPS (port 443) is allowed; SSH-based connections are not.

I would have expected a network failure to eventually error out rather than hang, but I am not sure exactly which line of code within the function the program was hanging on.
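To find the exact line a Python process is stuck on, the standard library's `faulthandler` module can dump every thread's traceback after a delay. The sketch below assumes the watchdog can be armed inside the process before the suspect call; how to hook it into ilab is left open, and the helper names are hypothetical.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s: float = 60.0, log_file=sys.stderr) -> None:
    """Dump all thread tracebacks to `log_file` every `timeout_s` seconds.

    `log_file` must be a real file object with a file descriptor.
    The process keeps running after each dump (exit=False).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file, exit=False)


def disarm_hang_watchdog() -> None:
    """Cancel the pending dump once the suspect code has returned."""
    faulthandler.cancel_dump_traceback_later()
```

If the call returns in time, `disarm_hang_watchdog()` cancels the timer and nothing is printed. If it hangs, the dump shows the innermost frame of each thread, which would distinguish a blocked network read from, say, a lock or a model load.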

bbrowning commented 1 hour ago

And you're certain the process is hung, right? Because when trying to reproduce this on some test machines here, the process actually died with an OSError about a missing system library instead of hanging. In your setup, would you be able to tell a crashed process apart from a hung one?
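A crash and a hang can be told apart mechanically by running the command under a time limit and inspecting how it ends. The following is a minimal sketch using `subprocess.run`; `classify_run` is a hypothetical helper, and a timeout only shows that the process did not finish in the allotted time, not that it would never finish.

```python
import subprocess
from typing import Sequence


def classify_run(cmd: Sequence[str], timeout_s: float) -> str:
    """Run `cmd` and return 'ok', 'crashed', or 'timeout'."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    # A non-zero exit code covers uncaught exceptions such as an OSError.
    return "ok" if result.returncode == 0 else "crashed"
```

For a real run, the timeout would need to be generous enough to cover legitimate model loading and document conversion, so the value is a judgment call for each environment.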

relyt0925 commented 1 hour ago

Yes. The environment where I got it to hang, and where I then validated the fix, was a custom patch of RHEL AI 1.2, because I also saw that library issue on RHEL AI 1.3.

The patch installed InstructLab .21 with pip, and the test that succeeded then added my sdg patch on top of RHEL AI 1.2.

bbrowning commented 41 minutes ago

Ahh, ok - if your system environment was based on RHEL AI 1.2, then it would make sense that it's falling back to EasyOCR, because Tesseract wouldn't be installed and set up to work properly on that system (unless you installed those packages and set up things like TESSDATA_PREFIX yourself). The actual hang is still interesting, but will likely not be something hit in a RHEL AI 1.3 environment.
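A quick way to see whether a given environment looks ready for Tesseract is to check for the binary on PATH and for a usable TESSDATA_PREFIX. This is a hedged sketch, not the logic of resolve_ocr_options: `describe_tesseract_env` is a hypothetical helper, and the real function may probe Python bindings or other conditions that this check does not cover.

```python
import os
import shutil


def describe_tesseract_env(environ=None, which=shutil.which) -> dict:
    """Summarize Tesseract-related settings visible to this process.

    `environ` and `which` are injectable so the check can be exercised
    without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    tessdata = environ.get("TESSDATA_PREFIX")
    return {
        # Full path to the tesseract executable, or None if not on PATH.
        "tesseract_binary": which("tesseract"),
        # Raw value of TESSDATA_PREFIX, or None if unset.
        "tessdata_prefix": tessdata,
        # True only if TESSDATA_PREFIX is set and points at a directory.
        "tessdata_dir_exists": bool(tessdata) and os.path.isdir(tessdata),
    }
```

Running `describe_tesseract_env()` inside the same container or shell that runs ilab matters here, since the comment above notes that the command environment may differ from the host.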

relyt0925 commented 34 minutes ago

Aha!!!! sounds great thank you!

relyt0925 commented 33 minutes ago

It was invalid of me to patch RHEL AI 1.2 to bring in instructlab .21, so this hang is expected.