instructlab / sdg

Python library for Synthetic Data Generation
https://pypi.org/project/instructlab-sdg/
Apache License 2.0

Empty dataset error kills the workflow #240

Closed: aakankshaduggal closed this issue 1 month ago

aakankshaduggal commented 3 months ago

When the qna.yaml is not appropriate or the wrong model is used, generation fails and raises an error: instructlab.sdg.pipeline.EmptyDatasetError: Pipeline stopped: Empty dataset after running pipe

Proposed solution:

relyt0925 commented 2 months ago

+1 to this: I noticed this as well

marceloleitner commented 1 month ago

It is getting better with this patch, but it would be nicer if the error gave some hint about possible reasons. Something like "please ensure the number of examples is enough", "please make sure it follows the guidelines at HTTP", or similar. You will know better.
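To make the suggestion concrete, here is a rough sketch of what a hinted message could look like. It is only an illustration: the EmptyDatasetError class is a local stand-in for instructlab.sdg.pipeline.EmptyDatasetError, and describe_empty_dataset and the hint texts are hypothetical names, not part of the library.

```python
class EmptyDatasetError(Exception):
    """Local stand-in for instructlab.sdg.pipeline.EmptyDatasetError (assumption)."""


# Hypothetical hint texts, paraphrasing the causes mentioned in this thread.
EMPTY_DATASET_HINTS = (
    "ensure the qna.yaml has enough examples",
    "make sure the qna.yaml follows the contribution guidelines",
    "check that the model in use is the one the pipeline expects",
)


def describe_empty_dataset(error: Exception) -> str:
    """Return the original error text followed by a list of possible causes."""
    lines = [str(error), "Possible causes:"]
    lines.extend(f"  - {hint}" for hint in EMPTY_DATASET_HINTS)
    return "\n".join(lines)


if __name__ == "__main__":
    try:
        raise EmptyDatasetError("Pipeline stopped: Empty dataset after running pipe")
    except EmptyDatasetError as exc:
        print(describe_empty_dataset(exc))
```

The original message stays on the first line, so anyone searching for the existing error text would still find it.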

What I know is that I just spent a day debugging this issue. I could only understand the reason after I found the issue that led to this MR, https://github.com/instructlab/sdg/issues/240

bbrowning commented 1 month ago

@marceloleitner Those are reasonable suggestions, although I'd ask that they go in a separate issue. They are less about handling an empty dataset without crashing and more a request for better logging when something fails during generation, so that the user gets more indication of the potential causes of that type of failure.
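For reference, the narrower behaviour discussed here (an empty dataset not killing the whole workflow) can be sketched roughly as below. This is an assumption-laden illustration, not the patch that closed this issue: EmptyDatasetError is a local stand-in for the library's exception, and generate_per_source and its generate callback are hypothetical names.

```python
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger(__name__)


class EmptyDatasetError(Exception):
    """Local stand-in for instructlab.sdg.pipeline.EmptyDatasetError (assumption)."""


def generate_per_source(
    sources: Iterable[str],
    generate: Callable[[str], List[dict]],
) -> Dict[str, List[dict]]:
    """Run `generate` for each source, skipping sources that yield no data.

    `generate` is a hypothetical callback standing in for a pipeline run.
    """
    results: Dict[str, List[dict]] = {}
    for name in sources:
        try:
            results[name] = generate(name)
        except EmptyDatasetError as exc:
            # Log and continue instead of letting one empty dataset end the run.
            logger.warning("Skipping %s: %s", name, exc)
    return results
```

Catching only EmptyDatasetError keeps other failures visible; adding hints about likely causes would be a separate change on top of this.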