instructlab / community

InstructLab community-wide collaboration space, including contributing, security, code of conduct, etc.
Apache License 2.0

Support for other unstructured data like PDF, Word, etc. #309

Open gnthaker opened 3 months ago

gnthaker commented 3 months ago

It would be good if we could provide PDFs or other unstructured data from which to generate synthetic data.

lhawthorn commented 3 months ago

Thank you for filing this issue, @gnthaker! It helps us keep track of it. We discussed in last night's triage meeting that we also want to create better tooling for PDF --> Markdown conversion and generally make data ingestion a less cumbersome process. As we are a young project and moving fast, I am not sure where on our roadmap this will land timing-wise.

Once again, thank you for filing this issue and ensuring we don't lose track of this clear need.

lhawthorn commented 3 months ago

The community members I know of who have talked the most about this need are on the Triage team. If you want to talk to them about scoping this work, you can find them in #triage on InstructLab Slack.

jjasghar commented 3 months ago

Yep, @gnthaker, please reach out; we have some thoughts and suggestions to get something off the ground, but nothing formalized in a pipeline or anything.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.