helixml / helix

Multi-node production AI stack. Run the best of open source AI easily on your own servers. Create your own AI by fine-tuning open source models. Integrate LLMs with APIs. Run gptscript securely on the server.
https://tryhelix.ai

check URLs have text #58

Open binocarlos opened 8 months ago

binocarlos commented 8 months ago

Some URLs serve nothing but JavaScript and break unstructured. We need a better error for this case, for example: https://www.reuters.com/legal/colorado-ballot-case-adds-fuel-trumps-nomination-drive-2023-12-20/
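One way to surface a clearer error for JavaScript-only pages is to check how much visible text the raw HTML contains before handing it to the extractor. The sketch below uses only the Python standard library. The names `visible_text`, `require_visible_text`, `NoExtractableTextError` and the `min_chars` threshold are hypothetical and are not part of Helix or unstructured.

```python
from html.parser import HTMLParser


class NoExtractableTextError(ValueError):
    """Hypothetical error for pages that contain no readable text."""


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside <script>, <style> and <noscript>."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the human-readable text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def require_visible_text(url, html, min_chars=200):
    """Raise a descriptive error if the page has too little visible text.

    min_chars is an assumed threshold; tune it for real pages.
    """
    text = visible_text(html)
    if len(text) < min_chars:
        raise NoExtractableTextError(
            f"Could not extract readable text from {url}: the page appears "
            f"to be rendered by JavaScript ({len(text)} visible characters)."
        )
    return text
```

A page that only ships a script bundle and an empty root element yields an empty string from `visible_text`, so `require_visible_text` raises before any extraction or training step runs.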

lukemarsden commented 8 months ago

Yeah, also if we don't get any text then don't start the fine-tuning process. Just ask the user to paste the text in instead: "Sorry, we couldn't extract any text from that URL, please copy and paste it instead."

lukemarsden commented 7 months ago

I tried to train on a published Notion page and it extracted no data at all. Mixtral then hallucinated loads of QA pairs about photosynthesis, deep learning, and other random shit. We should catch this before we even start training: if there's no text in the training set, just throw an error and ask the user to either add more documents or report the text-extraction problem with the given documents/URLs to us.
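A guard along these lines could run just before training starts. This is a minimal sketch, assuming the extracted text is available as a mapping from source (file name or URL) to a string. `check_training_text`, `EmptyTrainingSetError` and `min_total_chars` are hypothetical names, not Helix's actual API.

```python
class EmptyTrainingSetError(ValueError):
    """Hypothetical error raised when no usable text was extracted."""


def check_training_text(documents, min_total_chars=1):
    """Validate extracted text before any QA-pair generation or training.

    documents: dict mapping a source (file name or URL) to its extracted text.
    Returns the list of sources that produced no text, so the caller can warn
    about them. Raises EmptyTrainingSetError if the whole set is empty.
    """
    empty_sources = [
        source for source, text in documents.items()
        if not (text or "").strip()
    ]
    total_chars = sum(len((text or "").strip()) for text in documents.values())

    if total_chars < min_total_chars:
        raise EmptyTrainingSetError(
            "No text could be extracted from the provided documents or URLs. "
            "Please add more documents, or report the extraction problem to us."
        )
    return empty_sources
```

Returning the empty sources instead of failing on the first one lets a session with one bad URL and several good documents still proceed, while a fully empty set stops before the model is asked to generate anything.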