Open nkidambi opened 1 year ago
I can handle document loading myself by uploading files to the appropriate blob container. Website content extraction needs a clearer specification of the supported formats (HTML, TXT, JSON, etc.). Ideally, a ready-made scraper script or function would be provided. In my case, I have access to the database and can extract content via SQL queries, but the supported output format is not clear yet.
@dmitri012, the supported document formats are available at https://github.com/microsoft/PubSec-Info-Assistant/blob/main/docs/features/features.md#supported-document-types
I'd really like to see this take priority, as it is a requirement for just about every customer my team works with. Usually, when we tell them this feature is not available in this repo, they use a different repo that already has it, and they miss out on all the great work that has gone into this one.
Feature request
Many federal customers have public documents (PDFs) and websites (including FAQs) that they would like to search using Info-Assistant.
Additional Details
Support crawling and extracting content from a URL or website, recursing up to a configurable depth. Also support filtering out certain URLs, such as forms and pages that call APIs (for example, an office locator), and/or certain domains.
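
To make the request more concrete, here is a minimal sketch of what a depth-limited crawler with URL and domain filtering could look like. It is not code from this repo and uses only the Python standard library. The names `crawl`, `fetch_html`, `LinkParser`, `max_depth`, `exclude_patterns`, and `blocked_domains` are all hypothetical, chosen for illustration. It also skips things a real crawler would need, such as robots.txt handling, rate limiting, and retries.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher (assumption): returns page text, or "" for non-HTML."""
    with urlopen(url, timeout=10) as resp:
        if "text/html" not in resp.headers.get("Content-Type", ""):
            return ""
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=2, exclude_patterns=(), blocked_domains=(),
          fetch=fetch_html):
    """Breadth-first crawl from start_url, following links up to max_depth.

    exclude_patterns: regexes; any URL matching one is skipped (e.g. forms).
    blocked_domains: hostnames that are never visited.
    Returns a dict mapping each visited URL to its HTML.
    """
    excluded = [re.compile(p) for p in exclude_patterns]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}

    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html

        if depth >= max_depth:
            continue  # depth limit reached: keep the page, do not follow links

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments to avoid duplicates.
            link = urldefrag(urljoin(url, href)).url
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https"):
                continue
            if parsed.netloc in blocked_domains:
                continue
            if any(p.search(link) for p in excluded):
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return pages
```

A call such as `crawl("https://example.gov/", max_depth=2, exclude_patterns=[r"/forms/", r"/office-locator"], blocked_domains={"other.example.com"})` would illustrate the configuration asked for above (the URL and patterns are placeholders). The extracted HTML could then be uploaded to the blob container like any other supported document type.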