The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate experiments and evaluations using Azure Cognitive Search and the RAG pattern.
Refactor code related to embedded text extraction. The embedded text extraction code is moved from unstructured-inference to unstructured.
Features
Large improvements to the ingest process:
- Support for multiprocessing and async, with limits for both.
- Streamlined the process of mapping CLI invocations to the underlying code.
- More granular steps introduced to give better control over the process (e.g. a dedicated step to uncompress files already in the local filesystem, and a new optional staging step before upload).
- Use the Python client when calling the Unstructured API for partitioning or chunking.
- Saving the final content is now a dedicated destination connector (local), set as the default if none are provided. This avoids adding new files locally if uploading elsewhere.
- Leverage the last-modified date when deciding whether new files should be downloaded and reprocessed.
Add attribution to the Pinecone connector
Add support for Python 3.12. unstructured now works with Python 3.12!
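As a rough illustration of "multiprocessing and async, with limits for both", the sketch below caps concurrent coroutines with a semaphore and caps worker processes with a bounded pool. It uses only the Python standard library; `fetch`, `parse`, `download_all`, `run`, and the limit values are hypothetical names chosen for this example and do not mirror the ingest code in unstructured.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

MAX_CONNECTIONS = 4  # hypothetical cap on concurrent async tasks
MAX_PROCESSES = 2    # hypothetical cap on worker processes


def parse(doc: str) -> int:
    """CPU-bound stand-in for partitioning: count the words in a document."""
    return len(doc.split())


async def fetch(name: str, limiter: asyncio.Semaphore) -> str:
    """I/O-bound stand-in for a download, gated by the shared semaphore."""
    async with limiter:
        await asyncio.sleep(0.01)
        return f"contents of {name}"


async def download_all(names: list[str]) -> list[str]:
    # At most MAX_CONNECTIONS fetches hold the semaphore at any moment.
    limiter = asyncio.Semaphore(MAX_CONNECTIONS)
    return await asyncio.gather(*(fetch(n, limiter) for n in names))


def run(names: list[str]) -> list[int]:
    # Async stage first, then fan the CPU-bound work out to a bounded pool.
    docs = asyncio.run(download_all(names))
    with ProcessPoolExecutor(max_workers=MAX_PROCESSES) as pool:
        return list(pool.map(parse, docs))
```

When calling `run([...])` from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.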
0.14.0
BREAKING CHANGES
Turn table extraction for PDFs and images off by default. Reverting the default behavior for table extraction to "off" for PDFs and images. A number of users didn't realize we made the change and were impacted by slower processing times due to the extra model call for table extraction.
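For callers who relied on the previous default, table extraction now has to be requested explicitly. The following is a minimal sketch, assuming that the `infer_table_structure` keyword of `partition_pdf()` together with the `hi_res` strategy is what controls this in the pinned release (worth verifying against the version you install). The helper names `table_kwargs` and `partition_with_tables` are hypothetical.

```python
from typing import Any


def table_kwargs(extract_tables: bool) -> dict[str, Any]:
    """Build keyword arguments for partition_pdf(); the names are assumptions to verify."""
    kwargs: dict[str, Any] = {"infer_table_structure": extract_tables}
    if extract_tables:
        # Table structure inference is assumed to need the layout-model strategy.
        kwargs["strategy"] = "hi_res"
    return kwargs


def partition_with_tables(path: str):
    # Imported lazily so this sketch loads even without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path, **table_kwargs(True))
```

Opting in this way brings back the extra model call mentioned above, so it is probably best enabled only for documents where table structure matters.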
Enhancements
Skip unnecessary element sorting in partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.
Faster evaluation. Support for concurrent processing of documents during evaluation.
Add strategy parameter to partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.
Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR configuration parameters to control temporary storage.
Features
Add form extraction basics (document elements and placeholder code in partition). This lays the groundwork for future work. Form extraction models are not currently available in the library, so an attempt to use this functionality will result in a NotImplementedError.
Fixes
Add missing starting_page_num param to partition_image
Make the filename and file params for partition_image and partition_pdf match the other partitioners
Fix include_slide_notes and include_page_breaks params in partition_ppt
Re-apply: skip accuracy calculation feature. Overwritten by mistake.
Fix type hint for paragraph_grouper param. paragraph_grouper can be set to False, but the type hint did not reflect this previously.
Remove links param from partition_pdf. links is extracted during partitioning and is not needed as a parameter in partition_pdf.
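The paragraph_grouper type-hint fix can be pictured as widening an optional callable so that it also admits the literal `False`. The sketch below is illustrative only: `group_paragraphs`, `default_grouper`, and the `Grouper` alias are hypothetical, and the actual signature in unstructured may differ.

```python
from typing import Callable, Literal, Optional, Union

Grouper = Callable[[str], str]


def default_grouper(text: str) -> str:
    """Hypothetical default: join hard-wrapped lines into a single paragraph."""
    return " ".join(line.strip() for line in text.splitlines() if line.strip())


def group_paragraphs(
    text: str,
    paragraph_grouper: Union[Optional[Grouper], Literal[False]] = None,
) -> str:
    if paragraph_grouper is False:
        return text  # grouping explicitly disabled
    if paragraph_grouper is None:
        return default_grouper(text)  # fall back to the default behavior
    return paragraph_grouper(text)  # caller-supplied grouper
```

With the hint written this way, a type checker accepts `paragraph_grouper=False` as well as `None` or a callable.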
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Bumps unstructured from 0.11.8 to 0.14.2.
Release notes
Sourced from unstructured's releases.
... (truncated)
Changelog
Sourced from unstructured's changelog.
... (truncated)
Commits
- 18428f2 chore: bump unstructured-inference 0.7.33 (#3074)
- 30e5a0c rfctr(docx): organize docx tests (#3070)
- 7832dfc feat: add attribution for pinecone (#3067)
- b0d8a77 feat: partiton_pdf() set inferred elements text (#3061)
- 059fc64 build: apk add libreoffice24 (#3065)
- 3eaf65a feat: refactor ingest (#3009)
- 73739b3 docs: redirect to docs.unstructured.io on github pages (#3054)
- acda4d0 fix: set skip_infer_tables explicitly in `test_partition_via_api_with_no_st...
- 6066a26 fix: update container link in README.md (#2889)
- 60f10fe Updated Weaviate Docker image url (auto PR by bot) (#2659)