Bump unstructured from 0.11.8 to 0.14.2

Bumps unstructured from 0.11.8 to 0.14.2.

Release notes

0.14.2

Enhancements

Bump unstructured-inference==0.7.33.

Features

Add attribution to the pinecone connector.

0.14.1

Enhancements

Refactor code related to embedded text extraction. The embedded text extraction code is moved from unstructured-inference to unstructured.

Features

Large improvements to the ingest process:

Support for multiprocessing and async, with limits for both.

Streamlined the process when mapping CLI invocations to the underlying code

More granular steps introduced to give better control over process (i.e. dedicated step to uncompress files already in the local filesystem, new optional staging step before upload)

Use the python client when calling the unstructured api for partitioning or chunking

Saving the final content is now a dedicated destination connector (local) set as the default if none are provided. Avoids adding new files locally if uploading elsewhere.

Leverage last modified date when deciding if new files should be downloaded and reprocessed.

Add attribution to the pinecone connector

Add support for Python 3.12. unstructured now works with Python 3.12!
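The notes above mention multiprocessing and async support "with limits for both" but do not say how the limits are enforced. As a rough illustration of the general idea only (this is not the ingest code itself), a semaphore can cap how many asynchronous tasks run at once. The names `run_all`, `process_doc` and `max_concurrency` are hypothetical.

```python
import asyncio


async def run_all(doc_ids: list[str], max_concurrency: int = 2) -> tuple[list[str], int]:
    """Process documents concurrently, never more than max_concurrency at once.

    Returns the results in input order and the peak number of tasks
    that were in flight at the same time.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def process_doc(doc_id: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:  # waits here while the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for real I/O work
            in_flight -= 1
            return f"processed:{doc_id}"

    results = await asyncio.gather(*(process_doc(d) for d in doc_ids))
    return list(results), peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_all(["a", "b", "c", "d"], max_concurrency=2))
    print(results, peak)
```

The same pattern applies to process pools, where the pool size plays the role of the limit.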

0.14.0

BREAKING CHANGES

Turn table extraction for PDFs and images off by default. Reverting the default behavior for table extraction to "off" for PDFs and images. A number of users didn't realize we made the change and were impacted by slower processing times due to the extra model call for table extraction.
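Callers that relied on tables being extracted automatically will likely need to opt back in after this upgrade. A minimal sketch, assuming the `infer_table_structure` keyword of `partition_pdf` is the relevant switch (the release notes do not name it) and that it applies to the `hi_res` strategy. The helpers `pdf_partition_kwargs` and `partition_with_tables` are hypothetical.

```python
from typing import Any


def pdf_partition_kwargs(want_tables: bool) -> dict[str, Any]:
    """Build keyword arguments for partition_pdf (hypothetical helper).

    Assumption: `infer_table_structure` is the flag that turns table
    extraction on, and it is used together with the "hi_res" strategy.
    """
    kwargs: dict[str, Any] = {"strategy": "hi_res"}
    if want_tables:
        # Opt back in explicitly; as of 0.14.0 the default is off.
        kwargs["infer_table_structure"] = True
    return kwargs


def partition_with_tables(path: str):
    """Partition a PDF with table extraction requested explicitly."""
    # Imported lazily so this module loads even without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path, **pdf_partition_kwargs(want_tables=True))
```

Passing the flag explicitly keeps behavior stable regardless of which default a given release ships, at the cost of the extra model call mentioned above.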

Enhancements

Skip unnecessary element sorting in partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.

Faster evaluation. Support for concurrent processing of documents during evaluation.

Add strategy parameter to partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.

Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR configuration parameters to control temporary storage.
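The release notes do not say how GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR are supplied. A sketch, assuming they are read from environment variables of the same names. Both the helper `configure_working_dirs` and the directory layout are illustrative choices, not something the library prescribes.

```python
import os
import tempfile


def configure_working_dirs(base_dir: str) -> dict[str, str]:
    """Point both settings at sub-directories of base_dir (hypothetical helper).

    Assumption: unstructured reads these settings from environment
    variables with the same names, so they are set before partitioning.
    """
    settings = {
        "GLOBAL_WORKING_DIR": os.path.join(base_dir, "unstructured"),
        "GLOBAL_WORKING_PROCESS_DIR": os.path.join(base_dir, "unstructured", "proc"),
    }
    for name, path in settings.items():
        os.makedirs(path, exist_ok=True)  # make sure the directory exists
        os.environ[name] = path
    return settings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(configure_working_dirs(tmp))
```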

Features

Add form extraction basics (document elements and placeholder code in partition). This is to lay the groundwork for the future. Form extraction models are not currently available in the library. An attempt to use this functionality will end in a NotImplementedError.

Fixes

Add missing starting_page_num param to partition_image

Make the filename and file params for partition_image and partition_pdf match the other partitioners

Fix include_slide_notes and include_page_breaks params in partition_ppt

Re-apply: skip accuracy calculation feature. Overwritten by mistake.

Fix type hint for paragraph_grouper param. paragraph_grouper can be set to False, but the type hint did not reflect this previously.

Remove links param from partition_pdf. links is extracted during partitioning and is not needed as a parameter in partition_pdf.

... (truncated)

Changelog

Sourced from unstructured's changelog.

0.14.2

Enhancements

Bump unstructured-inference==0.7.33.

Features

Add attribution to the pinecone connector.

Fixes

0.14.1

Enhancements

Refactor code related to embedded text extraction. The embedded text extraction code is moved from unstructured-inference to unstructured.

Features

Large improvements to the ingest process:

Support for multiprocessing and async, with limits for both.

Streamlined the process when mapping CLI invocations to the underlying code

More granular steps introduced to give better control over process (i.e. dedicated step to uncompress files already in the local filesystem, new optional staging step before upload)

Use the python client when calling the unstructured api for partitioning or chunking

Saving the final content is now a dedicated destination connector (local) set as the default if none are provided. Avoids adding new files locally if uploading elsewhere.

Leverage last modified date when deciding if new files should be downloaded and reprocessed.

Add attribution to the pinecone connector

Add support for Python 3.12. unstructured now works with Python 3.12!
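The last-modified check in the ingest notes above is described only at a high level. One way such a decision could look for local files, sketched with the standard library. `needs_reprocessing` is a hypothetical helper, and comparing file modification times is an assumption, not the connector logic shipped in unstructured.

```python
import os
import tempfile
from pathlib import Path


def needs_reprocessing(source: Path, output: Path) -> bool:
    """Return True when output is missing or older than source."""
    if not output.exists():
        return True
    return source.stat().st_mtime > output.stat().st_mtime


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "doc.txt"
        out = Path(tmp) / "doc.json"
        src.write_text("hello")
        print(needs_reprocessing(src, out))  # output does not exist yet
        out.write_text("{}")
        # Force explicit timestamps so the comparison is deterministic.
        os.utime(src, (1_000, 1_000))
        os.utime(out, (2_000, 2_000))
        print(needs_reprocessing(src, out))
```

A remote connector would compare against the source system's reported modification date instead of a local file timestamp.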

Fixes

0.14.0

BREAKING CHANGES

Turn table extraction for PDFs and images off by default. Reverting the default behavior for table extraction to "off" for PDFs and images. A number of users didn't realize we made the change and were impacted by slower processing times due to the extra model call for table extraction.

Enhancements

Skip unnecessary element sorting in partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.

Faster evaluation. Support for concurrent processing of documents during evaluation.

Add strategy parameter to partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.

Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR configuration parameters to control temporary storage.

Features

Add form extraction basics (document elements and placeholder code in partition). This is to lay the groundwork for the future. Form extraction models are not currently available in the library. An attempt to use this functionality will end in a NotImplementedError.

Fixes

... (truncated)

Commits

18428f2 chore: bump unstructured-inference 0.7.33 (#3074)
30e5a0c rfctr(docx): organize docx tests (#3070)
7832dfc feat: add attribution for pinecone (#3067)
b0d8a77 feat: partiton_pdf() set inferred elements text (#3061)
059fc64 build: apk add libreoffice24 (#3065)
3eaf65a feat: refactor ingest (#3009)
73739b3 docs: redirect to docs.unstructured.io on github pages (#3054)
acda4d0 fix: set skip_infer_tables explicitly in `test_partition_via_api_with_no_st...
6066a26 fix: update container link in README.md (#2889)
60f10fe Updated Weaviate Docker image url (auto PR by bot) (#2659)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

microsoft / rag-experiment-accelerator

Bump unstructured from 0.11.8 to 0.14.2 #576
