Bump unstructured from 0.15.13 to 0.16.3

Bumps unstructured from 0.15.13 to 0.16.3.

Release notes

0.16.3

Enhancements

Features

Fixes

V2 elements without first parent ID can be parsed

Fix missing elements when layout element parsed in V2 ontology

Update unstructured-inference to 0.8.1 in requirements/extra-pdf-image.in

0.16.2

Enhancements

Features

Whitespace-invariant CCT distance metric. By default, the CCT Levenshtein distance between strings is now computed on text with standardized whitespace.
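As a rough illustration of what a whitespace-invariant edit distance means, the sketch below standardizes whitespace and then computes a plain Levenshtein distance. The function names are hypothetical and this is not the library's own metric code; it only assumes that "standardized" means collapsing runs of whitespace to a single space.

```python
def standardize_whitespace(text: str) -> str:
    # Assumption: "standardized" means every whitespace run becomes one space
    # and leading/trailing whitespace is dropped.
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, kept one row at a time.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def whitespace_invariant_distance(a: str, b: str) -> int:
    # Hypothetical helper: compare the strings only after standardizing whitespace.
    return levenshtein(standardize_whitespace(a), standardize_whitespace(b))
```

With this sketch, two strings that differ only in spacing or line breaks score a distance of zero, while the raw distance between them would be positive.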

Fixes

Fixed retry config settings for the partition_via_api function. If the SDK's default retry config is not set, the retry config getter function no longer fails.

0.16.1

Enhancements

Bump unstructured-inference to 0.7.39 and upgrade other dependencies

Round coordinates. Round coordinates when computing bounding box overlaps in pdfminer_processing.py to nearest machine precision. This can help reduce nondeterministic behavior from machine precision that affects which bounding boxes to combine.

Request retry parameters in partition_via_api function. Expose retry-mechanism related parameters in the partition_via_api function to allow users to configure the retry behavior of the API requests.
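To see why the coordinate-rounding entry above matters, here is a minimal sketch of how floating-point noise can flip an overlap decision and how rounding first removes it. The helper names and the choice of six decimal places are assumptions for illustration, not the code in pdfminer_processing.py.

```python
def rounded_box(box, ndigits=6):
    # box is (x1, y1, x2, y2); ndigits is an illustrative precision choice.
    return tuple(round(value, ndigits) for value in box)


def overlap_area(a, b):
    # Area of the intersection of two axis-aligned boxes, 0.0 if they are disjoint.
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def boxes_overlap(a, b, ndigits=6):
    # Round both boxes before comparing so tiny float errors cannot decide the result.
    return overlap_area(rounded_box(a, ndigits), rounded_box(b, ndigits)) > 0


left = (0, 0, 0.1 + 0.2, 1)   # right edge is 0.30000000000000004
right = (0.3, 0, 1, 1)        # left edge is 0.3, so the boxes only touch

print(overlap_area(left, right) > 0)  # raw floats report a sliver of overlap
print(boxes_overlap(left, right))     # rounded coordinates report none
```

Without rounding, whether the two touching boxes count as overlapping depends on how their edges happened to be computed, which is the kind of instability the entry describes.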

Features

Parsing HTML to Unstructured Elements and back

Fixes

Remove unsupported chipper model

Rewrite of partition.email module and tests. Use modern Python stdlib email module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
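For readers unfamiliar with the modern stdlib email interface mentioned here, a minimal sketch follows. It uses only the standard library (`email.message_from_bytes` with `policy.default`); the `summarize_email` helper is hypothetical and is not how partition.email is structured.

```python
from email import message_from_bytes, policy


def summarize_email(raw: bytes) -> dict:
    # policy.default selects the modern EmailMessage API instead of the legacy one.
    message = message_from_bytes(raw, policy=policy.default)
    # get_body() picks the best body part; prefer plain text, fall back to HTML.
    body = message.get_body(preferencelist=("plain", "html"))
    return {
        "subject": message["subject"],
        "body": body.get_content() if body is not None else "",
        # iter_attachments() yields the non-body parts of a multipart message.
        "attachments": [part.get_filename() for part in message.iter_attachments()],
    }
```

The modern interface handles header decoding, body selection, and attachment iteration directly, which is plausibly where much of the simplification comes from.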

Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.

Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.

Minify text_as_html from XLSX. Previously .metadata.text_as_html for XLSX tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.

Minify text_as_html from CSV. Previously .metadata.text_as_html for CSV tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.

Minify text_as_html from PPTX. Previously .metadata.text_as_html for PPTX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text and structure.
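The four minification entries above share one idea: remove layout whitespace from table HTML while keeping every piece of text. A minimal regex-based sketch of that idea is shown below; `minify_table_html` is a hypothetical name, and the library's actual approach may differ (for example, it may also drop attributes or wrapper elements).

```python
import re


def minify_table_html(html: str) -> str:
    # Collapse every whitespace run (including newlines and indentation) to one space.
    html = re.sub(r"\s+", " ", html.strip())
    # Drop the space left directly after a tag closes ...
    html = re.sub(r">\s+", ">", html)
    # ... and directly before the next tag opens, so only text-internal spaces remain.
    html = re.sub(r"\s+<", "<", html)
    return html


pretty = """
<table>
  <tr>
    <td>  Quarterly   revenue </td>
    <td> 42 </td>
  </tr>
</table>
"""
print(minify_table_html(pretty))
# <table><tr><td>Quarterly revenue</td><td>42</td></tr></table>
```

Shorter HTML for the same text means fewer characters counted against chunk size limits, which is the over-chunking concern the entries mention.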

0.16.0

Enhancements

... (truncated)

Commits

340a07f [Merge] release to 0.16.3 (#3755)
5a91f0c Fix layout parsing (#3754)
2417f8e Fix when parent id is none for first element in v2 notion: (#3752)
9835fe4 set version 0.16.2 (#3748)
aa5935b Ml 384/whitespaces in cct (#3747)
bdfcc14 fix: fix partition_via_api retry mechanism when the default SDK's retry confi...
0b4c72a Set version to 0.16.1 (#3745)
03a3ed8 Add parsing HTML to unstructured elements (#3732)
6bceac1 feat: expose retry params in partition via api (#3724)
a11ad22 bump unstructured-inference (#3711)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

microsoft / rag-experiment-accelerator

Bump unstructured from 0.15.13 to 0.16.3 #799
