microsoft / rag-experiment-accelerator

The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and RAG pattern.
https://github.com/microsoft/rag-experiment-accelerator
Other
197 stars 72 forks source link

Bump unstructured from 0.15.13 to 0.16.3 #799

Closed dependabot[bot] closed 3 weeks ago

dependabot[bot] commented 1 month ago

Bumps unstructured from 0.15.13 to 0.16.3.

Release notes

Sourced from unstructured's releases.

0.16.3

Enhancements

Features

Fixes

  • V2 elements without first parent ID can be parsed
  • Fix missing elements when layout element parsed in V2 ontology
  • updated unstructured-inference to be 0.8.1 in requirements/extra-pdf-image.in

0.16.2

Enhancements

Features

  • Whitespace-invariant CCT distance metric. CCT Levenshtein distance for strings is by default computed with standardized whitespaces.

Fixes

  • Fixed retry config settings for partition_via_api function If the SDK's default retry config is not set the retry config getter function does not fail anymore.

0.16.1

Enhancements

  • Bump unstructured-inference to 0.7.39 and upgrade other dependencies
  • Round coordinates Round coordinates when computing bounding box overlaps in pdfminer_processing.py to nearest machine precision. This can help reduce underterministic behavior from machine precision that affects which bounding boxes to combine.
  • Request retry parameters in partition_via_api function. Expose retry-mechanism related parameters in the partition_via_api function to allow users to configure the retry behavior of the API requests.

Features

  • Parsing HTML to Unstructured Elements and back

Fixes

  • Remove unsupported chipper model
  • Rewrite of partition.email module and tests. Use modern Python stdlib email module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
  • Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
  • Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
  • Minify text_as_html from XLSX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
  • Minify text_as_html from CSV. Previously .metadata.text_as_html for CSV tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
  • Minify text_as_html from PPTX. Previously .metadata.text_as_html for PPTX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text and structure.

0.16.0

Enhancements

... (truncated)

Changelog

Sourced from unstructured's changelog.

0.16.3

Enhancements

Features

Fixes

  • V2 elements without first parent ID can be parsed
  • Fix missing elements when layout element parsed in V2 ontology
  • updated unstructured-inference to be 0.8.1 in requirements/extra-pdf-image.in

0.16.2

Enhancements

Features

  • Whitespace-invariant CCT distance metric. CCT Levenshtein distance for strings is by default computed with standardized whitespaces.

Fixes

  • Fixed retry config settings for partition_via_api function If the SDK's default retry config is not set the retry config getter function does not fail anymore.

0.16.1

Enhancements

  • Bump unstructured-inference to 0.7.39 and upgrade other dependencies
  • Round coordinates Round coordinates when computing bounding box overlaps in pdfminer_processing.py to nearest machine precision. This can help reduce underterministic behavior from machine precision that affects which bounding boxes to combine.
  • Request retry parameters in partition_via_api function. Expose retry-mechanism related parameters in the partition_via_api function to allow users to configure the retry behavior of the API requests.

Features

  • Parsing HTML to Unstructured Elements and back

Fixes

  • Remove unsupported chipper model
  • Rewrite of partition.email module and tests. Use modern Python stdlib email module interface to parse email messages and attachments. This change shortens and simplifies the code, and makes it more robust and maintainable. Several historical problems were remedied in the process.
  • Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
  • Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
  • Minify text_as_html from XLSX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
  • Minify text_as_html from CSV. Previously .metadata.text_as_html for CSV tables was "bloated" with whitespace and noise elements introduced by pandas that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
  • Minify text_as_html from PPTX. Previously .metadata.text_as_html for PPTX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text and structure.

0.16.0

Enhancements

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
dependabot[bot] commented 3 weeks ago

Superseded by #810.