The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and RAG pattern.
Add (but do not install) a new post-partitioning decorator to handle metadata added for all file-types, like .filename, .filetype and .languages. This will be installed in a closely following PR to replace the four currently being used for this purpose.
Fixes
Update Python SDK usage in partition_via_api. Make a minor syntax change to ensure forward compatibility with the upcoming 0.26.0 Python SDK.
Remove "unused" date_from_file_object parameter. As part of simplifying partitioning parameter set, remove date_from_file_object parameter. A file object does not have a last-modified date attribute so can never give a useful value. When a file-object is used as the document source (such as in Unstructured API) the last-modified date must come from the metadata_last_modified argument.
Fix occasional KeyError when mapping parent ids to hash ids. Occasionally the input elements into assign_and_map_hash_ids can contain duplicated element instances, which lead to error when mapping parent id.
Allow empty text files. Fixes an issue where text files with only white space would fail to be partitioned.
Remove double-decoration for CSV, DOC, ODT partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (CSV and DOCX in this case); remove decoration from delegating partitioners.
Remove double-decoration for PPTX, TSV, XLSX, and XML partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
Remove double-decoration for HTML, EPUB, MD, ORG, RST, and RTF partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (HTML in this case); remove decoration from delegating partitioners.
Remove obsolete min_partition/max_partition args from TXT and EML. The legacy min_partition and max_partition parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from partition_text() and partition_email().
Remove double-decoration on EML and MSG. Refactor these partitioners to rely on the new @apply_metadata() decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
Remove double-decoration for PPT. Remove decorators from the delegating PPT partitioner.
Quick-fix CI error in auto test-filetype. Better fix to follow shortly.
Add (but do not install) a new post-partitioning decorator to handle metadata added for all file-types, like .filename, .filetype and .languages. This will be installed in a closely following PR to replace the four currently being used for this purpose.
Fixes
Update Python SDK usage in partition_via_api. Make a minor syntax change to ensure forward compatibility with the upcoming 0.26.0 Python SDK.
Remove "unused" date_from_file_object parameter. As part of simplifying partitioning parameter set, remove date_from_file_object parameter. A file object does not have a last-modified date attribute so can never give a useful value. When a file-object is used as the document source (such as in Unstructured API) the last-modified date must come from the metadata_last_modified argument.
Fix occasional KeyError when mapping parent ids to hash ids. Occasionally the input elements into assign_and_map_hash_ids can contain duplicated element instances, which lead to error when mapping parent id.
Allow empty text files. Fixes an issue where text files with only white space would fail to be partitioned.
Remove double-decoration for CSV, DOC, ODT partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (CSV and DOCX in this case); remove decoration from delegating partitioners.
Remove double-decoration for PPTX, TSV, XLSX, and XML partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner; remove decoration from delegating partitioners.
Remove double-decoration for HTML, EPUB, MD, ORG, RST, and RTF partitioners. Refactor these partitioners to use the new @apply_metadata() decorator and only decorate the principal partitioner (HTML in this case); remove decoration from delegating partitioners.
Remove obsolete min_partition/max_partition args from TXT and EML. The legacy min_partition and max_partition parameters were an initial rough implementation of chunking but now interfere with chunking and are unused. Remove those parameters from partition_text() and partition_email().
Remove double-decoration on EML and MSG. Refactor these partitioners to rely on the new @apply_metadata() decorator operating on partitioners they delegate to (TXT, HTML, and all others for attachments) and remove direct decoration from EML and MSG.
Remove double-decoration for PPT. Remove decorators from the delegating PPT partitioner.
Quick-fix CI error in auto test-filetype. Better fix to follow shortly.
Commits
6ba376a build(release): release commit for 0.15.14 (#3709)
2f496f8 fix(auto): quick fix for auto test failing in CI (#3715)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Bumps unstructured from 0.15.13 to 0.15.14.
Release notes
Sourced from unstructured's releases.
Changelog
Sourced from unstructured's changelog.
Commits
6ba376a
build(release): release commit for 0.15.14 (#3709)2f496f8
fix(auto): quick fix for auto test failing in CI (#3715)06c8523
rfctr(ppt): remove double-decoration (#3701)27fa2a3
add tests describing the behavior of set_element_hierarchy (#3700)718891a
rfctr(part): remove double-decoration 5 (#3692)4711a8d
rfctr(part): remove double-decoration 4 (#3690)9bd91a8
rfctr(part): remove double-decoration 3 (#3687)1709219
rfctr(part): remove double-decoration 2 (#3686)bba6026
rfctr(part): remove double-decoration 1 (#3685)c05e1ba
rfctr(meta): refine@apply
_metadata() decorator (#3667)Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting
@dependabot rebase
.Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show