This should work in feature/issue-39-media-processor-xpath.
Currently, the fallback is recorded as a boolean flag, name_fallback_used, in the entity results array.
I originally had it write out a warning JSON file, e.g. documents-pdf-name-not-matched.json, but went with the flag in the end. A separate JSON file may be nicer than the flag, though, depending on how we expect errors to be collated, processed, and noticed. Does anyone have strong feelings about what it should do?
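To make the flag approach concrete, here is a minimal Python sketch (not the framework's actual code). It assumes the fallback name is the last path segment of the file URL, which this thread does not confirm; `resolve_media_name` and `build_entity` are hypothetical helpers, and only the `name_fallback_used` key comes from the description above.

```python
import posixpath
from urllib.parse import urlparse


def resolve_media_name(matched_name, file_url):
    """Return (name, name_fallback_used) for one media entity."""
    name = (matched_name or "").strip()
    if name:
        # The name selector matched something usable.
        return name, False
    # ASSUMPTION: fall back to the last path segment of the file URL.
    fallback = posixpath.basename(urlparse(file_url).path)
    return fallback, True


def build_entity(matched_name, file_url):
    """Build a result row, adding the flag only when the fallback was used."""
    name, used_fallback = resolve_media_name(matched_name, file_url)
    entity = {"name": name, "file": file_url}
    if used_fallback:
        entity["name_fallback_used"] = True
    return entity
```

With this shape, anything consuming the results has to scan every entity for the flag, which is the trade-off raised above.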
You can use this config to test both Type/Media and Processor/Media:
---
domain: https://www.inslm.gov.au
urls:
  - /submissions/statutory-deadline-reviews
entity_type: basic_page
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: //*[@id="page-title"]
    type: text
    processors:
      nl2br: { }
  -
    field: field_body
    selector: '//*[@class="grid-12 region region-content"]'
    type: long_text
    processors:
      -
        processor: media
        type: document-basic_page
        selector: //a[contains(@type, 'application/pdf')]
        file: ./@href
        name: ./@alt #text() # Use text() for working, something else to test no name
        # selector: //img
        # file: ./@src
        # name: ./@alt
        xpath: true
        data_entity_embed_display: view_mode:media.embed
        data_embed_button: media_entity_embed
  -
    field: field_pdf_list
    type: ordered
    selector: '//td[contains(@class, "views-field-field-attachment")]'
    available_items:
      -
        by:
          attr: class
          text: "views-field-field-attachment"
        fields:
          -
            field: download_list
            type: paragraph
            paragraph_type: document_pdf_list
            children:
              -
                field: field_paragraph_title
                type: static_value
                options:
                  value: Downloads
              -
                field: download_items
                type: paragraph
                paragraph_type: document_pdf
                children:
                  -
                    field: field_file_attachment
                    selector: './descendant::span[contains(@class, "file")]'
                    type: media
                    options:
                      file: ./a/@href
                      name: ./a/@alt #./a/text() # Use text() for working, something else to test no name
                      type: documents-pdf
                      xpath: true
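As a rough illustration of why swapping the name selector triggers the fallback, the Python sketch below runs the two name choices against a made-up fragment. The markup is hypothetical and only loosely modelled on the selectors in the config. Python's `xml.etree.ElementTree` does not implement `text()` or attribute steps, so the sketch emulates `./a/text()` and `./a/@alt` with `.text` and `.get("alt")`.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL markup, shaped to fit the selectors used in the config.
SAMPLE = """
<td class="views-field-field-attachment">
  <span class="file">
    <a href="/sites/default/files/example.pdf" type="application/pdf">Example submission</a>
  </span>
</td>
""".strip()


def select_name(span, use_text=True):
    """Emulate the two name selectors on a span.file element."""
    anchor = span.find("./a")
    if anchor is None:
        return None
    if use_text:
        # Emulates ./a/text(): the link text, or None if it is empty.
        return (anchor.text or "").strip() or None
    # Emulates ./a/@alt: this anchor has no alt attribute, so nothing matches.
    return anchor.get("alt")


root = ET.fromstring(SAMPLE)
span = root.find(".//span[@class='file']")
name_from_text = select_name(span, use_text=True)
name_from_alt = select_name(span, use_text=False)
```

Here `name_from_text` holds the link text while `name_from_alt` is `None`, which is the "no name" case the config comments set up for testing.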
I think having this output to the separate "reporting" JSON file is what we want here. Ideally we should keep a logical split between data results and reporting/logging.
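A minimal Python sketch of that split, under stated assumptions: entities arrive as dicts that may carry a `name_fallback_used` flag, the results file is named after the entity type (an assumption), and the warning file follows the `<type>-name-not-matched.json` pattern mentioned earlier in the thread. The function names are hypothetical.

```python
import json
import os


def split_results_and_warnings(entities):
    """Strip the fallback flag from the data and collect warnings separately."""
    results, warnings = [], []
    for entity in entities:
        clean = {k: v for k, v in entity.items() if k != "name_fallback_used"}
        results.append(clean)
        if entity.get("name_fallback_used"):
            warnings.append({
                "file": entity.get("file"),
                "warning": "name not matched, fallback used",
            })
    return results, warnings


def write_outputs(entities, out_dir, entity_type="documents-pdf"):
    """Write data and reporting to separate JSON files; return their paths."""
    results, warnings = split_results_and_warnings(entities)
    paths = {"results": os.path.join(out_dir, f"{entity_type}.json")}
    with open(paths["results"], "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)
    if warnings:
        # Only create the reporting file when there is something to report.
        paths["warnings"] = os.path.join(
            out_dir, f"{entity_type}-name-not-matched.json"
        )
        with open(paths["warnings"], "w", encoding="utf-8") as fh:
            json.dump(warnings, fh, indent=2)
    return paths
```

Keeping the reporting file optional means its mere presence signals that something needs attention, without touching the data results.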
I've tested this locally and it all works fine when using the fallback; however, I see bizarre results without the fallback, e.g. using this basic config:
---
domain: https://www.inslm.gov.au
urls:
  - /submissions/statutory-deadline-reviews
entity_type: basic_page
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: //*[@id="page-title"]
    type: text
    processors:
      nl2br: { }
  -
    field: field_body
    selector: '//*[@class="grid-12 region region-content"]'
    type: long_text
    processors:
      -
        processor: media
        type: document-basic_page
        selector: //a[contains(@type, 'application/pdf')]
        file: ./@href
        name: ./text()
        xpath: true
        data_entity_embed_display: view_mode:media.embed
        data_embed_button: media_entity_embed
Conversation from #45:
The result contains an empty object for name: