salsadigitalauorg / merlin-framework

Merlin - migration framework
GNU General Public License v3.0

Fixes #39 #46

Closed: derklempner closed 5 years ago

stooit commented 5 years ago

Conversation from #45:

This should work in feature/issue-39-media-processor-xpath.

Presently, the fallback is recorded as a boolean flag, name_fallback_used, on each entity in the results array.

I originally had it write out a separate warning JSON file, e.g. documents-pdf-name-not-matched.json, but went with the flag in the end. A separate JSON file may be nicer than the flag, though, depending on how we expect errors to be collated, processed, and noticed. Does anyone have strong feelings about what it should do?

You can use this config to test both Type/Media and Processor/Media:

---
domain: https://www.inslm.gov.au

urls:
  - /submissions/statutory-deadline-reviews

entity_type: basic_page
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: //*[@id="page-title"]
    type: text
    processors:
      nl2br: { }
  -
    field: field_body
    selector: '//*[@class="grid-12 region region-content"]'
    type: long_text
    processors:
      -
        processor: media
        type: document-basic_page
        selector: //a[contains(@type, 'application/pdf')]
        file: ./@href
        name: ./@alt #text() # Use text() for working, something else to test no name
        # selector: //img
        # file: ./@src
        # name: ./@alt
        xpath: true
        data_entity_embed_display: view_mode:media.embed
        data_embed_button: media_entity_embed
  -
    field: field_pdf_list
    type: ordered
    selector: '//td[contains(@class, "views-field-field-attachment")]'
    available_items:
      -
        by:
          attr: class
          text: "views-field-field-attachment"
        fields:
          -
            field: download_list
            type: paragraph
            paragraph_type: document_pdf_list
            children:
              -
                field: field_paragraph_title
                type: static_value
                options:
                  value: Downloads
              -
                field: download_items
                type: paragraph
                paragraph_type: document_pdf
                children:
                  -
                    field: field_file_attachment
                    selector: './descendant::span[contains(@class, "file")]'
                    type: media
                    options:
                      file: ./a/@href
                      name: ./a/@alt #./a/text() # Use text() for working, something else to test no name
                      type: documents-pdf
                      xpath: true

I think writing this to the separate "reporting" JSON file is what we want here. Ideally we should keep a logical split between data results and reporting/logging.
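To make the data/reporting split concrete, here is a rough sketch in Python (not Merlin's actual PHP implementation). It assumes each entity carries the name_fallback_used flag described above. The function name split_results, the shape of the warning entries, and the `<type>-name-not-matched.json` naming (modelled on the documents-pdf-name-not-matched.json example) are illustrative assumptions only.

```python
import json


def split_results(entities, entity_type):
    """Split entities into clean data and a separate name-fallback report.

    Hypothetical helper: the names and output shapes here are assumptions
    for illustration, not part of Merlin.
    """
    data = []
    warnings = []
    for entity in entities:
        # Drop the flag so the data output carries only migrated values.
        clean = {k: v for k, v in entity.items() if k != "name_fallback_used"}
        data.append(clean)
        # Record enough detail in the report to locate the affected entity.
        if entity.get("name_fallback_used"):
            warnings.append({"uuid": entity.get("uuid"), "file": entity.get("file")})
    # Assumed naming convention, following the example file name in this thread.
    report_name = f"{entity_type}-name-not-matched.json"
    return {"data": data}, report_name, warnings


if __name__ == "__main__":
    sample = [
        {"name": "Report", "file": "a.pdf", "uuid": "1", "name_fallback_used": False},
        {"name": "b.pdf", "file": "b.pdf", "uuid": "2", "name_fallback_used": True},
    ]
    data, report_name, warnings = split_results(sample, "documents-pdf")
    print(report_name)
    print(json.dumps(warnings, indent=4))
```

With this arrangement the data file stays free of logging concerns, and the report file only needs to be written (and checked) when the warnings list is non-empty.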

I've tested this locally and everything works fine when the fallback is used. However, I see bizarre results without the fallback, e.g. using this basic config:

---
domain: https://www.inslm.gov.au

urls:
  - /submissions/statutory-deadline-reviews

entity_type: basic_page
mappings:
  -
    field: alias
    type: alias
  -
    field: title
    selector: //*[@id="page-title"]
    type: text
    processors:
      nl2br: { }
  -
    field: field_body
    selector: '//*[@class="grid-12 region region-content"]'
    type: long_text
    processors:
      -
        processor: media
        type: document-basic_page
        selector: //a[contains(@type, 'application/pdf')]
        file: ./@href
        name: ./text()
        xpath: true
        data_entity_embed_display: view_mode:media.embed
        data_embed_button: media_entity_embed

The result contains an empty object for name:

{
    "data": [
        {
            "name": {},
            "file": "https:\/\/www.inslm.gov.au\/sites\/default\/files\/submissions\/submission-rule-of-law-institute-australia.pdf",
            "uuid": "e39a3ebf-9b28-30e5-ba20-4d82ad3ca1df",
            "alt": null,
            "name_fallback_used": false
        },
        ...