sul-cidr / patent_data_extractor

1 stars 3 forks source link

Nomenclature Changes #4

Open simonwiles opened 4 years ago

simonwiles commented 4 years ago

@gmoore016

  1. Changing the main branch for the repo: I'd like to change the name of the branch from mastermain -- I'm going through and doing this for all my repos, but it might cause a issue with your fork, and I don't want to create extra problems for you. If you're willing to update the remote-tracking origins on your fork, I can send you command line instructions. Other options include making sure your additions are merged upstream and then nuking your repo and re-forking, or just leaving it as it is, for now at least.

  2. Changing some of the keywords in the YAML config: Now that these affordances are maturing a little, the initial keywords I chose don't seem terribly helpful, and I was thinking it might be good to replace them with something a bit more semantically useful. Something like this, perhaps?

    • enum_typeconstant_value?
    • enum_mapvalue_map?

    Or perhaps you have better ideas? Suggestions or other feedback very welcome!

  3. Distinguishing keys (XPath expressions) from keywords: Looking at the YAML files I think they'd be more readable if the (burgeoning vocabulary of) keywords were easier to distinguish from the XPath expressions that serve as the keys to the XML files.

    • __keyword__?
    • KEYWORD?
    • _keyword?

    I think I'm leaning towards __KEYWORD__, actually, so that something like

    ### Continuations
    subdoc-bibliographic-information/continuity-data/continuations/continuation-of/parent-child/parent:
      entity: related_document
      fields:
        # These values are fixed for all related documents found at this location
        document-id:
          - fieldname: type
            enum_type: continuation
          - fieldname: relationship
            enum_type: parent
    
        # These vary from record to record
        document-id/kind-code: kind_code
        document-id/doc-number: doc_number
        document-id/document-date: date
        document-id/country-code: country_code
        parent-status: status

    would become:

    ### Continuations
    subdoc-bibliographic-information/continuity-data/continuations/continuation-of/parent-child/parent:
      __ENTITY__: related_document
      __FIELDS__:
        # These values are fixed for all related documents found at this location
        document-id:
          - __FIELDNAME__: type
            __CONSTANT_VALUE__: continuation
          - __FIELDNAME__: relationship
            __CONSTANT_VALUE__: parent
    
        # These vary from record to record
        document-id/kind-code: kind_code
        document-id/doc-number: doc_number
        document-id/document-date: date
        document-id/country-code: country_code
        parent-status: status

    I'd welcome your input on whether this represents an improvement or not!

Feel free to respond here or on email if you have anything you'd like to comment on.

Thanks!

gmoore016 commented 4 years ago

@simonwiles

  1. Changing the main branch for the repo: That all sounds good to me! If you don't mind sending along the command line instructions, I don't see this being too much of an issue.

  2. Changing some of the keywords in the YAML config: This also sounds like a really good idea to me. The only other keyword I might change would be changing pk to something like primary_key just for the sake of descriptiveness, but I'm more than happy to follow your judgement on the details. Ping me with a list of conversions and I can bring the config files up-to-date.

  3. Distinguishing keys from keywords: I like your suggestion of __KEYWORD__, since as you say it very nicely distinguishes the keywords from the paths. I think this is superior to just KEYWORD as the earlier XML paths are also CAP_SNAKE, so we need something else for the keywords to distinguish themselves.

I'll start working on converting the current config files to use the YAML list format you discuss in your email; let me know when you resolve the above and I'll update them again with those changes.

simonwiles commented 4 years ago

Okay, cool, I've changed master to main on this repo, thanks. To do the same on your fork, the easiest way in this case is like this:

$ git branch -m master main
$ git push -u origin main

Then you'll just need to change the default branch on GitHub and delete the now-superfluous master -- you can do that here: https://github.com/gmoore016/patent_data_extractor/branches.

I've raised a PR for changing the notation for the keywords here: #7. In the end I decided I didn't like double-underscores, mainly because they meant that the actual text of the keyword lines up with indented keys on the next line, which didn't help with readability as much as I'd hoped. The PR uses <angle brackets> to surround keywords. I think this works -- one advantage is that angle brackets can't be used in XML tagnames, by definition. However, it's a PR because I wanted to give you a chance to consider it before I merge it in. Feedback is welcome here or there.

Apart from the pk<primary_key> change I've not altered other keywords yet. I'm considering changing the functionality of the enum_map affordance to make it a bit more generalizable, and it might impact the best choice of name.

gmoore016 commented 4 years ago

Alright, my fork should be all up to date.

That sounds good to me! Just give me a heads up with your changes and I can update the config files. I commented with the all-clear on #7 as well!