htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Allow retaining JSON-LD column names #47

Open bmschmidt opened 2 years ago

bmschmidt commented 2 years ago

I have a personal preference here for using the LOD versions of the column names. For example, use htBibUrl rather than ht_bib_url. The rationale is that as of EF 2.0, these are LOD identifiers, not just column names.

It would be nice to have an option, or even a default, to use those.

Related is whether the serial number within books should be called seq (as in the json-ld) or renamed page (as in HTRCFR). I hear that the Google METS data may be leaving Hathi, which opens up the possibility that actual page numbers (like, the numbers on the corners of the book) might get out at some point.

I have no idea why seq is a string like '000000001' instead of an integer.

One option would be to use the original LOD names internally, and to move the renaming from the json parsing to the last handoff. This way old pandas code would keep working, but raw representations could use the LOD names.
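A minimal sketch of what "renaming at the last handoff" could look like, assuming the PEP 8 names can be derived mechanically from the camelCase LOD names. The helper names `camel_to_snake` and `rename_at_handoff` and the sample record are hypothetical, not part of htrc-feature-reader; only the `htBibUrl` / `ht_bib_url` pair comes from this thread.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase name such as 'htBibUrl' to 'ht_bib_url'."""
    # Insert an underscore at each lowercase/digit -> uppercase boundary,
    # then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def rename_at_handoff(records):
    """Return copies of the records with PEP 8 style keys.

    The input records keep their original LOD keys; renaming happens
    only here, at the point where data is handed to the caller.
    """
    return [{camel_to_snake(key): value for key, value in rec.items()}
            for rec in records]


# Hypothetical raw record that keeps the LOD name internally.
raw_records = [{"htBibUrl": "http://example.org/bib/1"}]
public_records = rename_at_handoff(raw_records)
```

With a pandas frame, the same idea could be expressed by passing a `{old: new}` mapping to `DataFrame.rename(columns=...)` just before returning the frame.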

organisciak commented 2 years ago

The intent of this library is Python scaffolding for working with EF. Part of that is following Python convention, including PEP 8, which expects lower_case_with_underscores for method names and variables. https://peps.python.org/pep-0008/#method-names-and-instance-variables

I understand your motivation for camelCase, but it doesn't seem like a strong enough case to justify the work and potential compatibility issues associated with a deviation from the original design decisions.

Regarding seq, that's a question for @borice. I often cast it to int, but the reason it's a string to begin with is lost to my memory.
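For reference, the cast mentioned here is a one-liner, since Python's `int()` accepts zero-padded decimal strings. The sketch below also shows going back to the padded form; the helper names are hypothetical, and the padding width is simply taken from the example value quoted earlier in this thread rather than from any specification.

```python
def seq_to_int(seq: str) -> int:
    """Convert a zero-padded sequence string to an integer."""
    return int(seq)


def int_to_seq(number: int, width: int) -> str:
    """Zero-pad an integer back to a fixed-width sequence string."""
    return str(number).zfill(width)


original = "000000001"  # example value quoted in this thread
as_int = seq_to_int(original)
round_trip = int_to_seq(as_int, width=len(original))
```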

bmschmidt commented 2 years ago

Agreed on full PEP compliance: my thought though is that these aren't actually method names or variables. Or is your thinking that because pandas columns are often accessed with syntax like df.ht_bib_url, the PEP 8 rules should apply? (I assume that there must be some pandas-specific conventions out there).

I am thinking that if there are underpowered arrow methods, those would return the original linked data names, while pandas frames would return the PEP compliant names preserving back-compatibility. This of course would make it slightly harder to turn old pandas code into new arrow code.
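A rough sketch of that split, using plain lists of column names rather than real arrow tables or pandas frames. The two mapping entries are the pairs mentioned in this thread (`htBibUrl` / `ht_bib_url` and `seq` / `page`); the function names are hypothetical. The inverse mapping is one way to soften the cost of porting old pandas code.

```python
# Name pairs mentioned in this thread; a real mapping would cover more columns.
LOD_TO_PEP8 = {"htBibUrl": "ht_bib_url", "seq": "page"}
PEP8_TO_LOD = {pep8: lod for lod, pep8 in LOD_TO_PEP8.items()}


def arrow_style_columns(columns):
    """Arrow-facing output: keep the original linked data names."""
    return list(columns)


def pandas_style_columns(columns):
    """pandas-facing output: PEP 8 names, for backward compatibility."""
    return [LOD_TO_PEP8.get(name, name) for name in columns]


def port_column_name(old_name):
    """Map a name used in old pandas code to its linked data equivalent."""
    return PEP8_TO_LOD.get(old_name, old_name)


internal = ["htBibUrl", "seq"]
arrow_names = arrow_style_columns(internal)
pandas_names = pandas_style_columns(internal)
```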