isamplesorg / isamples_inabox

Provides functionality intermediate to a collection and central

curationResponsibility doesn't roundtrip through export service back to re-imported solr index #401

Open dannymandel opened 3 months ago

dannymandel commented 3 months ago

It looks like curationResponsibility is missing from the regenerated solr index on my local Mac. Upon a little digging, it isn't present in the exported .jsonl file on henry.

Record id: http://localhost:8984/solr/isb_core_records/select?indent=true&q.op=OR&q=id%3A%22ark%3A%2F65665%2F300008335-8d74-4c3f-873c-a9d8b4b3d6a8%22&useParams=

"curation_responsibility":"USNM http://grbio.org/cool/142r-0w94",

curl "https://henry.cyverse.org/smithsonian/sitemaps/sitemap-0.jsonl" | grep -i a9d8b4b3d6a8

{"sample_identifier": "ark:/65665/300008335-8d74-4c3f-873c-a9d8b4b3d6a8", "label": "Bathymodiolus sp. AJ9VQ03", "description": "basisOfRecord: MaterialSample | occurrenceRemarks: Order: 2885; Box Number: MBARI_0036; Box Position: F/5; MBARI Note: SIO Box 6; Sinatra | catalogNumber: USNM 1464106 | recordNumber: A3120-(B3-5) | fieldNumber: AL-3120 | type: PhysicalObject | individualCount: 1 | disposition: in collection | startDayOfYear: 191 | endDayOfYear: 191", "source_collection": "SMITHSONIAN", "has_specimen_category": [{"identifier": "https://w3id.org/isample/vocabulary/specimentype/1.0/organismpart"}], "has_material_category": [{"identifier": "https://w3id.org/isample/vocabulary/material/1.0/organicmaterial"}], "has_context_category": [{"identifier": "https://w3id.org/isample/vocabulary/sampledfeature/1.0/marinewaterbodybottom"}], "informal_classification": ["Bathymodiolus sp."], "keywords": [{"keyword": "Animalia"}, {"keyword": "Bathymodiolus sp."}, {"keyword": "Bivalvia"}, {"keyword": "IZ"}, {"keyword": "Mollusca"}, {"keyword": "Mytilidae"}, {"keyword": "North Atlantic Ocean"}], "produced_by": {"identifier": "", "label": "", "result_time": "1997-07-10", "has_feature_of_interest": "", "sampling_site": {"description": "verbatimLatitude: 37-17.629N | verbatimLongitude: 32-16.532W", "label": "MID-ATLANTIC RIDGE - Lucky Strike", "place_name": ["MID-ATLANTIC RIDGE - Lucky Strike", "North Atlantic Ocean"], "sample_location": {"elevation": "", "latitude": 37.2938, "longitude": -32.2755}}, "responsibility": [{"role": "recordedBy", "name": " R. Vrijenhoek et al."}]}}

dannymandel commented 3 months ago

It looks like the problem is this code:

    def _add_responsibilities_to_container(self,
                                           rec: dict,
                                           responsibility_key_solr: str,
                                           responsibility_key: str,
                                           container: dict):
        responsibilities = rec.get(responsibility_key_solr, [])
        responsibility_dicts = []
        for responsibility in responsibilities:
            pieces = responsibility.split(":")
            responsibility_dicts.append({METADATA_ROLE: pieces[0], METADATA_NAME: pieces[1]})
        if len(responsibility_dicts) > 0:
            container[responsibility_key] = responsibility_dicts

    def _curation_dict(self, rec: dict) -> dict:
        curation_dict: dict = {}
        self._add_to_dict(curation_dict, METADATA_LABEL, rec, SOLR_CURATION_LABEL)
        self._add_to_dict(curation_dict, METADATA_DESCRIPTION, rec, SOLR_CURATION_DESCRIPTION)
        self._add_to_dict(curation_dict, METADATA_CURATION_LOCATION, rec, SOLR_CURATION_LOCATION)
        self._add_responsibilities_to_container(rec, SOLR_CURATION_RESPONSIBILITY, METADATA_RESPONSIBILITY, curation_dict)
        access_constraints = rec.get(SOLR_CURATION_ACCESS_CONSTRAINTS, "").split("|")
        if len(access_constraints) > 0:
            curation_dict[METADATA_ACCESS_CONSTRAINTS] = access_constraints
        return curation_dict

Note that it's trying to split the string value based on a role:value format, and this record doesn't conform to that. It's unclear what we should do in this case.
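
For illustration, here is a small standalone sketch of what `str.split(":")` does with this record's value. It is not taken from the repository. The second string, with a `curator:` prefix, is a hypothetical example of a conforming value, not something observed in the index.

```python
# Value as it appears in the local Solr record above.
naked = "USNM http://grbio.org/cool/142r-0w94"
naked_pieces = naked.split(":")
print(naked_pieces)
# ['USNM http', '//grbio.org/cool/142r-0w94']

# Hypothetical conforming value (assumed "role:name" shape).
prefixed = "curator:USNM http://grbio.org/cool/142r-0w94"
prefixed_pieces = prefixed.split(":")
print(prefixed_pieces)
# ['curator', 'USNM http', '//grbio.org/cool/142r-0w94']
```

In both cases the colon inside the URL is treated as a separator, so taking `pieces[0]` and `pieces[1]` would either mislabel the role or truncate the name.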

dannymandel commented 3 months ago

It looks like the value in the solr index doesn't match the current code. The current code is doing this:

    def curation_responsibility(self) -> list[dict[str, str]]:
        curation_str = f"{self.source_record.get('institutionCode')} {self.source_record.get('institutionID')}"
        return [Transformer._responsibility_dict("curator", curation_str)]

And the responsibility dict just does:

    @staticmethod
    def _responsibility_dict(
        role: str, name: str
    ):
        return {METADATA_ROLE: role, METADATA_NAME: name}

So in the solr index it should be:

curator: "USNM http://grbio.org/cool/142r-0w94"

dannymandel commented 2 months ago

So I think we can default the role to curator if it's a naked string.
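
A minimal sketch of that idea, under stated assumptions: `parse_responsibility` and `DEFAULT_ROLE` are hypothetical names, the constant values `"role"` and `"name"` are inferred from the keys in the exported record, and the test for a "naked" string (role part contains whitespace, or the remainder starts with `//`) is a heuristic of mine rather than anything the codebase defines. It also strips surrounding whitespace, which the current code does not do.

```python
METADATA_ROLE = "role"    # assumed value, inferred from the exported JSON keys
METADATA_NAME = "name"    # assumed value, inferred from the exported JSON keys
DEFAULT_ROLE = "curator"  # hypothetical default for values with no role prefix


def parse_responsibility(value: str, default_role: str = DEFAULT_ROLE) -> dict:
    """Parse a 'role:name' string, falling back to a default role."""
    # Split on the first colon only, so colons inside the name (URLs) survive.
    role, sep, name = value.partition(":")
    role = role.strip()
    has_role = (
        bool(sep)
        and role != ""
        and not any(ch.isspace() for ch in role)  # roles assumed to be one token
        and not name.startswith("//")             # "http://..." is not a role
    )
    if has_role:
        return {METADATA_ROLE: role, METADATA_NAME: name.strip()}
    # Naked string: keep the whole value as the name.
    return {METADATA_ROLE: default_role, METADATA_NAME: value.strip()}


print(parse_responsibility("USNM http://grbio.org/cool/142r-0w94"))
# {'role': 'curator', 'name': 'USNM http://grbio.org/cool/142r-0w94'}
print(parse_responsibility("recordedBy: R. Vrijenhoek et al."))
# {'role': 'recordedBy', 'name': 'R. Vrijenhoek et al.'}
```

A helper along these lines could replace the `split(":")` call inside `_add_responsibilities_to_container`. Whether a whitespace check is the right way to detect a missing role is an open question.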