isamplesorg / isamples_inabox

Provides functionality intermediate to a collection and central

OpenContext project data suddenly gone missing in solr #340

Closed dannymandel closed 9 months ago

dannymandel commented 9 months ago

When I cut over to the new solr index, I found that the following two OpenContext projects seem to have disappeared:

Avkat Archaeological Project
Giza Botanical Database

It's unclear where they went.

dannymandel commented 9 months ago

Per discussion with @ekansa, this is expected. @ekansa, if you agree with this assessment, could you kindly close the issue? Thank you!

ekansa commented 9 months ago

Hm! Interesting, we should have some records from those projects in Open Context.

Here's the base GET request to the Open Context API used by iSamples:

https://opencontext.org/query/.json?attributes=iSamples&cat=oc-gen-cat-sample-col%7C%7Coc-gen-cat-bio-subj-ecofact%7C%7Coc-gen-cat-object&cursorMark=%2a&response=metadata,uri-meta&sort=updated--desc,context--asc&type=subjects&rows=100

If I add a filter to limit the results to records in the Giza Botanical Database, the URL would be:

https://opencontext.org/query/.json?attributes=iSamples&cat=oc-gen-cat-sample-col%7C%7Coc-gen-cat-bio-subj-ecofact%7C%7Coc-gen-cat-object&cursorMark=%2a&response=metadata,uri-meta&sort=updated--desc,context--asc&type=subjects&rows=100&proj=131-giza-botanical-database

That returns a response with (paged) JSON for 34459 records.

Similarly, for Avkat:

https://opencontext.org/query/.json?attributes=iSamples&cat=oc-gen-cat-sample-col%7C%7Coc-gen-cat-bio-subj-ecofact%7C%7Coc-gen-cat-object&cursorMark=%2a&response=metadata,uri-meta&sort=updated--desc,context--asc&type=subjects&rows=100&proj=117-avkat-archaeological-project

That returns a paged response covering 43448 records.

So the Open Context API should be giving records for those projects. I wonder if there's something missing or unexpected in the records for these projects which causes them to be passed over in isamples_inabox?
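For anyone who wants to repeat this check for other projects, here is a minimal sketch of building the same query with a `proj` filter. The parameter values are copied from the URLs above; the names `BASE_PARAMS` and `project_query_url` are made up for illustration and are not taken from isamples_inabox.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://opencontext.org/query/.json"

# Query parameters copied from the base GET request quoted above.
BASE_PARAMS = {
    "attributes": "iSamples",
    "cat": "oc-gen-cat-sample-col||oc-gen-cat-bio-subj-ecofact||oc-gen-cat-object",
    "cursorMark": "*",
    "response": "metadata,uri-meta",
    "sort": "updated--desc,context--asc",
    "type": "subjects",
    "rows": "100",
}


def project_query_url(project_slug: Optional[str] = None) -> str:
    """Return the query URL, optionally limited to one project slug.

    Hypothetical helper: it only assembles the URL and does not send a request.
    """
    params = dict(BASE_PARAMS)
    if project_slug:
        # e.g. "131-giza-botanical-database" or "117-avkat-archaeological-project"
        params["proj"] = project_slug
    return BASE_URL + "?" + urlencode(params)
```

`urlencode` percent-encodes the comma and uses uppercase hex digits, so the string will not match the URLs above character for character, but it should decode to the same parameters.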

dannymandel commented 9 months ago

Thanks @ekansa! It’s possible we have the records but the search is broken. I’ll need to investigate.

ekansa commented 9 months ago

I also experimented and asked the API to return some additional information on facet counts:

https://opencontext.org/query/.json?attributes=iSamples&cat=oc-gen-cat-sample-col%7C%7Coc-gen-cat-bio-subj-ecofact%7C%7Coc-gen-cat-object&cursorMark=%2a&response=prop-facet,metadata,uri-meta&sort=updated--desc,context--asc&type=subjects&rows=100

The result is as expected: there are facets for the Avkat and Giza Botanical projects, and both have the expected counts.

{
  "id": "https://opencontext.org/query/?attributes=iSamples&cat=oc-gen-cat-sample-col%7C%7Coc-gen-cat-bio-subj-ecofact%7C%7Coc-gen-cat-object&cursorMark=%2A&proj=117-avkat-archaeological-project&response=prop-facet%2Cmetadata%2Curi-meta&rows=100&type=subjects",
  "json": "https://opencontext.org/query/.json?attributes=iSamples&cat=oc-gen-cat-sample-col%7C%7Coc-gen-cat-bio-subj-ecofact%7C%7Coc-gen-cat-object&cursorMark=%2A&proj=117-avkat-archaeological-project&response=prop-facet%2Cmetadata%2Curi-meta&rows=100&type=subjects",
  "rdfs:isDefinedBy": "https://opencontext.org/projects/02b55e8c-e9b1-49e5-8edf-0afeea10e2be",
  "slug": "117-avkat-archaeological-project",
  "label": "Avkat Archaeological Project",
  "count": 43448
},

and

{
  "id": "https://opencontext.org/query/?attributes=iSamples&cat=oc-gen-cat-sample-col%7C%7Coc-gen-cat-bio-subj-ecofact%7C%7Coc-gen-cat-object&cursorMark=%2A&proj=131-giza-botanical-database&response=prop-facet%2Cmetadata%2Curi-meta&rows=100&type=subjects",
  "json": "https://opencontext.org/query/.json?attributes=iSamples&cat=oc-gen-cat-sample-col%7C%7Coc-gen-cat-bio-subj-ecofact%7C%7Coc-gen-cat-object&cursorMark=%2A&proj=131-giza-botanical-database&response=prop-facet%2Cmetadata%2Curi-meta&rows=100&type=subjects",
  "rdfs:isDefinedBy": "https://opencontext.org/projects/10aa84ad-c5de-4e79-89ce-d83b75ed72b5",
  "slug": "131-giza-botanical-database",
  "label": "Giza Botanical Database",
  "count": 34459
},

So we may need to dig into how iSamples processes records from these projects.
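A rough sketch of how facet entries like the two above could be compared against what ends up in the index. It assumes the facet options have already been pulled out of the response into a list of dicts with `slug` and `count` keys (where they sit in the full response is not shown here), and that per-project counts from the iSamples side are available as a plain dict. Both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping


def counts_by_slug(facet_options: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Map each project slug to its record count, skipping incomplete entries."""
    return {
        str(opt["slug"]): int(opt["count"])
        for opt in facet_options
        if "slug" in opt and "count" in opt
    }


def missing_projects(expected: Mapping[str, int], indexed: Mapping[str, int]) -> List[str]:
    """Return slugs that have records upstream but none in the index."""
    return sorted(
        slug for slug, count in expected.items()
        if count > 0 and indexed.get(slug, 0) == 0
    )
```

With the two entries above as `expected` and an empty dict for `indexed`, `missing_projects` would list both project slugs.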

ekansa commented 9 months ago

If I do a keyword search for the data authors, I see their records in iSamples central:

(Giza related):

https://central.isample.xyz/isamples_central/ui/#/?searchFields=JOzBRy8m48J_IkHpa59wGRsLnv67IbGXeGWjzePQMkp8tbH15V-TTIY1aptVs3Ex-pz5S4HI7-gJA6FiEAijaeagL_r5d8WGz4yfda2-D2YCx-eNg4ro2KGOlLFN_9auym8rM--oEJ2-93R1crQZcLuE0MDBNDepVe4xuRF1o9THI6nKWyRKThRk-rxuDYaAqT3zwt4nVX-csVTkGbKQPrZ3jq6_592DF4JtvzO75iCwWvqmBVD34-Lj3HGn7nRRFf8UXMgzMQncwrMvs9MgkbOt

and (Avkat related):

https://central.isample.xyz/isamples_central/ui/#/?searchFields=JKzTYy8m4Fmh9SyLUleYlfqVhrMq70TIP4ss6bYRacozok5_lqqzgq_PcTdCpEv_5S4HIStK9r72s75IMIGJLAhwGPI0uFMv44r8lZKe73_G5QYDCCUVCtYZxkdBSE68KBDrTR84nZV3nZlJQZJpsy5ZQ8aBUy4jr2Ty1HZvClAH6rMWCOPTRFtroFkOPfbWwFxnE1jFXZBxNerK6UiCLxZMwxS0ie47w_xTESyYsDfW9voM_j04sQs00gR3ORkdqpDGhJUhVB7EYzLteSfx-GS0

So it looks like you have the data indexed?

dannymandel commented 9 months ago

Thanks Eric! That’s super helpful. I suspect something broke as part of our new metadata format and we lost the project info in the iSamples index.

dannymandel commented 9 months ago

So I think this is the issue:

    def produced_by_label(self) -> str:
        return self.source_record.get("project label", Transformer.NOT_PROVIDED)

And when I look at one of the records from "Avkat Archaeological Project", I see this:

    "project":
    {
        "id": "http://opencontext.org/projects/02b55e8c-e9b1-49e5-8edf-0afeea10e2be",
        "label": "Avkat Archaeological Project"
    }

So my guess is that the format of the JSON changed (the project is now a nested "project" object rather than a flat "project label" key) and our transformer didn't keep up, which is why the project info is missing in solr. Similarly, I think we need to update this:

    def produced_by_description(self) -> str:
        return self.source_record.get("project href", Transformer.NOT_PROVIDED)

to return the id key out of the project dictionary.
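A minimal sketch of what the updated lookups might look like, assuming every record carries the nested `project` dictionary shown above. The class name, the `_project_field` helper, and the `NOT_PROVIDED` value are stand-ins for illustration; the real transformer class and `Transformer.NOT_PROVIDED` may differ.

```python
from typing import Any, Dict

# Stand-in for Transformer.NOT_PROVIDED; the real value is assumed, not confirmed.
NOT_PROVIDED = "Not Provided"


class OpenContextTransformerSketch:
    """Hypothetical stand-in for the real Open Context transformer."""

    def __init__(self, source_record: Dict[str, Any]) -> None:
        self.source_record = source_record

    def _project_field(self, key: str) -> str:
        # Read from the nested "project" dictionary, falling back to
        # NOT_PROVIDED when the dictionary or the key is missing.
        project = self.source_record.get("project")
        if isinstance(project, dict):
            return project.get(key, NOT_PROVIDED)
        return NOT_PROVIDED

    def produced_by_label(self) -> str:
        return self._project_field("label")

    def produced_by_description(self) -> str:
        return self._project_field("id")
```

Routing both methods through one helper keeps the fallback behavior identical if the record format changes again.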