Update Edurep connection

fako commented 1 year ago

As of July 13 2023 this ticket has been placed on hold. See the comments at the bottom for how we could pick it up again.

Most comments on this story are a bit old and only here for future reference. They are not necessarily relevant before March 9th.

Edurep wants us to use "search" instead of "harvest". This would also use JSON instead of XML and be a huge performance boost for getting Edurep metadata. Currently we download 1.8GB of XML data to extract about 2,000 materials. This is not as bad as it sounds, because usually we take a delta and only download a handful of materials from Edurep at a time. However, during our monthly refreshes the system takes a hit. Apart from a huge drop in data size we'll benefit from faster extraction, because the XML extractor methods are a lot slower than the JSON extractor methods (both are on the ExtractProcessor class).

Things that need to be done:

[x] Create a HttpResource that harvests the following: https://wszoeken.edurep.kennisnet.nl/jsonsearch?query=%2A%20AND%20about.repository%20exact%20l4l%20AND%20%28schema%3AeducationalLevel.schema%3AtermCode%20exact%20bbbd99c6-cf49-4980-baed-12388f8dcff4%20OR%20schema%3AeducationalLevel.schema%3AtermCode%20exact%20be140797-803f-4b9e-81cc-5572c711e09c%29
[x] Make sure that in the link above the l4l string can be replaced with WikiwijsDelen or wikiwijsmaken (the set id from the source)
[x] Create this new HttpResource in the sources app. Create a factory for it similar to the other sources (with response data placed in the fixtures directory)
[x] Create a new extraction objective also in the sources app
[x] Prove with tests that the OAI-PMH resource + objective in the edurep app yield the same results as JSONSearch resource + objective. Or explain the difference.
[x] Adjust the search-client to transform legacy id prefixes to the new JSON prefixes.
[x] EXTRA: Store subject vocabulary terms in the study_vocabulary field as a list. After that, Edurep materials will be counted in MetadataValue.frequency
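
The query from the first task can be built from a set id, so that l4l can be swapped for WikiwijsDelen or wikiwijsmaken. A minimal sketch, assuming plain percent-encoding is all the endpoint needs. build_jsonsearch_url is a hypothetical helper for illustration, not the actual HttpResource in the sources app:

```python
from urllib.parse import quote

JSONSEARCH_ENDPOINT = "https://wszoeken.edurep.kennisnet.nl/jsonsearch"

# The two educational level term codes used in the URL of this ticket
EDUCATIONAL_LEVEL_TERM_CODES = (
    "bbbd99c6-cf49-4980-baed-12388f8dcff4",
    "be140797-803f-4b9e-81cc-5572c711e09c",
)


def build_jsonsearch_url(set_id):
    """Hypothetical helper: builds the search URL for one source set id."""
    levels = " OR ".join(
        f"schema:educationalLevel.schema:termCode exact {code}"
        for code in EDUCATIONAL_LEVEL_TERM_CODES
    )
    query = f"* AND about.repository exact {set_id} AND ({levels})"
    # safe="" makes sure ":", "(", ")" and "*" get percent-encoded as well
    return f"{JSONSEARCH_ENDPOINT}?query={quote(query, safe='')}"
```

Calling build_jsonsearch_url("l4l") reproduces the URL from the first task. Passing another set id only changes the about.repository clause.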

Kalle's Todo:

[x] Complete edurep extraction.
[x] Make fixture based on shared data between XML and JSON sources.
[x] Setup tests to test shared documents.
[x] Fix code based on tests.
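
To prove or explain differences between the OAI-PMH and JSONSearch output it helps to compare shared documents field by field. A rough sketch, assuming both extractions yield a plain dict per document. diff_documents and the sample values are hypothetical and only illustrate the idea:

```python
def diff_documents(xml_document, json_document):
    """Returns {field: (xml_value, json_value)} for every field that differs."""
    fields = sorted(set(xml_document) | set(json_document))
    return {
        field: (xml_document.get(field), json_document.get(field))
        for field in fields
        if xml_document.get(field) != json_document.get(field)
    }


# Made-up sample values, only to illustrate the output
xml_document = {"title": "Example", "language": "nl", "authors": ["Fako"]}
json_document = {"title": "Example", "language": None, "authors": ["F", "a", "k", "o"]}
differences = diff_documents(xml_document, json_document)
```

An empty result for a shared document means both sources agree. Anything else is a difference to either fix or explain.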

fako commented 1 year ago

Documentation about API migration: https://developers.wiki.kennisnet.nl/index.php?title=Edurep:Migraties/2021

fako commented 1 year ago

It is a bit doubtful what the benefits of the JSON update would be. We won't be able to shed the NL-LOM standard, because AnatomyTOOL uses it as well. We can harvest WikiwijsDelen instead of edurep_delen as a start and leave the rest as is.

fako commented 1 year ago

There are problems with vCards in WikiwijsDelen. We're writing parse errors to the harvester log and we'll see how often things go wrong there.

fako commented 1 year ago

The new Wikiwijs Delen does not specify the NL-LOM educational level, only classifications. We'd have to rewrite the educational level extraction to make this work, but it seems like a never-ending story. Kennisnet is not giving the XML output the care it needs. Perhaps we should migrate to JSON/search.

This query filters out all Wikiwijs Maken materials and is much more efficient than trying to harvest 35,000 materials for the roughly 2,000 relevant materials. https://wszoeken.edurep.kennisnet.nl/jsonsearch?query=(schema:educationalLevel%3DWO%20OR%20schema:educationalLevel%3DHBO)%20AND%20dcterms:publisher%3D%22Wikiwijs%20Maken%22&page-size=20&page=2

Here's more documentation on search possibilities: https://developers.wiki.kennisnet.nl/index.php?title=Edurep:Jsonsearch

Currently we're waiting for an answer from Kennisnet about L4L materials. It is already clear, however, that all links will be broken and I'm still trying to figure out if we can fix those ourselves or whether to take our losses.

fako commented 1 year ago

If we substitute edurep_delen: with WikiwijsDelen:urn:uuid: in the external_id we seem to be able to make a translation between Wikiwijs Delen and edurep_delen.
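
The substitution could be sketched as follows. translate_external_id is a hypothetical name, and the assumption is that only the prefix changes while the rest of the identifier stays the same:

```python
LEGACY_PREFIX = "edurep_delen:"
JSON_PREFIX = "WikiwijsDelen:urn:uuid:"


def translate_external_id(external_id):
    """Swaps the legacy prefix for the JSON prefix, other ids pass through."""
    if external_id.startswith(LEGACY_PREFIX):
        return JSON_PREFIX + external_id[len(LEGACY_PREFIX):]
    return external_id
```

Checking the prefix with startswith instead of a blind str.replace avoids touching identifiers that merely contain the legacy string somewhere in the middle.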

fako commented 1 year ago

There are two blockers:

[x] search_client needs to be updated in service to make redirects work properly
[x] uncertainty about mime_type and locations, where there are sometimes fewer mime_types than locations (only in Sharekit output so far)

fako commented 1 year ago

next_parameters still needs to be implemented to fetch all materials and not just 10 per source @kallepronk
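
The paging logic itself is small. A minimal sketch as a standalone function, assuming 1-based "page" and "page-size" parameters as they appear in the jsonsearch URLs of this ticket. How the total number of hits is read from the response is not covered here, because that field name is not in this ticket:

```python
def next_parameters(page, page_size, total):
    """Returns the query parameters for the next page, or None when done.

    Assumption: "page" is 1-based and "total" is the number of matching
    materials, which the caller has to read from the search response.
    """
    if page * page_size >= total:
        return None
    return {"page": page + 1, "page-size": page_size}


def all_pages(page_size, total):
    """Lists every page number that needs to be requested."""
    pages = [1]
    parameters = next_parameters(1, page_size, total)
    while parameters is not None:
        pages.append(parameters["page"])
        parameters = next_parameters(parameters["page"], page_size, total)
    return pages
```

The real method on the resource class will have a different signature. This only shows the stop condition and the increment.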

fako commented 1 year ago

These two responses differ a lot when it comes to language and authors, while they should be identical: https://dev.search.surfedushare.nl/api/v1/materials/jsonld-from-lom:l4l:oai:library.wur.nl:l4l%2F11603/ https://acc.search.edusources.nl/api/v1/materials/l4l:oai:library.wur.nl:l4l%2F11603/

Apparently it already goes wrong at the seeds: https://harvester.dev.surfedushare.nl/api/v1/document/raw/jsonld-from-lom%3Al4l%3Aoai%3Alibrary.wur.nl%3Al4l%252F11603/

fako commented 1 year ago

Another error happens: https://surf-eduservices.sentry.io/issues/4292577777/ Perhaps we should look at running a full fetch on localhost. Should be easy, but takes a minute. Makes it quicker to debug though.

To harvest JSONSearch seeds on localhost:

In the admin go to: core -> datasets -> delta
Make sure that WikiwijsDelen, wikiwijsmaken and l4l are all using JSON Search (click pencil and set repository field)
Make sure that for WikiwijsDelen, wikiwijsmaken and l4l the "stage" is "new"
./manage.py harvest_metadata -d delta -r sources.EdurepJsonSearchResource

fako commented 1 year ago

I took a slightly deeper dive. There are issues with wikiwijsmaken and WikiwijsDelen that we can't solve. There are issues with L4L that we should be able to solve. I think it's a good idea for you to take a deep dive as well and see if you can find the problems with wikiwijsmaken and WikiwijsDelen yourself. You'll learn a lot about the system from that. To circumvent the issues and focus on solvable problems you can put "wikiwijsmaken" and "WikiwijsDelen" on stage=complete.

fako commented 1 year ago

Edurep refuses to make the necessary changes. We need to discuss this with the product owners.

fako commented 1 year ago

The old situation is 878MB for 2400 materials. We do nightly increments, so it's only a real problem once per month (during the monthly reset) or when we want to debug something on localhost with that data. Data download/load commands already exclude Edurep for performance reasons. Filtering out materials is not really an option, because we need to be able to tell people what happened to material X. If we don't store all materials we won't be able to answer such questions.

fako commented 1 year ago

The good news is that currently everything has a publisher date. This means that we can harvest "years". A single year will almost certainly stay within limits (a monthly iteration could be a way to scale up). We would be using their search in a way that they don't like.
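
Splitting a harvest into yearly windows, with months as a fallback, could be sketched like this. Both function names are hypothetical. How a window translates into a jsonsearch query clause is left out on purpose, because the field name for the publisher date is not in this ticket:

```python
from datetime import date, timedelta


def year_windows(first_year, last_year):
    """Yields (start, end) dates, one inclusive window per calendar year."""
    for year in range(first_year, last_year + 1):
        yield date(year, 1, 1), date(year, 12, 31)


def month_windows(year):
    """Yields (start, end) dates per month, for years that exceed the limit."""
    for month in range(1, 13):
        start = date(year, month, 1)
        # The first day of the next month minus one day is this month's end
        next_start = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        yield start, next_start - timedelta(days=1)
```

Each window would get its own search request. A year that returns more results than the search limit allows can be re-run with month_windows.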

fako commented 1 year ago

Some things that weren't confirmed to work when this ticket was placed on hold:

Author names would be split into an array: ["F", "a", "k", "o"]
Unclear how well we retrieve consortium data
Language is sometimes set to None, while this should be "unk" when no language is set.
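
The split author names look like what happens when a single string gets treated as a list. That is a guess and has not been confirmed. A sketch of defensive normalization under that assumption, with hypothetical function names:

```python
def normalize_authors(authors):
    """Always returns a list of names; a bare string becomes a one-item list."""
    if authors is None:
        return []
    if isinstance(authors, str):
        # list("Fako") would give ["F", "a", "k", "o"], so wrap instead
        return [authors]
    return list(authors)


def normalize_language(language):
    """Falls back to "unk" when no language is set."""
    return language or "unk"
```

If the guess is right, the actual fix belongs wherever the extraction converts author values to a list. These helpers only show the intended end result.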

Here is some code to check consortium counts:

# Run inside the Django shell (./manage.py shell)
from collections import Counter

from core.models import DatasetVersion, Document

dv = DatasetVersion.objects.get_current_version()
# Count documents per consortium; .get() tolerates documents without the key
Counter(doc.properties.get("consortium") for doc in Document.objects.filter(dataset_version=dv))

This release reverts changes made to the search-client. If we ever try to implement this again we need to revert a commit mentioned in the release notes: https://github.com/surfedushare/search-client/releases/tag/v0.4.1

For the rest it's safe to keep the JSON Search Edurep code next to the current OAI-PMH code. We just shouldn't enable JSON Search in the database through the Dataset model.

fako commented 1 year ago

It has unfortunately been decided that this new Edurep connection will not go to production in the short term. This is because no more than 1000 learning materials can be harvested through this connection. Edurep has no plans to raise this limit and therefore this connection isn't useful to us.

surfedushare / search-portal

Update Edurep connection #760