Closed fako closed 1 year ago
A proposal to create this feature.
Every HarvestSource (https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/models/harvest.py#L14) should get a repository_name. These names should be set to "sharekit", "edurep", "hku", "hva" etc. etc. These names correspond with the first part of the new URN.
When we create a Collection (https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/management/commands/harvest_metadata.py#L169) we should use the repository_name as well as the spec to create the collection name. The collection name could become "sharekit:edusources" for example. The "referee" of a new collection should become "urn" (see: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/management/commands/harvest_metadata.py#L172)
Here we set the "id" of a document when it gets created (by metadata command or the webhook): https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/models/datatypes/document.py#L17 This should become "urn" to make clear what it tries to achieve (even though it's not a true URN, we could use SRN for SURF Resource Name if desired). As far as I know we don't use this id property anymore and it's safe to rename it to urn. You can concatonate the new collection.repository_name and external_id with a : to get the urn value.
The harvest_metadata command uses external_id to look for existing Documents and Extensions to update/merge with the seeds. See this file: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/management/commands/harvest_metadata.py We should rewrite this to use the "urn" and not the "external_id". Existing extensions (only at NPPO) need to be migrated or John Meurs should agree to delete all of the existing data.
The to_search method of Document uses external_id to set the _id property. This should use urn instead. See: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/models/datatypes/document.py#L108-142
The sync_inidices task uses collection name to determine whether it should or should not update a Document (that was most likely changed by the webhook). See: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/tasks/open_search.py This dependency on Collection.name is a patch to solve a bug. When we integrate NPPO I plan to remove this patch. So I don't think it should get touched with this story, but it's important to know that webhooks probably will stop working. I plan to integrate NPPO and Edusources soon. There is already a "release" MR. I'm fixing the problems that arise from it. Progress can be found here: https://github.com/nppo/search-portal/issues/57
The Extension view also heavily uses external_id to check if documents and extensions match. This should all be rewritten to use the urn property instead.
When implementing this story a new Dataset named "delta" should be created. The gamma dataset should remain as is, while delta will contain the new urn. You can write a migration to deactivate gamma and create + activate delta. Delta should be marked as is_latest. When you write a migration for it be sure that you don't hardcode any values. Take the last active Dataset and copy any Harvest objects over to the new Dataset. Don't assume anything about names of previous datasets. The migration should be able to run on NPPO and Edusources development. Both are currently still on the alpha dataset, but we should move this to delta to make it the same everywhere.
A lot of tests are expected to fail. Make sure that you edit the dataset fixtures: https://gitlab.surf.nl/edusources/edusources/-/tree/acceptance/harvester/core/fixtures As well as update the factories that still use a very old Edurep based value as reference: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/tests/factories/factories.py#L53
All the logging is done based on external_id: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/logging.py#L72 This should also become URN. This will invalidate all historic logs. I don't think that is a problem, because we mostly log for debugging purposes and we ignore logs older than a few days most of the time.
Tests should look at urn to see if a seed is valid not the external_id. See: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/core/tests/base.py#L29
It is time to delete sync_sharekit_metadata and everything related to it like its tests: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/harvester/sharekit/tasks.py#L22
So far everything described above is only dealing with the harvester. On to the service. In the service there is a model named Material that uses external_id which should become urn: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/apps/materials/models.py#L25 Have a good look at all the views in the materials app: https://gitlab.surf.nl/edusources/edusources/-/tree/acceptance/service/surf/apps/materials/views They use the Material model quite heavily, but also be careful, because the external_id used in widget.py has nothing to do with what you're trying to change. So keep that the same :) The widget template does use the "material" external id and it should change: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/apps/materials/jinja2/widget/index.html In the materials app there are also util functions that reference the external_id: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/apps/materials/utils.py#L12 And also: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/apps/materials/utils.py#L75 Again with the utils file. Be careful not to update the filters externalid as well. Those are different and are not getting renamed. This import in the materials app is a bit unexpected so I'm pointing it out: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/apps/materials/views/app.py#L9 That function should get renamed and the prefix should be removed. The rest of the app file also uses external_id which should become urn. Some stuff in the materials app can get cleaned up like the applaud view: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/apps/materials/views/metrics.py#L30
The get_materials_by_id from the SearchClient should get updated to use urn: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/vendor/search/api.py#L258 Here there should also be a fallback that tries to match on the external_id if the urn yields no results. When you implement that logic here it should propagate to everywhere where this is needed. Apart from the very essential get_materials_by_id you also need to update the more_like_this method: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/vendor/search/api.py#L307
This filter https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/apps/materials/filters.py#L11 is still using material_id. This should become urn as well, because material_id is a very old name for material.external_id.
The similarity documentation references external_id. We should update that: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/apps/core/schema.py#L47
The external_id is also used in some of the routes: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/urls.py Notice that one of those routes has a translation: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/surf/locale/en/LC_MESSAGES/django.po#L28-30
The e2e mock file uses external_id and this should become urn: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/e2e_tests/mock.py We can keep the external_id in the mock output (https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/e2e_tests/mock.py#L19) next to a new urn property There's no need to be able to set the external_id, instead we want to set urn: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/service/e2e_tests/mock.py#L80 There are of course for legacy e2e tests. Bringing these to Cypress might be a good way to work around issues if you encounter them.
In general a warning. Every external_id that has to do with Community or filters are not supposed to change. There are quite a few of them. So make sure you're not changing those.
Last but not least the frontend. Quite a large part of the external_id in the frontend refer to filter external_id's and those should remain the same. Everywhere where you see a material.external_id it's probably safe to assume it needs to become material.urn. I only list things here that are not using material.external_id. However don't blindly search and replace material.external_id, because often in the same statement it references external_id. Often it is somehting like: {external_id: material.external_id}. If you blindly search and replace on material.external_id you're going to miss out on the first reference here.
Although this file deals with some filter stuff it is mostly about materials as the name suggests: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/portal/src/store/modules/materials.js So most of the external_id's need to change to urn here.
Bit of an odd one, but good to update this: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/portal/matomo.md?plain=1#L42
This sneaky one needs an update or things will break: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/portal/src/components/Collections/AddMaterialPopup.vue#L102 Here's another sneaky one: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/portal/src/components/Materials/MaterialSet/MaterialPartOfSet.vue#L50 This one is a bit less sneaky, but pointing it out anyway: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/portal/src/components/Materials/MaterialSet/MaterialSet.vue#L16
This is a good opportunity to remove the navigation component: https://gitlab.surf.nl/edusources/edusources/-/tree/acceptance/portal/src/components/Materials/Navigation It hasn't been in use for over a year.
Here the similarity call is made with external_id and that should be urn: https://gitlab.surf.nl/edusources/edusources/-/blob/acceptance/portal/src/pages/material.vue#L89
Why
As an edusources developer I want to use URNs instead of material ids so I can avoid conflicts with different SURFsharekit channels
Acceptance Criteria
Notes: External id's can be used to detect "old" links and redirect the user to the new URN based URL