inveniosoftware / invenio-rdm-records

DataCite-based data model for InvenioRDM flavour.
https://invenio-rdm-records.readthedocs.io
MIT License
15 stars 87 forks source link

InvenioRDM migration from v11 to v12 fails when registering parent DOIs #1851

Open danielbreves opened 1 month ago

danielbreves commented 1 month ago

Package version (if known): 10.8.6

Describe the bug

I got the following exception when running https://github.com/inveniosoftware/invenio-app-rdm/blob/master/invenio_app_rdm/upgrade_scripts/migrate_11_0_to_12_0.py:

Updating record : tcz41-yp821
> Error KeyError('doi')
  File "/usr/local/lib/python3.9/site-packages/invenio_app_rdm/upgrade_scripts/migrate_11_0_to_12_0.py", line 128, in update_record
    update_parent(record)
  File "/usr/local/lib/python3.9/site-packages/invenio_app_rdm/upgrade_scripts/migrate_11_0_to_12_0.py", line 104, in update_parent
    current_rdm_records.records_service.pids.register_or_update(
  File "/usr/local/lib/python3.9/site-packages/invenio_records_resources/services/uow.py", line 376, in inner
    res = f(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/invenio_rdm_records/services/pids/service.py", line 218, in register_or_update
    pid_manager.register(pid_record, scheme, url=url)
  File "/usr/local/lib/python3.9/site-packages/invenio_rdm_records/services/pids/manager.py", line 237, in register
    provider.register(pid, record=record, url=url)
  File "/usr/local/lib/python3.9/site-packages/invenio_rdm_records/services/pids/providers/datacite.py", line 173, in register
    doc = self.serializer.dump_obj(record)
  File "/usr/local/lib/python3.9/site-packages/flask_resources/serializers/base.py", line 66, in dump_obj
    return self.object_schema.dump(obj)
  File "/usr/local/lib/python3.9/site-packages/marshmallow/schema.py", line 553, in dump
    result = self._serialize(processed_obj, many=many)
  File "/usr/local/lib/python3.9/site-packages/marshmallow/schema.py", line 521, in _serialize
    value = field_obj.serialize(attr_name, obj, accessor=self.get_attribute)
  File "/usr/local/lib/python3.9/site-packages/marshmallow/fields.py", line 341, in serialize
    return self._serialize(value, attr, obj, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/marshmallow/fields.py", line 1985, in _serialize
    return self._serialize_method(obj)
  File "/usr/local/lib/python3.9/site-packages/invenio_rdm_records/resources/serializers/datacite/schema.py", line 421, in get_related_identifiers
    version_doi = version["pids"]["doi"]

This happened because a child record didn't have a DOI. Could that line be version["pids"].get("doi")?

Steps to Reproduce

  1. Setup an instance of Invenio RDM v11
  2. Create and publish a record without a DOI
  3. Create a new version with a DOI
  4. Attempt to migrate to v12
  5. See error

Expected behavior

The migration should succeed since DOIs can be optional

tmorrell commented 1 month ago

Thanks for this report!

There are currently two main DOI configurations in InvenioRDM

DATACITE_ENABLED = True - All records have a DOI DATACITE_ENABLED = False - No records have a DOI

So then in the migration script https://github.com/inveniosoftware/invenio-app-rdm/blob/810befded43dc5ea49198bf992d6e2393c75ff1e/invenio_app_rdm/upgrade_scripts/migrate_11_0_to_12_0.py#L85 the parent DOI block will skip registering DOIs when DATACITE is not enabled.

In v12 we also added an additional "is_enabled" setting, which provides some ability to determine whether DOIs are minted based on set criteria. This is also checked in the migration script.

It sounds like you have records that don't have DOIs but you have DATACITE enabled. Do you really want parent DOIs minted for these records? That seems like a unusual choice.

I suspect you'll either want to disable parent DOIs with RDM_PARENT_PERSISTENT_IDENTIFIERS={}, or develop a is_enabled setting to exclude those records without DOIs. But please respond back with your use case if neither of those options sound appropriate.

fenekku commented 1 month ago

I think the underlying issue here is that at time of calling register_or_update (https://github.com/inveniosoftware/invenio-app-rdm/blob/master/invenio_app_rdm/upgrade_scripts/migrate_11_0_to_12_0.py#L104) the records with "pids": {"doi": ...} have not been indexed yet (migrate script only commits them to DB not document engine). The manual call to update the records in the document engine (pipenv run invenio rdm rebuild-all-indices) is only called after the script has been run. But https://github.com/inveniosoftware/invenio-rdm-records/blob/master/invenio_rdm_records/resources/serializers/datacite/schema.py#L415 relies on that indexing to have occurred already to get the versions that are serialized as related identifiers to Datacite.

The migrate script should be split in 2 and interleaved with an enjoinment to the reader to run pipenv run invenio rdm rebuild-all-indices before running the 2nd script that would run the register_or_update code. @danielbreves can use that approach right away too.

tmorrell commented 1 month ago

@fenekku I don't think that's the issue. The child records should already have DOIs before the migration script runs. The migration script is only adding the parent DOI, not the child DOI.

fenekku commented 1 month ago

Hmm :thinking: ... Here is the situation, I had a similar issue with the migration script and minting the parent and splitting the script in 2 passes is what solved it for me, because I think that despite the "child" records already having "pids: {"doi": ...}, they are not indexed in the new index at this point. They were only in the old one and the new one was consulted (because v12 is running)...

Nevertheless, it may not be quite what the original problem was. @danielbreves How did you create a record with a DOI and a record without a DOI in the same instance? Did you change the datacite settings @tmorrell mentioned above (in-between or otherwise)? Understanding that would help us identify the underlying issue.

danielbreves commented 1 month ago

@tmorrell @fenekku thanks for the fast reply! Our InvenioRDM instance started with DATACITE_ENABLED = False and had some records created without DOIs, then we enabled DATACITE and created new versions of records with DOIs.

It sounds like you have records that don't have DOIs but you have DATACITE enabled. Do you really want parent DOIs minted for these records? That seems like a unusual choice.

We would like parent DOIs minted, since all future records should have DOIs.

danielbreves commented 1 month ago

As a side note, I had to add the line below to the migration script to make it print out the stacktrace, since KeyError('doi') wasn't very helpful.

import traceback

traceback_str = ''.join(traceback.format_tb(e.__traceback__))
secho(traceback_str, fg="red")
fenekku commented 1 month ago

Our InvenioRDM instance started with DATACITE_ENABLED = False and had some records created without DOIs, then we enabled DATACITE and created new versions of records with DOIs.

Ah got it. Then yes, I think it'd be reasonable if you could make a PR with that change from ["doi"] to .get("doi") with a comment explaining how DOI minting may have been enabled after the fact. That way, when no "doi", this will skip serializing that entry which is fine.

Thanks!