sul-dlss / dor-services-app

A Rails application exposing Digital Object Registry functions as a RESTful HTTP API
https://sul-dlss.github.io/dor-services-app/

[SPIKE] Test migration-utils #3629

Closed justinlittman closed 2 years ago

justinlittman commented 2 years ago

Test migration-utils as an alternative to bin/archive-fedora.

For this test, export 250K objects to OCFL using sdr-deploy server. Please estimate how much space and time would be required for 4.3M objects.

edsu commented 2 years ago

migration-utils can use either the native fcrepo3 filesystem or a directory of exported FOXML data. Since I had a directory of 1000 FOXML files on my workstation that were generated with archive-fedora, I tested with that. The XML files need to be decompressed for migration-utils to be able to read them.

Here is the command I used:

/usr/bin/java -jar migration-utils-6.1.0-driver.jar \
  --source-type exported \
  --target-dir ocfl \
  --exported-dir archive \
  --no-checksum-validation \
  --continue-on-error

The --continue-on-error flag was needed because the run hit 20 errors like this one (by default it stops at the first error):

ERROR 13:53:38.765 (Migrator) MIGRATION_FAILURE: pid="druid:bb059dp5973", message="text"
java.lang.NullPointerException: text
        at java.base/java.util.Objects.requireNonNull(Objects.java:233)
        at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1955)
        at java.base/java.time.Instant.parse(Instant.java:399)
        at org.fcrepo.migration.handlers.ocfl.ArchiveGroupHandler.createDatastreamHeaders(ArchiveGroupHandler.java:618)
        at org.fcrepo.migration.handlers.ocfl.ArchiveGroupHandler.processObjectVersions(ArchiveGroupHandler.java:256)
...

Also, --no-checksum-validation was needed to avoid over a thousand warnings like:

WARN 13:55:40.246 (ArchiveGroupHandler) info:fedora/druid:bb052vc4171/descMetadata: missing/invalid digest. Writing resource & continuing

Maybe these problems are an artifact of working with the FOXML export?

It took 74 seconds to convert 1000 FOXML files. The directory of uncompressed FOXML files was 190M and the resulting OCFL directory was 1.7G !

I think next I'll try working with the native fcrepo3 filesystem, which will need to be done on sul-dor-migrate instead of sdr-deploy.

edsu commented 2 years ago

I ticketed the NullPointerException over in the migration-utils issue tracker in case there's something obvious going on here:

https://github.com/fcrepo-exts/migration-utils/issues/180

edsu commented 2 years ago

Getting migration-utils working on sul-dor-migrate is a little bit tricky. Java OpenJDK Runtime Environment (build 1.8.0_322-b06) is already available, but when I try to run the jar I get an error:

[lyberadmin@sul-dor-migrate ocfl]$ java -jar migration-utils-6.1.0-driver.jar
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/fcrepo/migration/PicocliMigrator has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:601)

It looks like there's a Java version mismatch that needs to be resolved to use the jar. I did try to build migration-utils myself using the Java that is on sul-dor-migrate by installing maven. But the version of maven that is yum installable isn't up to 3.1 yet, which is required by migration-utils.
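The two numbers in the error message can be decoded directly: a class file major version N corresponds to Java release N - 44. A minimal sketch of that mapping follows; the function name is made up for illustration.

```python
def java_release_for_class_version(major: int) -> int:
    """Return the Java release that corresponds to a class file major version.

    Class file major version 45 belongs to the first Java releases, and each
    later release increments it by one, so the release is major - 44.
    """
    if major < 45:
        raise ValueError(f"not a valid class file major version: {major}")
    return major - 44


# The jar was compiled to class file version 55.0; the system Java reads up to 52.0.
print(java_release_for_class_version(55))  # 11
print(java_release_for_class_version(52))  # 8
```

So the jar targets Java 11, while the installed runtime (1.8.0_322) is Java 8.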

pwinckles commented 2 years ago

Read this ticket since it was linked to the one you opened against migration-utils. A couple of notes:

  1. migration-utils requires Java 11+, so it will not run with Java 8
  2. The OCFL layout will necessarily take more space than the Fedora 3 layout. However, an increase from 190MB to 1.7GB is unexpected. I am curious about what's going on here.

edsu commented 2 years ago

Thank you @pwinckles! My bad, I had read the Java11 requirement and for some reason interpreted Java1.8 as v18.

If there's any information I can provide about the foxml or ocfl that might help explain the size increase please let me know.

pwinckles commented 2 years ago

You could execute something like the following in the ocfl root directory:

find . -type f -printf "%s\t%f\n" | awk '{c[$2] += 1; s[$2] += $1} END {for (f in c) printf("%s\t%-9s\t%s\n", c[f], s[f], f)}' | sort -k2nr

The first column in the output is the number of files with the same name, the second column is the total number of bytes used by all files with that name, and the final column is the file name.
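If GNU find or awk is not available, roughly the same summary can be produced in Python. This is a sketch rather than an exact port: the function name is hypothetical, and unlike the awk version it does not split file names on whitespace.

```python
import os
from collections import defaultdict


def summarize_by_filename(root):
    """Return (count, total_bytes, name) rows, largest total first."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # `find -type f` does not report symlinks
            counts[name] += 1
            sizes[name] += os.path.getsize(path)
    rows = [(counts[name], sizes[name], name) for name in counts]
    # Sort by total bytes, descending, like `sort -k2nr`.
    return sorted(rows, key=lambda row: row[1], reverse=True)


def print_summary(root="."):
    """Print the rows in the same three-column form as the shell pipeline."""
    for count, total, name in summarize_by_filename(root):
        print(f"{count}\t{total:<9}\t{name}")
```

Calling `print_summary("ocfl")` from the directory that contains the OCFL root would print one line per distinct file name.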

It sounds like your sample migration set was 1000 objects, yes? Do you have any idea off hand if datastreams in those objects were heavily versioned? Were all of the datastreams internal to the FOXML? If not, did the number you gave include size of the datastreams that are outside the FOXML?

edsu commented 2 years ago

You can find my current experiment on sul-dor-migrate in /data/ocfl where I've dropped a local Java18 environment that doesn't interfere with the system Java. Unfortunately I ran into a different kind of error when running against 1000 of our objects using the fcrepo3 filesystem.

java -jar migration-utils-6.1.0-driver.jar \
  --source-type legacy \
  --datastreams-dir /home/lyberadmin/apps/fedora/home/data/datastreamStore \
  --objects-dir /home/lyberadmin/apps/fedora/home/data/objectStore \
  --target-dir ocfl \
  --limit 1000 \
  --continue-on-error \
  --debug

The error that keeps recurring is:

ERROR 05:05:41.716 (Migrator) MIGRATION_FAILURE: pid="druid:cf604sc3425", message="Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!"
java.lang.RuntimeException: Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!
        at org.fcrepo.migration.foxml.DirectoryScanningIDResolver.resolveInternalID(DirectoryScanningIDResolver.java:129)
        at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor$Foxml11DatastreamVersion.<init>(FoxmlInputStreamFedoraObjectProcessor.java:401)
        at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor.processObject(FoxmlInputStreamFedoraObjectProcessor.java:194)
        at org.fcrepo.migration.Migrator.run(Migrator.java:161)
        at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:328)
        at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:51)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1743)
        at picocli.CommandLine.access$900(CommandLine.java:145)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2101)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2068)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:1935)
        at picocli.CommandLine.execute(CommandLine.java:1864)
        at org.fcrepo.migration.PicocliMigrator.main(PicocliMigrator.java:175)

I dropped an issue into the migration-utils issue tracker to just double check that I'm not doing something wrong:

https://github.com/fcrepo-exts/migration-utils/issues/181

edsu commented 2 years ago

I think the Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0" is the result of the flawed copy of the fcrepo3 filesystem on sul-dor-migrate.

edsu commented 2 years ago

Actually I take it back, what is flawed is my understanding of how the fcrepo3 filesystem works. I had thought the datastreamStore subdirectory was named using the first two characters in the pid, but on closer inspection I can see this is not the case.

edsu commented 2 years ago

I got some clarity about which fcrepo3 storage we are using (akubra). It also matters how migration-utils was first run, because it generates an index to do datastream lookups efficiently; if the index was generated with the wrong storage type, lookups will always fail.

I'm timing a run with a sample of 50k instead of 250k to ensure I don't overrun available disk space on sul-dor-migrate (415G on /data).

edsu commented 2 years ago

The command I ran was:

java -jar migration-utils-6.2.0-SNAPSHOT-driver.jar \
  --source-type akubra \
  --datastreams-dir /home/lyberadmin/apps/fedora/home/data/datastreamStore \
  --objects-dir /home/lyberadmin/apps/fedora/home/data/objectStore \
  --target-dir ocfl \
  --limit 50000 \
  --continue-on-error \
  --migration-type PLAIN_OCFL \
  --foxml-file \
  --debug \
  > convert.log

It took 10.5 hours to convert 50,000 records. The resulting OCFL tree used 35GB of space. I didn't monitor iostat during the conversion, but I suspect that most of the time was spent reading from and writing to sf5-webapp-dev:/sul_dor_migrate_data rather than in the tool itself.
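For the 4.3M-object estimate the ticket asks for, a straight linear extrapolation from this run looks like the following. It is a rough sketch that assumes time and space scale linearly with object count and that the 50k sample is representative; neither assumption was verified here.

```python
def extrapolate(sample_objects, sample_hours, sample_gb, target_objects):
    """Scale the measured time and space linearly to a larger object count."""
    scale = target_objects / sample_objects
    return sample_hours * scale, sample_gb * scale


# Figures from the 50,000-object akubra run: 10.5 hours, 35GB.
hours, gb = extrapolate(50_000, 10.5, 35, 4_300_000)
print(f"{hours:.0f} hours (~{hours / 24:.0f} days), {gb / 1000:.1f} TB")
# prints: 903 hours (~38 days), 3.0 TB
```

Applied to the earlier 1000-object FOXML test (74 seconds, 1.7G), the same arithmetic gives roughly 88 hours and 7.3 TB, so the two samples disagree considerably and either figure is only a rough guide.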

There were 431 errors that all looked to be the result of a failure to look up a datastream ID:

INFO 16:21:00.892 (Migrator) Processing "druid:bc359gg4344"...
ERROR 16:21:00.896 (Migrator) MIGRATION_FAILURE: pid="druid:bc359gg4344", message="Unable to resolve internal ID "druid:bc359gg4344+rightsMetadata+rightsMetadata.0"!"
java.lang.RuntimeException: Unable to resolve internal ID "druid:bc359gg4344+rightsMetadata+rightsMetadata.0"!
        at org.fcrepo.migration.foxml.DirectoryScanningIDResolver.resolveInternalID(DirectoryScanningIDResolver.java:129)
        at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor$Foxml11DatastreamVersion.<init>(FoxmlInputStreamFedoraObjectProcessor.java:404)
        at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor.processObject(FoxmlInputStreamFedoraObjectProcessor.java:195)
        at org.fcrepo.migration.Migrator.run(Migrator.java:161)
        at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:328)
        at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:51)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1743)
        at picocli.CommandLine.access$900(CommandLine.java:145)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2101)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2068)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:1935)
        at picocli.CommandLine.execute(CommandLine.java:1864)
        at org.fcrepo.migration.PicocliMigrator.main(PicocliMigrator.java:175)

These appear to be legit errors because the data stream is actually found using a slightly different PID: druid:bc359gg4344/rightsMetadata/rightsMetadata.0 instead of druid:bc359gg4344+rightsMetadata+rightsMetadata.0 (note the slashes instead of plusses).

If the OCFL export were desirable enough at this point I guess we could fix the incorrect FOXML, or we could rig migration-utils to update it as it went. Otherwise these objects would not be present in the export.
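Either fix comes down to the same small rewrite of the internal ID. A minimal sketch is below; the function name is hypothetical and it is not part of migration-utils. Where it would be applied (to the FOXML beforehand, or inside a patched resolver) is left open.

```python
def plus_to_slash_id(internal_id: str) -> str:
    """Rewrite pid+dsid+versionid as pid/dsid/versionid.

    IDs that do not have exactly three '+'-separated parts are returned
    unchanged, so IDs already in the slash form pass through untouched.
    """
    parts = internal_id.split("+")
    if len(parts) != 3:
        return internal_id
    return "/".join(parts)


print(plus_to_slash_id("druid:bc359gg4344+rightsMetadata+rightsMetadata.0"))
# druid:bc359gg4344/rightsMetadata/rightsMetadata.0
```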

It also looks like there were 533,943 warnings like this:

WARN 16:20:47.214 (ArchiveGroupHandler) info:fedora/druid:bb022jb1472/contentMetadata: missing/invalid digest. Writing resource & continuing.

These warnings did not prevent the objects from being written. It wasn't clear to me whether the digests were missing or invalid, but I assumed the former given the sheer number of them.
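Counts like these can be pulled out of convert.log with a short script. The sketch below relies only on the two line shapes quoted in this thread (MIGRATION_FAILURE errors and "missing/invalid digest" warnings); the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the pid in lines such as:
#   ERROR ... (Migrator) MIGRATION_FAILURE: pid="druid:bc359gg4344", message="..."
FAILURE = re.compile(r'MIGRATION_FAILURE: pid="([^"]+)"')
DIGEST_WARNING = "missing/invalid digest"


def tally(lines):
    """Count failures and digest warnings; also collect the failed PIDs."""
    counts = Counter()
    failed_pids = set()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts["failures"] += 1
            failed_pids.add(match.group(1))
        elif DIGEST_WARNING in line:
            counts["digest_warnings"] += 1
    return counts, failed_pids
```

With the log on disk, `with open("convert.log") as f: counts, pids = tally(f)` gives the totals plus the set of PIDs that would be missing from the export.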

@andrewjbtw if you would like to examine the resulting OCFL tree you should be able to find it at:

sul-dor-migrate.stanford.edu:/data/ocfl/ocfl

It is kind of nice because all the versions are laid out on the filesystem:

[lyberadmin@sul-dor-migrate 650]$ tree 443d0c650c5ac35dfd9b681a9b490d812ab366381d9e887d95322406b17307cd/
443d0c650c5ac35dfd9b681a9b490d812ab366381d9e887d95322406b17307cd/
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1
│   ├── content
│   │   ├── AUDIT
│   │   ├── DC
│   │   └── FOXML
│   ├── inventory.json
│   └── inventory.json.sha512
├── v10
│   ├── inventory.json
│   └── inventory.json.sha512
├── v11
│   ├── content
│   │   └── identityMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v12
│   ├── content
│   │   └── events
│   ├── inventory.json
│   └── inventory.json.sha512
├── v13
│   ├── content
│   │   └── provenanceMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v14
│   ├── content
│   │   └── rightsMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v15
│   ├── inventory.json
│   └── inventory.json.sha512
├── v16
│   ├── content
│   │   └── provenanceMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v17
│   ├── content
│   │   └── contentMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v18
│   ├── content
│   │   └── rightsMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v19
│   ├── content
│   │   └── descMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v2
│   ├── content
│   │   └── RELS-EXT
│   ├── inventory.json
│   └── inventory.json.sha512
├── v20
│   ├── content
│   │   └── versionMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v21
│   ├── inventory.json
│   └── inventory.json.sha512
├── v22
│   ├── content
│   │   └── provenanceMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v23
│   ├── inventory.json
│   └── inventory.json.sha512
├── v3
│   ├── content
│   │   └── workflows
│   ├── inventory.json
│   └── inventory.json.sha512
├── v4
│   ├── content
│   │   └── identityMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v5
│   ├── content
│   │   └── RELS-EXT
│   ├── inventory.json
│   └── inventory.json.sha512
├── v6
│   ├── content
│   │   └── descMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v7
│   ├── content
│   │   └── rightsMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
├── v8
│   ├── content
│   │   └── contentMetadata
│   ├── inventory.json
│   └── inventory.json.sha512
└── v9
    ├── content
    │   └── technicalMetadata
    ├── inventory.json
    └── inventory.json.sha512
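Note that tree sorts the version directories as strings (v1, v10, v11, ...). The same picture in numeric order can be read from the manifest block of the object's inventory.json, which maps each content digest to content paths such as v1/content/DC. A minimal sketch, with a hypothetical function name:

```python
import json
from collections import defaultdict


def content_by_version(inventory: dict) -> dict:
    """Group manifest content paths by the version directory that holds them."""
    added = defaultdict(list)
    for _digest, paths in inventory["manifest"].items():
        for path in paths:
            version, _, rest = path.partition("/")
            added[version].append(rest)
    # Order numerically so that v2 comes before v10.
    return dict(sorted(added.items(), key=lambda kv: int(kv[0].lstrip("v"))))


def load_inventory(path):
    """Read an inventory.json file from disk."""
    with open(path) as f:
        return json.load(f)
```

Versions that stored no new content (v10, v15, v21 and v23 in the listing above) will not appear in the result.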

I think I can call this SPIKE done for now.

edsu commented 2 years ago

Reopening based on conversation in Slack about whether our PID (DRUID) can be the key used to generate the OCFL path or not. It kind of defeats the purpose of OCFL if we can't use our ID to quickly look up objects?

pwinckles commented 2 years ago

Check the --id-prefix option

edsu commented 2 years ago

@pwinckles thanks for the pointer! Just to clarify in case it wasn't clear from the above, we currently have Fedora PIDs that look like druid:bk854wj7180 and when migration-utils runs it creates an OCFL Object Root (in this case) of:

ocfl/aa5/6b6/afc/aa56b6afcf3026c8802b93bda2f20a89d23189cdc61f64140450baf18746ee5a

We were really hoping we could use the PID to create the OCFL Object Root instead of what looks like it might be a UUID? So in this case some variation of:

ocfl/bk8/54w/j71/bk854wj7180   

It looks like --id-prefix just controls whether info:fedora/ is included in the id property found in the Object Inventory (inventory.json)?

pwinckles commented 2 years ago

Yes, that's correct. --id-prefix only affects the OCFL object id and does not affect the OCFL storage layout. It is possible to change the layout, but doing so is a little more involved.

I believe this is the storage layout that you're wanting: https://ocfl.github.io/extensions/0007-n-tuple-omit-prefix-storage-layout.html
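As an illustration of what that layout does to a druid, here is a small sketch of the mapping. It covers only one configuration (":" delimiter, tuples of 3 characters, 3 tuples, left zero-padding, no reversal); the function name is hypothetical, and splitting on the last occurrence of the delimiter is my reading of the extension rather than something confirmed in this thread.

```python
def object_root_path(object_id, delimiter=":", tuple_size=3, number_of_tuples=3):
    """Map an object id such as druid:bk854wj7180 to its n-tuple directory path."""
    # Drop the prefix: everything up to and including the last delimiter.
    _, found, suffix = object_id.rpartition(delimiter)
    if not found or not suffix:
        raise ValueError(f"no usable prefix delimiter in {object_id!r}")
    # Left-pad with zeros when the id is shorter than the tuples need.
    padded = suffix.rjust(tuple_size * number_of_tuples, "0")
    tuples = [
        padded[i * tuple_size:(i + 1) * tuple_size]
        for i in range(number_of_tuples)
    ]
    # The final path segment is the prefix-omitted id itself.
    return "/".join(tuples + [suffix])


print(object_root_path("druid:bk854wj7180"))  # bk8/54w/j71/bk854wj7180
```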

In order to use this layout, you have to do the following:

  1. Before running the migration utility you need to manually create the ocfl-root directory exactly where the migration utility would normally create it.
  2. In the manually created directory, you need to write the 0=ocfl_1.0 file
  3. Then, add the ocfl_layout.json with the following content:
    {
      "extension" : "0007-n-tuple-omit-prefix-storage-layout",
      "description" : "This storage root extension describes an OCFL storage layout combining a pairtree-like root directory structure derived from prefix-omitted object identifiers, followed by the prefix-omitted object identifier themselves. The OCFL object identifiers are expected to contain prefixes which are removed in the mapping to directory names. The OCFL object identifier prefix is defined as all characters before and including a configurable delimiter. Where the prefix-omitted identifier length is less than tuple size * number of tuples, the remaining object id (prefix omitted) is left or right-side, zero-padded (configurable, left default), or not padded (none), and optionally reversed (default false). The object id is then divided into N n-tuple segments, and used to create nested paths under the OCFL storage root, followed by the prefix-omitted object id directory."
    }
  4. Also in the root, create the directory extensions/0007-n-tuple-omit-prefix-storage-layout
  5. In that directory write config.json with the following content:
    {
      "delimiter" : ":",
      "tupleSize" : 3,
      "numberOfTuples" : 3,
      "zeroPadding" : "left",
      "reverseObjectRoot" : false,
      "extensionName" : "0007-n-tuple-omit-prefix-storage-layout"
    }
  6. Finally, start the migration as you normally would

Now, it will not create a new OCFL repository and will write to the one you manually created, using the layout you specified.
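The manual steps above can be scripted. The following is a sketch only: the function name is hypothetical, the description in ocfl_layout.json is shortened to a placeholder, and the content written to the 0=ocfl_1.0 file (the text ocfl_1.0 followed by a newline) comes from the OCFL specification rather than from this thread.

```python
import json
from pathlib import Path

EXTENSION = "0007-n-tuple-omit-prefix-storage-layout"


def create_storage_root(root):
    """Create an empty OCFL storage root configured for the n-tuple layout."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)

    # Step 2: the storage root conformance declaration.
    (root / "0=ocfl_1.0").write_text("ocfl_1.0\n")

    # Step 3: declare which layout extension the root uses.
    layout = {
        "extension": EXTENSION,
        "description": "n-tuple layout with the id prefix omitted",  # shortened
    }
    (root / "ocfl_layout.json").write_text(json.dumps(layout, indent=2) + "\n")

    # Steps 4 and 5: the extension directory and its config.json.
    ext_dir = root / "extensions" / EXTENSION
    ext_dir.mkdir(parents=True, exist_ok=True)
    config = {
        "delimiter": ":",
        "tupleSize": 3,
        "numberOfTuples": 3,
        "zeroPadding": "left",
        "reverseObjectRoot": False,
        "extensionName": EXTENSION,
    }
    (ext_dir / "config.json").write_text(json.dumps(config, indent=2) + "\n")
    return root
```

Calling `create_storage_root("ocfl")` before the migration would correspond to steps 1 through 5, with "ocfl" standing in for whatever is passed as --target-dir.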

However, I suspect it will not work correctly without upgrading the version of ocfl-java. It's currently using version 1.4.3, which has a bug for this particular layout. You should be able to override the version by adding the following to the migration-utils pom:

<dependency>
    <groupId>edu.wisc.library.ocfl</groupId>
    <artifactId>ocfl-java-core</artifactId>
    <version>1.4.6</version>
</dependency>

We will get the version Fedora is using updated sometime before the next release.

[Edit] To be clear, if you want that layout you need to both set --id-prefix to an empty string and manually set the OCFL storage layout

edsu commented 2 years ago

Thanks @pwinckles -- I think I followed your instructions above, but I ran into this error when running with the new jar:

SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions prior to 1.8.
SLF4J: Ignoring binding found at [jar:file:/data/ocfl/migration-utils-6.2.0-SNAPSHOT-driver.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#ignoredBindings for an explanation.
class org.slf4j.helpers.NOPLoggerFactory cannot be cast to class ch.qos.logback.classic.LoggerContext (org.slf4j.helpers.NOPLoggerFactory and ch.qos.logback.classic.LoggerContext are in unnamed module of loader 'app')
java.lang.ClassCastException: class org.slf4j.helpers.NOPLoggerFactory cannot be cast to class ch.qos.logback.classic.LoggerContext (org.slf4j.helpers.NOPLoggerFactory and ch.qos.logback.classic.LoggerContext are in unnamed module of loader 'app')
    at org.fcrepo.migration.PicocliMigrator.setDebugLogLevel(PicocliMigrator.java:201)
    at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:211)
    at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:51)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1743)
    at picocli.CommandLine.access$900(CommandLine.java:145)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2101)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2068)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:1935)
    at picocli.CommandLine.execute(CommandLine.java:1864)
    at org.fcrepo.migration.PicocliMigrator.main(PicocliMigrator.java:175)

Maybe there's something else that needs to be added to the pom?

pwinckles commented 2 years ago

Try it with:

    <dependency>
      <groupId>edu.wisc.library.ocfl</groupId>
      <artifactId>ocfl-java-core</artifactId>
      <version>1.4.6</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-api</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

edsu commented 2 years ago

This worked perfectly @pwinckles!

I'm going to re-run the 50k test with these options.

edsu commented 2 years ago

Since the setup got a bit more complicated I've created a repo to track the steps:

https://github.com/sul-dlss-labs/dor-ocfl-migrate

edsu commented 2 years ago

This Spike is hereby declared finished!