Closed by justinlittman 2 years ago.
migration-utils can use either the native fcrepo3 filesystem or a directory of exported FOXML data. Since I had a directory of 1000 FOXML files on my workstation that were generated with the archive-fedora script, I tested with that. The XML files need to be decompressed for migration-utils to be able to read them.
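The decompression step is simple but easy to overlook. As a sketch (assuming the export is gzip-compressed; the /tmp path and file names below are illustrative, not the real export):

```python
import gzip
import shutil
from pathlib import Path

# Illustrative stand-in for the exported FOXML directory.
archive = Path("/tmp/foxml-demo/archive")
archive.mkdir(parents=True, exist_ok=True)
(archive / "druid-bb059dp5973.xml.gz").write_bytes(
    gzip.compress(b"<foxml:digitalObject/>"))

# Decompress every .gz in place so migration-utils can read the XML directly.
for gz in archive.glob("*.gz"):
    with gzip.open(gz, "rb") as src, open(gz.with_suffix(""), "wb") as dst:
        shutil.copyfileobj(src, dst)
    gz.unlink()  # remove the compressed original

print(sorted(p.name for p in archive.iterdir()))
# ['druid-bb059dp5973.xml']
```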
Here is the command I used:
/usr/bin/java -jar migration-utils-6.1.0-driver.jar \
--source-type exported \
--target-dir ocfl \
--exported-dir archive \
--no-checksum-validation \
--continue-on-error
The --continue-on-error flag was needed because the run hit 20 errors like this (without it, the tool stops at the first one):
ERROR 13:53:38.765 (Migrator) MIGRATION_FAILURE: pid="druid:bb059dp5973", message="text"
java.lang.NullPointerException: text
at java.base/java.util.Objects.requireNonNull(Objects.java:233)
at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1955)
at java.base/java.time.Instant.parse(Instant.java:399)
at org.fcrepo.migration.handlers.ocfl.ArchiveGroupHandler.createDatastreamHeaders(ArchiveGroupHandler.java:618)
at org.fcrepo.migration.handlers.ocfl.ArchiveGroupHandler.processObjectVersions(ArchiveGroupHandler.java:256)
...
Also, --no-checksum-validation was needed to avoid over a thousand warnings like:
WARN 13:55:40.246 (ArchiveGroupHandler) info:fedora/druid:bb052vc4171/descMetadata: missing/invalid digest. Writing resource & continuing
Maybe these problems are an artifact of working with the FOXML export?
It took 74 seconds to convert 1000 FOXML files. The directory of uncompressed FOXML files was 190M, but the resulting OCFL directory was 1.7G!
I think next I'll try working with the native fcrepo3 filesystem, which will need to be done on sul-dor-migrate instead of sdr-deploy.
I ticketed the NullPointerException over in the migration-utils issue tracker in case there's something obvious going on here:
Getting migration-utils working on sul-dor-migrate is a little bit tricky. Java (OpenJDK Runtime Environment, build 1.8.0_322-b06) is already available, but when I try to run the jar I get an error:
[lyberadmin@sul-dor-migrate ocfl]$ java -jar migration-utils-6.1.0-driver.jar
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/fcrepo/migration/PicocliMigrator has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:601)
It looks like there's a Java version mismatch that needs to be resolved before the jar can be used: class file version 55.0 corresponds to Java 11, while version 52.0 is Java 8. I did try to build migration-utils myself against the Java on sul-dor-migrate by installing Maven, but the version of Maven that is yum-installable isn't up to 3.1 yet, which migration-utils requires.
Read this ticket since it was linked to the one you opened against migration-utils. A couple of notes:
Thank you @pwinckles! My bad, I had read the Java 11 requirement and for some reason interpreted Java 1.8 as v18.
If there's any information I can provide about the foxml or ocfl that might help explain the size increase please let me know.
You could execute something like the following in the ocfl root directory:
find . -type f -printf "%s\t%f\n" | awk '{c[$2] += 1; s[$2] += $1} END {for (f in c) printf("%s\t%-9s\t%s\n", c[f], s[f], f)}' | sort -k2nr
The first column in the output is the number of files with the same name, the second column is the total number of bytes used by all files with that name, and the final column is the file name.
It sounds like your sample migration set was 1000 objects, yes? Do you have any idea off hand if datastreams in those objects were heavily versioned? Were all of the datastreams internal to the FOXML? If not, did the number you gave include size of the datastreams that are outside the FOXML?
You can find my current experiment on sul-dor-migrate in /data/ocfl, where I've dropped a local Java 18 environment that doesn't interfere with the system Java. Unfortunately I ran into a different kind of error when running against 1000 of our objects using the fcrepo3 filesystem.
java -jar migration-utils-6.1.0-driver.jar \
--source-type legacy \
--datastreams-dir /home/lyberadmin/apps/fedora/home/data/datastreamStore \
--objects-dir /home/lyberadmin/apps/fedora/home/data/objectStore \
--target-dir ocfl \
--limit 1000 \
--continue-on-error \
--debug
The recurring error is:
ERROR 05:05:41.716 (Migrator) MIGRATION_FAILURE: pid="druid:cf604sc3425", message="Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!"
java.lang.RuntimeException: Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!
at org.fcrepo.migration.foxml.DirectoryScanningIDResolver.resolveInternalID(DirectoryScanningIDResolver.java:129)
at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor$Foxml11DatastreamVersion.<init>(FoxmlInputStreamFedoraObjectProcessor.java:401)
at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor.processObject(FoxmlInputStreamFedoraObjectProcessor.java:194)
at org.fcrepo.migration.Migrator.run(Migrator.java:161)
at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:328)
at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:51)
at picocli.CommandLine.executeUserObject(CommandLine.java:1743)
at picocli.CommandLine.access$900(CommandLine.java:145)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2101)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2068)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:1935)
at picocli.CommandLine.execute(CommandLine.java:1864)
at org.fcrepo.migration.PicocliMigrator.main(PicocliMigrator.java:175)
I dropped an issue into the migration-utils issue tracker to just double check that I'm not doing something wrong:
I think the Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0" error is the result of the flawed copy of the fcrepo3 filesystem on sul-dor-migrate.
Actually I take it back, what is flawed is my understanding of how the fcrepo3 filesystem works. I had thought the datastreamStore subdirectory was named using the first two characters in the pid, but on closer inspection I can see this is not the case.
I got some clarity about what fcrepo3 storage we are using (akubra). It also matters how you run migration-utils initially, because it generates an index for efficient datastream lookups; if the index was generated with the wrong storage type, lookups will always fail.
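For posterity, this is my rough understanding of why the datastreamStore directory names aren't derived from the pid: with akubra, blobs are bucketed by a hash of the internal id, not by the id itself. A sketch of that mapping (the "##" pattern, MD5, and URL-encoded filename are my assumptions about the hash-path id mapper, not verified against our config):

```python
import hashlib
import urllib.parse

def akubra_path(internal_id: str, pattern: str = "##") -> str:
    """Sketch of a hash-path id mapper: directory names come from the hex
    MD5 of the internal blob id (per the configured pattern), and the
    filename is the URL-encoded id itself."""
    digest = hashlib.md5(internal_id.encode("utf-8")).hexdigest()
    dirs, offset = [], 0
    for segment in pattern.split("/"):
        dirs.append(digest[offset:offset + len(segment)])
        offset += len(segment)
    filename = urllib.parse.quote(internal_id, safe="")
    return "/".join(dirs + [filename])

print(akubra_path("info:fedora/druid:bb052vc4171/descMetadata/descMetadata.0"))
```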
I'm timing a run with a sample of 50k instead of 250k to ensure I don't overrun available disk space on sul-dor-migrate (415G on /data).
The command I ran was:
java -jar migration-utils-6.2.0-SNAPSHOT-driver.jar \
--source-type akubra \
--datastreams-dir /home/lyberadmin/apps/fedora/home/data/datastreamStore \
--objects-dir /home/lyberadmin/apps/fedora/home/data/objectStore \
--target-dir ocfl \
--limit 50000 \
--continue-on-error \
--migration-type PLAIN_OCFL \
--foxml-file \
--debug \
> convert.log
It took 10.5 hours to convert 50,000 records, and the resulting OCFL tree used 35GB of space. I didn't monitor iostat during the conversion, but I suspect the time was mostly spent reading and writing to sf5-webapp-dev:/sul_dor_migrate_data rather than being bound by the tool itself.
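Since the original ask was an estimate for 4.3M objects, here is a naive linear extrapolation from this 50k sample (it assumes the sample is representative and that throughput stays constant, which the NFS mount may not honor at scale):

```python
# Measured on the 50k-object sample run.
sample_objects = 50_000
sample_hours = 10.5
sample_gb = 35

total_objects = 4_300_000
scale = total_objects / sample_objects  # 86x

est_hours = sample_hours * scale
est_tb = sample_gb * scale / 1024

print(f"estimated time: {est_hours:.0f} hours (~{est_hours / 24:.0f} days)")
print(f"estimated space: {est_tb:.1f} TB")
# estimated time: 903 hours (~38 days)
# estimated space: 2.9 TB
```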
There were 431 errors that all looked to be the result of a failure to look up a datastream ID:
INFO 16:21:00.892 (Migrator) Processing "druid:bc359gg4344"...
ERROR 16:21:00.896 (Migrator) MIGRATION_FAILURE: pid="druid:bc359gg4344", message="Unable to resolve internal ID "druid:bc359gg4344+rightsMetadata+rightsMetadata.0"!"
java.lang.RuntimeException: Unable to resolve internal ID "druid:bc359gg4344+rightsMetadata+rightsMetadata.0"!
at org.fcrepo.migration.foxml.DirectoryScanningIDResolver.resolveInternalID(DirectoryScanningIDResolver.java:129)
at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor$Foxml11DatastreamVersion.<init>(FoxmlInputStreamFedoraObjectProcessor.java:404)
at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor.processObject(FoxmlInputStreamFedoraObjectProcessor.java:195)
at org.fcrepo.migration.Migrator.run(Migrator.java:161)
at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:328)
at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:51)
at picocli.CommandLine.executeUserObject(CommandLine.java:1743)
at picocli.CommandLine.access$900(CommandLine.java:145)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2101)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2068)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:1935)
at picocli.CommandLine.execute(CommandLine.java:1864)
at org.fcrepo.migration.PicocliMigrator.main(PicocliMigrator.java:175)
These appear to be legitimate errors, because the datastream is actually found under a slightly different internal ID: druid:bc359gg4344/rightsMetadata/rightsMetadata.0 instead of druid:bc359gg4344+rightsMetadata+rightsMetadata.0 (note the slashes instead of plusses).
If the OCFL export were desirable enough at this point, I guess we could fix the incorrect FOXML, or we could rig migration-utils to update it as it goes. Otherwise these objects would not be present in the export.
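If we did rig something up, the rewrite itself would be trivial. A sketch of a hypothetical helper (the name and approach are mine, not part of the migration-utils API) that converts the '+'-separated internal ID recorded in our FOXML into the '/'-separated form that actually resolves:

```python
def fix_internal_id(internal_id: str) -> str:
    """Hypothetical helper: rewrite '+' separators after the pid to '/',
    e.g. druid:bc359gg4344+rightsMetadata+rightsMetadata.0 ->
         druid:bc359gg4344/rightsMetadata/rightsMetadata.0"""
    pid, sep, rest = internal_id.partition("+")
    if not sep:
        return internal_id  # already in the resolvable form
    return pid + "/" + rest.replace("+", "/")

print(fix_internal_id("druid:bc359gg4344+rightsMetadata+rightsMetadata.0"))
# druid:bc359gg4344/rightsMetadata/rightsMetadata.0
```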
It also looks like there were 533,943 warnings like this:
WARN 16:20:47.214 (ArchiveGroupHandler) info:fedora/druid:bb022jb1472/contentMetadata: missing/invalid digest. Writing resource & continuing.
These warnings did not prevent the objects from being written. It wasn't clear to me whether the digests were missing or invalid, but I assumed the former given the sheer number of them.
@andrewjbtw if you would like to examine the resulting OCFL tree you should be able to find it at:
sul-dor-migrate.stanford.edu:/data/ocfl/ocfl
It is kind of nice because all the versions are laid out on the filesystem:
[lyberadmin@sul-dor-migrate 650]$ tree 443d0c650c5ac35dfd9b681a9b490d812ab366381d9e887d95322406b17307cd/
443d0c650c5ac35dfd9b681a9b490d812ab366381d9e887d95322406b17307cd/
├── 0=ocfl_object_1.0
├── inventory.json
├── inventory.json.sha512
├── v1
│ ├── content
│ │ ├── AUDIT
│ │ ├── DC
│ │ └── FOXML
│ ├── inventory.json
│ └── inventory.json.sha512
├── v10
│ ├── inventory.json
│ └── inventory.json.sha512
├── v11
│ ├── content
│ │ └── identityMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v12
│ ├── content
│ │ └── events
│ ├── inventory.json
│ └── inventory.json.sha512
├── v13
│ ├── content
│ │ └── provenanceMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v14
│ ├── content
│ │ └── rightsMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v15
│ ├── inventory.json
│ └── inventory.json.sha512
├── v16
│ ├── content
│ │ └── provenanceMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v17
│ ├── content
│ │ └── contentMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v18
│ ├── content
│ │ └── rightsMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v19
│ ├── content
│ │ └── descMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v2
│ ├── content
│ │ └── RELS-EXT
│ ├── inventory.json
│ └── inventory.json.sha512
├── v20
│ ├── content
│ │ └── versionMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v21
│ ├── inventory.json
│ └── inventory.json.sha512
├── v22
│ ├── content
│ │ └── provenanceMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v23
│ ├── inventory.json
│ └── inventory.json.sha512
├── v3
│ ├── content
│ │ └── workflows
│ ├── inventory.json
│ └── inventory.json.sha512
├── v4
│ ├── content
│ │ └── identityMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v5
│ ├── content
│ │ └── RELS-EXT
│ ├── inventory.json
│ └── inventory.json.sha512
├── v6
│ ├── content
│ │ └── descMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v7
│ ├── content
│ │ └── rightsMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
├── v8
│ ├── content
│ │ └── contentMetadata
│ ├── inventory.json
│ └── inventory.json.sha512
└── v9
├── content
│ └── technicalMetadata
├── inventory.json
└── inventory.json.sha512
I think I can call this SPIKE done for now.
Reopening based on conversation in Slack about whether our PID (DRUID) can be the key used to generate the OCFL path or not. It kind of defeats the purpose of OCFL if we can't use our ID to quickly look up objects?
Check the --id-prefix option.
@pwinckles thanks for the pointer! Just to clarify in case it wasn't clear from the above: we currently have Fedora PIDs that look like druid:bk854wj7180, and when migration-utils runs it creates an OCFL Object Root (in this case) of:
ocfl/aa5/6b6/afc/aa56b6afcf3026c8802b93bda2f20a89d23189cdc61f64140450baf18746ee5a
We were really hoping we could use the PID to create the OCFL Object Root instead of what looks like a hash digest of the object id. So in this case some variation of:
ocfl/bk8/54w/j71/bk854wj7180
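For context on where that long hex path appears to come from: it matches the shape of the OCFL hashed n-tuple storage layout, where the path is derived from a SHA-256 digest of the object id rather than from the id itself. A sketch (assuming tuple size 3 and 3 tuples; the input id is illustrative):

```python
import hashlib

def hashed_ntuple_path(object_id: str, tuple_size: int = 3,
                       number_of_tuples: int = 3) -> str:
    """Sketch of a hashed n-tuple layout: the object id is SHA-256 hashed,
    and the hex digest is split into leading tuples plus the full digest."""
    digest = hashlib.sha256(object_id.encode("utf-8")).hexdigest()
    tuples = [digest[i * tuple_size:(i + 1) * tuple_size]
              for i in range(number_of_tuples)]
    return "/".join(tuples + [digest])

print(hashed_ntuple_path("info:fedora/druid:bk854wj7180"))
```

This is why the druid can't be used to locate an object under the default layout: you'd have to hash it first.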
It looks like --id-prefix just controls whether info:fedora/ is included in the id property found in the object inventory (inventory.json)?
Yes, that's correct. --id-prefix only affects the OCFL object id and does not affect the OCFL storage layout. It is possible to change the layout, but doing so is a little more involved.
I believe this is the storage layout that you're wanting: https://ocfl.github.io/extensions/0007-n-tuple-omit-prefix-storage-layout.html
In order to use this layout, you have to set up the OCFL storage root manually. First, in the storage root, create the namaste file 0=ocfl_1.0 and a file ocfl_layout.json with the following content:
{
"extension" : "0007-n-tuple-omit-prefix-storage-layout",
"description" : "This storage root extension describes an OCFL storage layout combining a pairtree-like root directory structure derived from prefix-omitted object identifiers, followed by the prefix-omitted object identifier themselves. The OCFL object identifiers are expected to contain prefixes which are removed in the mapping to directory names. The OCFL object identifier prefix is defined as all characters before and including a configurable delimiter. Where the prefix-omitted identifier length is less than tuple size * number of tuples, the remaining object id (prefix omitted) is left or right-side, zero-padded (configurable, left default), or not padded (none), and optionally reversed (default false). The object id is then divided into N n-tuple segments, and used to create nested paths under the OCFL storage root, followed by the prefix-omitted object id directory."
}
Then create extensions/0007-n-tuple-omit-prefix-storage-layout/config.json with the following content:
{
"delimiter" : ":",
"tupleSize" : 3,
"numberOfTuples" : 3,
"zeroPadding" : "left",
"reverseObjectRoot" : false,
"extensionName" : "0007-n-tuple-omit-prefix-storage-layout"
}
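With that config, the mapping works like this (a sketch of the 0007 extension's algorithm, not its reference implementation; the "none" padding option is omitted for brevity):

```python
def n_tuple_omit_prefix_path(object_id: str, delimiter: str = ":",
                             tuple_size: int = 3, number_of_tuples: int = 3,
                             zero_padding: str = "left") -> str:
    """Sketch of 0007-n-tuple-omit-prefix-storage-layout: drop everything up
    to and including the last delimiter, split the remainder into n-tuples
    (zero-padding short ids), then append the prefix-omitted id itself."""
    _, _, bare = object_id.rpartition(delimiter)
    span = tuple_size * number_of_tuples
    padded = bare
    if len(bare) < span:
        padded = (bare.rjust(span, "0") if zero_padding == "left"
                  else bare.ljust(span, "0"))
    tuples = [padded[i:i + tuple_size] for i in range(0, span, tuple_size)]
    return "/".join(tuples + [bare])

print(n_tuple_omit_prefix_path("druid:bk854wj7180"))
# bk8/54w/j71/bk854wj7180
```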
Now, migration-utils will not create a new OCFL repository; it will write to the one you manually created, using the layout you specified.
However, I suspect it will not work correctly without upgrading the version of ocfl-java. It's currently using version 1.4.3, which has a bug affecting this particular layout. You should be able to override the version by adding the following to the migration-utils pom:
<dependency>
<groupId>edu.wisc.library.ocfl</groupId>
<artifactId>ocfl-java-core</artifactId>
<version>1.4.6</version>
</dependency>
We will get the version Fedora is using updated sometime before the next release.
[Edit] To be clear, if you want that layout you need to both set --id-prefix to an empty string and manually set the OCFL storage layout.
Thanks @pwinckles -- I think I followed your instructions above, but I ran into this error when running with the new jar:
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions prior to 1.8.
SLF4J: Ignoring binding found at [jar:file:/data/ocfl/migration-utils-6.2.0-SNAPSHOT-driver.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#ignoredBindings for an explanation.
class org.slf4j.helpers.NOPLoggerFactory cannot be cast to class ch.qos.logback.classic.LoggerContext (org.slf4j.helpers.NOPLoggerFactory and ch.qos.logback.classic.LoggerContext are in unnamed module of loader 'app')
java.lang.ClassCastException: class org.slf4j.helpers.NOPLoggerFactory cannot be cast to class ch.qos.logback.classic.LoggerContext (org.slf4j.helpers.NOPLoggerFactory and ch.qos.logback.classic.LoggerContext are in unnamed module of loader 'app')
at org.fcrepo.migration.PicocliMigrator.setDebugLogLevel(PicocliMigrator.java:201)
at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:211)
at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:51)
at picocli.CommandLine.executeUserObject(CommandLine.java:1743)
at picocli.CommandLine.access$900(CommandLine.java:145)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2101)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2068)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:1935)
at picocli.CommandLine.execute(CommandLine.java:1864)
at org.fcrepo.migration.PicocliMigrator.main(PicocliMigrator.java:175)
Maybe there's something else that needs to be added to the pom?
Try it with:
<dependency>
<groupId>edu.wisc.library.ocfl</groupId>
<artifactId>ocfl-java-core</artifactId>
<version>1.4.6</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
This worked perfectly @pwinckles!
I'm going to re-run the 50k test with these options.
Since the setup got a bit more complicated I've created a repo to track the steps:
This Spike is hereby declared finished!
Test migration-utils as an alternative to bin/archive-fedora. For this test, export 250K objects to OCFL using the sdr-deploy server. Please estimate how much space and time would be required for 4.3M objects.