my understanding is that this ticket is to increase the storage for web-archiving-stacks, not SDR (i.e., sdr-preservation-core). (right, @nullhandle?)
i chatted with Kam about this, and our application will need to manage multiple NFS mount points and treat them as a single (logical) volume.
this is a common problem, in general, and others have solved it at the file-system level (e.g., ZFS) by presenting multiple disks as a single volume. however, the catch is that we need to do this with NFS mounts, not physical disks, so our requirements are different.
so, the next step is to investigate how sdr-preservation-core implements this, because it appears they already do it (they have ~10 NFS mounts for storage).
I have a draft email for the dev list here -- https://gist.github.com/drh-stanford/b037af8b560d313542639d07f8f6156d
@drh-stanford yes, we're talking about web archiving stacks here, not preservation core.
e-mail looks good. the forum to send it to is: https://groups.google.com/forum/#!forum/openwayback-dev.
ok, so I dug into SDR preservation core and it looks like the storage algorithm is driven by application logic in the moab gem. See https://github.com/sul-dlss/moab-versioning/blob/master/lib/moab/storage_repository.rb#L71-L92 for example.
it's a pretty simple algorithm that we could mimic in our own code, though not directly reuse.
The basics: it's configured with a list of "storage roots", i.e., NFS mount points, and it searches each mount point for a given druid. if it doesn't find the druid anywhere, it uses the last configured root as the default mount point into which new data is deposited.
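for illustration, a minimal ruby sketch of that search-then-default behavior (the constant and method names here are hypothetical, not the moab gem's actual API):

require 'pathname'

# all configured storage roots, i.e., NFS mount points (hypothetical paths)
STORAGE_ROOTS = %w[
  /web-archiving-stacks/data01
  /web-archiving-stacks/data02
].map { |dir| Pathname.new(dir) }

# search every configured root for an existing druid tree
def find_storage_root(druid_tree)
  STORAGE_ROOTS.find { |root| root.join(druid_tree).exist? }
end

# fall back to the last configured root for new deposits
def storage_root_for(druid_tree)
  find_storage_root(druid_tree) || STORAGE_ROOTS.last
end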
to add a new mount point (e.g., when the last one reaches ~75% full), it's a configuration change to the storage roots followed by a restart of the robots. for details see https://consul.stanford.edu/display/Jumbo/Adding+new+storage+mounts+to+Preservation+Core
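per the moab-versioning README, the storage roots are declared in a configure block roughly like this (the paths here are made up, and the exact keys may differ between versions):

Moab::Config.configure do
  # append new mount points at the end; the last entry receives new deposits
  storage_roots ['/services-disk01', '/services-disk02', '/services-disk03']
  storage_trunk 'sdr2objects'
end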
i should also note that SDR has a variety of custom admin tools that let them monitor disk usage and locate where a druid is stored across the storage array.
i looked into how cdx-server locates the WARC files on the file system, and there's a path-index.txt mapping file located at /web-archiving-stacks/data/indices/path/path-index.txt
for example:
ARCHIVEIT-1023-20090309170803-00000-crawling04.us.archive.org.warc.gz /web-archiving-stacks/data/collections/wx013sp9826/sd/008/df/6442/ARCHIVEIT-1023-20090309170803-00000-crawling04.us.archive.org.warc.gz
ARCHIVEIT-1023-20090309170804-00000-crawling04.us.archive.org.arc.gz /web-archiving-stacks/data/collections/wx013sp9826/sd/008/df/6442/ARCHIVEIT-1023-20090309170804-00000-crawling04.us.archive.org.arc.gz
ARCHIVEIT-1023-20090309171826-00001-crawling04.us.archive.org.warc.gz /web-archiving-stacks/data/collections/wx013sp9826/sd/008/df/6442/ARCHIVEIT-1023-20090309171826-00001-crawling04.us.archive.org.warc.gz
and it has 200k entries:
$ wc -l path-index.txt
197778 path-index.txt
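a lookup against this file is just a line scan; here's a hypothetical ruby sketch of the idea (the real implementation is OpenWayback's FlatFileResourceFileLocationDB, in Java):

# each line maps a (W)ARC file name to its absolute path on disk
def warc_path(index_file, warc_name)
  File.foreach(index_file) do |line|
    name, path = line.split(/\s+/, 2)
    return path.strip if name == warc_name
  end
  nil # not found in the index
end

warc_path('/web-archiving-stacks/data/indices/path/path-index.txt',
          'ARCHIVEIT-1023-20090309170803-00000-crawling04.us.archive.org.warc.gz')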
would it make sense to do a separate "collection" for each druid? https://github.com/iipc/openwayback/wiki/Configure-Multiple-Access-Points-For-Multiple-CDX-Collections
I found the configuration that we're using in the openwayback fork:
<bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
<property name="path" value="${wayback.basedir}/indices/path/path-index.txt" />
</bean>
<bean class="org.archive.wayback.resourceindex.cdx.CDXIndex">
<property name="path" value="${wayback.basedir}/indices/cdx/index.cdx" />
</bean>
Note that we're using Jenkins -- https://jenkinsqa.stanford.edu/job/Stanford%20OpenWayback/configure
if we did it per druid, and used the dynamic "WatchedCDXSource" approach (https://github.com/iipc/openwayback/wiki/WatchedCDXSource:-Dynamically-Adding-CDX-Indexes) ... wouldn't that work swimmingly?
Yes, I added that to my email to wayback-dev. It looks like WatchedCDXSource is the preferred method, but it doesn't seem more scalable to me -- would hundreds of .cdx files really perform better than one big one?
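for reference, the wiki page sketches a bean along these lines, which would replace the static CDXIndex above (the property names follow that page; the watched directory here is hypothetical):

<!-- watch a directory of .cdx files and pick up additions/removals dynamically -->
<bean class="org.archive.wayback.resourceindex.WatchedCDXSource">
  <property name="recursive" value="false" />
  <property name="filters">
    <list>
      <value>^.+\.cdx$</value>
    </list>
  </property>
  <property name="path" value="${wayback.basedir}/indices/cdx" />
</bean>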
See SAR-3338 for having a dedicated mount for the CDX indexes.
See https://github.com/sul-dlss/web-archiving/wiki/Storage for an initial writeup of the proposals.
Analysis is done, and now we need to flesh out the proposals that are not specific to web-archiving. See #20 for further work on this topic.
goal: accession another 7T of WARCs from the CDL server.
current thoughts: we have a single mount point that can't be grown, so can we have multiple mount points? is this the only/best solution?
jira ticket: DEVQUEUE-96