sul-dlss / web-archiving

placeholder for web archiving work

analyze best way(s) to increase stacks storage #11

Closed: ndushay closed this issue 7 years ago

ndushay commented 7 years ago

Goal: accession another 7 TB of WARC content from the CDL server.

current thoughts: we have a single mount point that can't be grown, so can we have multiple mount points? Is this the only/best solution?

jira ticket DEVQUEUE-96

This task is a blocker for accessioning the remaining CDL WAS content (less than 5 TB) and likely another ~15 TB of content still in Archive-It.

drh-stanford commented 7 years ago

my understanding is that this ticket is to increase the storage for web-archiving-stacks, not SDR (i.e., sdr-preservation-core). (Right, @nullhandle?)

i chatted with Kam about this, and we'll need our application to manage multiple NFS mount points and treat them as a single (logical) volume.

this is a common problem, in general, and others have solved it at the file-system level (e.g., ZFS) by presenting multiple disks as a single volume. however, the catch is that we need to do this with NFS mounts, not physical disks, so our requirements are different.

so, the next step is to investigate how sdr-preservation-core implements this, because it appears they already do it (they have ~10 NFS mounts for storage).

ndushay commented 7 years ago

for github ticket

drh-stanford commented 7 years ago

I have a draft email for the dev list here -- https://gist.github.com/drh-stanford/b037af8b560d313542639d07f8f6156d

nullhandle commented 7 years ago

@drh-stanford yes, we're talking about web archiving stacks here, not preservation core.

e-mail looks good. the forum to send it to is: https://groups.google.com/forum/#!forum/openwayback-dev.

drh-stanford commented 7 years ago

ok, so I dug into SDR preservation core and it looks like the storage algorithm is driven by application logic in the moab gem. See https://github.com/sul-dlss/moab-versioning/blob/master/lib/moab/storage_repository.rb#L71-L92 for example.

it's a pretty simple algorithm that we could mimic in our own code, though we can't directly reuse theirs.

The basics are that it's configured with all of the "storage roots" (i.e., NFS mount points), and it searches each mount point for a given druid. If it doesn't find the druid anywhere, it uses the last configured storage root as the default mount point into which new data is deposited.

to add a new mount point (e.g., when the last mount point is at 75% full), it's a configuration change for the storage roots and then a restart of the robots. for details see https://consul.stanford.edu/display/Jumbo/Adding+new+storage+mounts+to+Preservation+Core
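
For discussion, here is a minimal Ruby sketch of that search -- not moab's actual code. StorageRoots, find, and deposit_root are made-up names, the mount paths are hypothetical, and the druid-tree layout is only a placeholder:

    require 'pathname'

    # Sketch only: an ordered list of NFS mount points treated as one logical store.
    class StorageRoots
      def initialize(roots)
        @roots = roots.map { |r| Pathname.new(r) }   # e.g. %w[/mnt/stacks01 /mnt/stacks02]
      end

      # Search every configured root for an existing copy of the druid.
      def find(druid)
        @roots.find { |root| root.join(druid_tree_path(druid)).exist? }
      end

      # New content goes to the last configured root unless the druid already lives elsewhere.
      def deposit_root(druid)
        find(druid) || @roots.last
      end

      private

      # Placeholder directory layout (e.g. wx/013/sp/9826/wx013sp9826); not necessarily what we'd use.
      def druid_tree_path(druid)
        File.join(druid[0, 2], druid[2, 3], druid[5, 2], druid[7, 4], druid)
      end
    end

Usage would look roughly like StorageRoots.new(%w[/mnt/stacks01 /mnt/stacks02]).deposit_root('wx013sp9826'), with the root list coming from configuration so that adding a mount is just a config change plus a restart, as described above.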

drh-stanford commented 7 years ago

i should also note that SDR has a variety of custom admin tools that let them monitor disk storage usage and locate where a druid is stored across the storage array.

drh-stanford commented 7 years ago

i looked into how cdx-server locates the WARC files on the file system, and there's a path-index.txt mapping file located here: /web-archiving-stacks/data/indices/path/path-index.txt

for example:

ARCHIVEIT-1023-20090309170803-00000-crawling04.us.archive.org.warc.gz   /web-archiving-stacks/data/collections/wx013sp9826/sd/008/df/6442/ARCHIVEIT-1023-20090309170803-00000-crawling04.us.archive.org.warc.gz
ARCHIVEIT-1023-20090309170804-00000-crawling04.us.archive.org.arc.gz    /web-archiving-stacks/data/collections/wx013sp9826/sd/008/df/6442/ARCHIVEIT-1023-20090309170804-00000-crawling04.us.archive.org.arc.gz
ARCHIVEIT-1023-20090309171826-00001-crawling04.us.archive.org.warc.gz   /web-archiving-stacks/data/collections/wx013sp9826/sd/008/df/6442/ARCHIVEIT-1023-20090309171826-00001-crawling04.us.archive.org.warc.gz

and it has 200k entries:

$ wc -l path-index.txt 
197778 path-index.txt
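
For reference, resolving one of those filenames back to its full path is just a scan of that two-column file. A rough Ruby sketch (lookup_warc_path is a made-up helper, and whitespace-delimited columns are assumed from the sample above):

    # Look up the on-disk path for a WARC/ARC filename in path-index.txt.
    PATH_INDEX = '/web-archiving-stacks/data/indices/path/path-index.txt'.freeze

    def lookup_warc_path(filename, index_file = PATH_INDEX)
      File.foreach(index_file) do |line|
        name, path = line.chomp.split(/\s+/, 2)
        return path if name == filename
      end
      nil # filename not present in the index
    end

One apparently nice property for the storage question: since every entry carries a full absolute path, the WARCs wouldn't have to live under a single mount point; only this index would need to know where each one went.
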
ndushay commented 7 years ago

would it make sense to do a separate "collection" for each druid? https://github.com/iipc/openwayback/wiki/Configure-Multiple-Access-Points-For-Multiple-CDX-Collections

drh-stanford commented 7 years ago

I found the configuration that we're using in the openwayback fork:

  <!-- Maps each WARC/ARC filename to its location on disk via the flat path-index.txt file -->
  <bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
    <property name="path" value="${wayback.basedir}/indices/path/path-index.txt" />
  </bean>

  <!-- Single monolithic CDX index used for capture lookups -->
  <bean class="org.archive.wayback.resourceindex.cdx.CDXIndex">
    <property name="path" value="${wayback.basedir}/indices/cdx/index.cdx" />
  </bean>

Note that we're using Jenkins -- https://jenkinsqa.stanford.edu/job/Stanford%20OpenWayback/configure

ndushay commented 7 years ago

if we did it by the druid, and used the dynamic "watchedCDXSource" approach https://github.com/iipc/openwayback/wiki/WatchedCDXSource:-Dynamically-Adding-CDX-Indexes ... wouldn't that work swimmingly?

drh-stanford commented 7 years ago

Yes, I added that to my email to wayback-dev. It looks like WatchedCDXSource is the preferred method, but it doesn't seem any more scalable to me -- would it hold up if we had 100s of .cdx files rather than one big one?
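
For comparison: if each newly accessioned druid comes with its own .cdx (as proposed above) but we stick with the single big index, something has to keep re-merging those files into the index.cdx that the CDXIndex bean points at. A rough Ruby sketch of that merge step (the directory layout, helper name, and header handling are all assumptions; whether the merged file needs its own header line is something to confirm against our existing index.cdx):

    require 'fileutils'

    # Merge per-druid .cdx files into one big, line-sorted index.
    def merge_cdx(per_druid_dir, merged_path)
      cdx_files = Dir.glob(File.join(per_druid_dir, '**', '*.cdx')).sort
      lines = cdx_files.flat_map do |f|
        File.readlines(f).reject { |l| l.start_with?(' CDX') } # drop per-file header lines
      end
      tmp_path = "#{merged_path}.tmp"
      # CDX lookups depend on the index being line-sorted; for real data sizes we'd
      # shell out to `sort -m` rather than sorting everything in memory.
      File.open(tmp_path, 'w') { |out| lines.sort.each { |l| out.write(l) } }
      FileUtils.mv(tmp_path, merged_path) # swap in place so wayback never reads a partial index
    end

Either way the per-druid indexes have to come from somewhere, so the WatchedCDXSource question is mostly whether we keep paying this merge cost or let wayback watch the directory directly.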

drh-stanford commented 7 years ago

See SAR-3338 for having a dedicated mount for the CDX indexes.

drh-stanford commented 7 years ago

See https://github.com/sul-dlss/web-archiving/wiki/Storage for an initial write-up of the proposals.

drh-stanford commented 7 years ago

Analysis is done, and now we need to flesh out the proposals that are not specific to web-archiving. See #20 for further work on this topic.