uchicago-library / digcoll_retriever

A retriever meant to allow API access to contents of owncloud for arbitrary collection or exhibition interfaces
GNU General Public License v3.0

need a hierarchical identification structure for contents of ownCloud #1

Open verbalhanglider opened 6 years ago

verbalhanglider commented 6 years ago

Starting with this assumption:

mvol/0001/0002/0003 which would refer to mvol publication 1, volume 2, issue 3

Such that, for mvol/0001/0002/0003:

  • /pdf => return the PDF file
  • /tiffs => return the tiffs for that issue
  • /jejocr => return the OCR file that is needed by XTF to create the index needed by the interface
  • /ocr => return OCR data generated by preservation department
  • /jpg => return dynamically generated JPEG file

Each suffix would retrieve relevant items from that issue.

bnbalsamo commented 6 years ago

I've done rather a lot of thinking about this identifier business since our meeting, and I'd like to lay it out because I think the issue may be more nuanced than originally believed.

I believe we have two major ways we can view identifiers:

1) Hierarchical listings

mvol/0001/0002/0003 where it is possible to get a listing from mvol/0001 which includes 0002, and potentially 0001, 0004, etc., mapped respectively to mvol/0001/0002, mvol/0001/0001, mvol/0001/0004.

This assumption results in a sub decision that must be made, which will inform how I define the URL rules.

These two options can be roughly described as either using the URL as a proxy for the path, or URL-encoding a path and passing it to an endpoint. Because URLs != paths in certain respects, each provides different pros and cons.
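The distinction can be sketched quickly in Python ($HOST is just a placeholder, as in the URL examples below):

```python
from urllib.parse import quote, unquote

identifier = "mvol/0001/0002/0003"

# URL as a proxy for the path: the identifier's slashes are URL structure
path_style_url = "http://$HOST/" + identifier + "/jpg"

# URL-encoded path passed as a single path segment
encoded = quote(identifier, safe="")  # safe="" encodes the slashes too
encoded_style_url = "http://$HOST/" + encoded + "/jpg"

print(path_style_url)     # http://$HOST/mvol/0001/0002/0003/jpg
print(encoded_style_url)  # http://$HOST/mvol%2F0001%2F0002%2F0003/jpg
assert unquote(encoded) == identifier  # decoding round-trips losslessly
```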

1.1) URLs in the form http://$HOST/mvol/0001/0002/0003/jpg

This assumption mandates hierarchical identifiers of the same depth for everything, e.g. each identifier must be a namespace followed by three subparts. The resulting flask URL rule and endpoint would look something like....

class NavDepthNamespace(Resource):
    def get(self, namespace):
        # pathlib has no join(); os.path.join builds the filesystem path
        nav_path = os.path.join(BLUEPRINT.config['OWNCLOUD_ROOT'], namespace)
        # Go get and return a listing of subnodes for this path

class GetJpg(Resource):
    def get(self, namespace, id_part1, id_part2, id_part3):
        resource_path = os.path.join(
                            BLUEPRINT.config['OWNCLOUD_ROOT'],
                            namespace,
                            id_part1,
                            id_part2,
                            id_part3
        )
        # Do things with the path, etc etc

API.add_resource(
    GetJpg, 
    "/<string:namespace>/<string:id_part1>/<string:id_part2>/<string:id_part3>/jpg"
)

This solution would also call for hardcoding endpoints similar to this, for navigability:

API.add_resource(
    NavDepthNamespace,
    "/<string:namespace>"
)

API.add_resource(
    NavDepth1,
    "/<string:namespace>/<string:id_part1>"
)

API.add_resource(
    NavDepth2,
    "/<string:namespace>/<string:id_part1>/<string:id_part2>"
)

API.add_resource(
    NavDepth3,
    "/<string:namespace>/<string:id_part1>/<string:id_part2>/<string:id_part3>"
)

1.2) URLs in the form http://$HOST/mvol%2F0001%2F0002%2F0003/jpg

This assumption means that identifier "paths" can be of arbitrary depth. Its implementation would look more like...

class Nav(Resource):
    def get(self, quoted_identifier):
        hierarchical_identifier = urllib.parse.unquote(quoted_identifier)
        # Take the identifier hierarchy, navigate it, return child nodes

class GetJpg(Resource):
    def get(self, quoted_identifier):
        hierarchical_identifier = urllib.parse.unquote(quoted_identifier)
        # Take the identifier, go find the thing at it, return/create and return the JPG

API.add_resource(
    Nav,
    "/<string:quoted_identifier>"
)

API.add_resource(
    GetJpg,
    "/<string:quoted_identifier>/jpg"
)

2) Atomic identifiers

mvol/0001/0002/0003 is a single unit, mvol/0001/0002 is (most likely) an invalid resource identifier, or potentially another unrelated resource. In this case identifiers are not hierarchical, and therefore not intrinsically navigable.

For safety's sake I would again recommend URL-escaping the identifiers for transit, though it wouldn't necessarily be required if we avoided certain characters. My examples will operate with the assumption that we would URL encode them.

This results in the simplest implementation

class GetJpg(Resource):
    def get(self, quoted_identifier):
        identifier = urllib.parse.unquote(quoted_identifier)
        # Take the identifier, go find the thing at it, return/create and return the JPG

API.add_resource(
    GetJpg,
    "/<string:quoted_identifier>/jpg"
)

The big change here is that, because the identifiers are no longer navigable and creating a flat listing of all identifiers would be too time-consuming to do dynamically, @johnjung would be reliant on manifests created by @c-blair / kfeeney / kar8 in order to know what to populate interfaces with. These manifests would have to include every complete resource identifier. Extrapolation of further organizational meaning from the identifier would be left to the interface (e.g., grouping by 0001/0002 or what have you).

This manifest would be very similar to the manifests we are currently using in the validation workflows, and in fact could probably be used for the same purpose in this system.
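As a sketch of what that could look like (the manifest format here is purely illustrative, not an existing file):

```python
import json

# Hypothetical flat manifest: every complete resource identifier, one per entry.
manifest_json = """
{
  "identifiers": [
    "mvol/0001/0002/0003",
    "mvol/0001/0002/0004",
    "mvol/0001/0003/0001"
  ]
}
"""

manifest = json.loads(manifest_json)

# Extrapolating organizational meaning is left to the interface, e.g. grouping
# everything under mvol/0001/0002 by simple prefix matching.
issues_of_volume_2 = [
    i for i in manifest["identifiers"] if i.startswith("mvol/0001/0002/")
]
print(issues_of_volume_2)
```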

Conclusion

In order of preference I'd pull for the solutions in this order:

1. Option 2
2. Option 1.2
3. Option 1.1

My preference for solution two is driven by a few factors.

1) It's the least functionality all rolled up into one tool. 2) It makes the most sense for identifiers that don't currently have hierarchies, e.g. the finding aid material.

What are others' thoughts on this? @c-blair @johnjung @verbalhanglider - We might also consider looping in kefeeney, kar8, and jdartt on discussion of manifests, if we determine to go that route.

bnbalsamo commented 6 years ago

Another point in favor of the atomic identifiers, which just occurred to me, is that they handle the case of the PDFs having one fewer identifier component than the tiffs/jpgs just fine as well. This would complicate the URL matching a lot in case 1.1, and a little in case 1.2.
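To illustrate (the PDF identifier depth here is hypothetical): URL-encoded atomic identifiers of different depths all collapse into a single path segment, so one URL rule covers both cases:

```python
from urllib.parse import quote

# Hypothetical identifiers: suppose the PDF for an issue has one fewer
# component than the page-level images.
pdf_id = "mvol/0001/0002"
jpg_id = "mvol/0001/0002/0003"

# Under option 2 (or 1.2) each identifier becomes one encoded path segment,
# so a single rule like "/<string:quoted_identifier>/pdf" matches either depth.
print("/" + quote(pdf_id, safe="") + "/pdf")   # /mvol%2F0001%2F0002/pdf
print("/" + quote(jpg_id, safe="") + "/jpg")   # /mvol%2F0001%2F0002%2F0003/jpg
```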

bnbalsamo commented 6 years ago

A bit of end of the day inspiration has struck.

So long as we went with URL escaping the identifiers, it would be possible to hit a sort of middle ground between 1.2 and 2 by implementing something like....

class Nav(Resource):
    def get(self, quoted_identifier):
        identifier = urllib.parse.unquote(quoted_identifier)
        # query (an optional?) hierarchical info storage system and return
        # a listing of immediate sub-identifiers, otherwise return None?

API.add_resource(
    Nav,
    "/<string:quoted_identifier>/nav"
)

This may introduce quite a bit of complexity under the hood if we implement many different hierarchical specifications/implementations, but it might be a good way to split the difference and leave the system extensible.
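As a sketch of how that could stay pluggable (everything below is hypothetical, not existing digcoll_retriever code), the Nav endpoint could delegate to whatever hierarchy store is configured and return None when an identifier has no hierarchy information:

```python
# Hypothetical pluggable hierarchy backend: the simplest possible store is a
# dict mapping an identifier to its immediate sub-identifiers. A database or
# manifest-backed implementation could expose the same children() interface.
class DictHierarchy:
    def __init__(self, tree):
        self.tree = tree

    def children(self, identifier):
        # None signals "no hierarchy info for this identifier"
        return self.tree.get(identifier)

hierarchy = DictHierarchy({
    "mvol/0001": ["mvol/0001/0001", "mvol/0001/0002"],
    "mvol/0001/0002": ["mvol/0001/0002/0003"],
})

print(hierarchy.children("mvol/0001"))   # ['mvol/0001/0001', 'mvol/0001/0002']
print(hierarchy.children("apf1/00001"))  # None -> not navigable
```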

verbalhanglider commented 6 years ago

In the case of atomic identifiers, couldn't we have varying kinds of ids? That is, some point to a specific file, others to some group of files.

These groupings could be defined in some other system, like the one you described in the last comment; another system like the arbitrary groupings system from a year ago.

c-blair commented 6 years ago

Currently:

  • a persistent link to a page of a Maroon issue: http://pi.lib.uchicago.edu/1001/dig/campub/mvol-0004-1902-1001/44
  • a link to that volume: http://pi.lib.uchicago.edu/1001/dig/campub/mvol-0004-1902-1001
  • a link to the title: http://pi.lib.uchicago.edu/1001/dig/campub/mvol-0004

The last one doesn't work, but that's just a bug in John's code; it should.

I'm not sure I understand the context for this discussion. Something like mvol-0004[etc.] needs to be "namespaced". This is what "campub" in the above does. "1001" simply reflects our Handle System (TM) handle. An ARK identifier would have 61001 instead. (We're still using ARKs for LDR/DA?)


-- Charles Blair Director, Digital Library Development Center University of Chicago Library 1 773 702 8459 | chas@uchicago.edu | http://www.lib.uchicago.edu/~chas/

verbalhanglider commented 6 years ago

We have moved away from ARKs because the real-world workflows between the producers of digitized content, the storage, and the display of content in collection/exhibit web sites do not meet the requirements for ARKs to be a realistic or adequate solution to our needs.

Instead, we are going with the decision we agreed to yesterday that identifiers are atomic and reference a particular thing whether that thing is a particular file or a collection of files like an issue of a magazine or a particular book.

Is there a listing of all namespaces currently used in the pi.lib system? Can you post that to this thread, @c-blair?

c-blair commented 6 years ago

I said ARKs in the LDR/DA. Your response is about the LDR/DC. Did you intend it to be wider?

A page to know about: https://dldc.lib.uchicago.edu/dl/. On that page, kept up to date, is https://dldc.lib.uchicago.edu/dl/909registry.html.


verbalhanglider commented 6 years ago

There will actually be two separate identifier schemes: one for the LDR/DA and one for the LDR/DC. Neither can use ARKs.

LDR/DC can't use ARKs because current workflows mean metadata is too varied across all the collections in the digital collections pool, and because, where ARKs seem to assume a given file will have a descriptive record, very few of the files in LDR/DC have description of any kind.

LDR/DA can't use ARKs because none of the files in that pool have any kind of descriptive metadata. What metadata can be assumed is strictly technical metadata about the files.

LDR/DA identifiers are UUIDs generated on ingest.

LDR/DC identifiers need to be formalized so that a system can be created that is transparent to all parties involved (current and future) and so that automated workflows can be created for sharing assets, whether as a particular file or as a collection of files (like the issue use case stated earlier).

bnbalsamo commented 6 years ago

in the case of atomic identifiers, couldn't we have varying kinds of ids. that is some point to a specific file, others some group of files.

these groupings could be defined in some other system like you described in the last comment. another system like the arbitrary groupings system from a year ago.

This is definitely possible. We're getting close to a sort of "type" (à la OO) system here, with the types being....

tiff, ocr, jpg, metadata, etc.

We could call this kind of "thing" in the storage another type, "group"? Potentially with the addendum "group/metadata" or "group/note"?

That could then produce, say, a JSON record with relative links. This would be dead simple to implement via a sidecar mongo database.
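As a sketch of what such a record could look like (the shape and the link style are hypothetical, not a settled format):

```python
import json
from urllib.parse import quote

# Hypothetical "group" record: members are atomic identifiers, and relative
# links point at their typed endpoints in the option 1.2/2 URL style.
group_record = {
    "identifier": "mvol/0001/0002/0003",
    "type": "group",
    "members": ["mvol/0001/0002/0003/0001", "mvol/0001/0002/0003/0002"],
}

def member_links(member_id, types=("tif", "ocr", "jpg")):
    # Encode each member id into a single path segment, then add the type suffix
    return ["/" + quote(member_id, safe="") + "/" + t for t in types]

record_with_links = dict(group_record,
                         links={m: member_links(m) for m in group_record["members"]})
print(json.dumps(record_with_links, indent=2))
```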

I'd definitely call this functionality a version > 1.0 target though.

I'd suggest that after our identifier issues are set in stone, we return to this in another issue in the future.

c-blair commented 6 years ago

"in the case of atomic identifiers, couldn't we have varying kinds of ids. that is some point to a specific file, others some group of files."

The idea is fine, but how do we create groups of files? Ad hoc? I've done it systematically, by mapping an ontology over our digital collections. Conceptually, a repository is a recursive structure. It consists of either a collection or a file (the base case). Collections can include collections. The repository is the collection of all collections. The following is hard to read, but it is along these lines.

https://dldc.lib.uchicago.edu/dl/collections/digcollingestspecifications.html#sec-4-2

The bottom line is that a lot of work has been done here already. I would like to review it, and see how it can be improved upon. If it can't, we can scrap it, but I want to proceed methodically.

"these groupings could be defined in some other system like you described in the last comment. another system like the arbitrary groupings system from a year ago."

I'm not sure what this is. In the case of the Maroon, we have a perfect example of what I was referring to above. We have a(n ur-)collection: the LDR. We have a collection as conventionally understood: Campus Publications. We have a title, which is a collection of issues: The Maroon. We have an issue, which is a collection of pages. We have a page. A page itself is a collection, if only of a TIFF image and the OCR. If we generate JPEGs on the fly, then that's a virtual file, I suppose (but a case to consider).
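The recursive structure described here can be modeled in a few lines (a toy sketch, not the ontology at the link above):

```python
# Toy model: a repository node is either a File (the base case) or a
# Collection, which may contain Collections and/or Files.
class File:
    def __init__(self, name):
        self.name = name

class Collection:
    def __init__(self, name, members=None):
        self.name = name
        self.members = members or []

def count_files(node):
    # Recursive walk mirroring the recursive definition of a repository
    if isinstance(node, File):
        return 1
    return sum(count_files(m) for m in node.members)

# The Maroon example: title -> issue -> page -> (TIFF, OCR)
page = Collection("page 44", [File("0044.tif"), File("0044.ocr")])
issue = Collection("mvol-0004-1902-1001", [page])
title = Collection("The Maroon", [issue])
print(count_files(title))  # 2
```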


verbalhanglider commented 6 years ago

So, just to post for the record: we've decided to proceed designing the digcoll_retriever with an identifier scheme. Once we have that design, we can write some basic tests as the first "coded" prototype and use the design and the tests as a launching point for a discussion comparing what is being offered currently versus what the digcoll_retriever design will provide.

After that discussion we can decide whether to proceed with the digcoll_retriever, or to simply modify the pi-lib source code to use owncloud as the storage location for files related to the "dig" namespace.

See the pi-lib repo: https://github.com/uchicago-library/pi-lib

c-blair commented 6 years ago

"... or to simply modify the pi-lib source code to use owncloud as storage location for files related to the "dig" namespace" Or, as I said on Slack, to have pi.lib point to another web server, as is being done with storage.lib. In effect, the owncloud location supersedes the disk space that storage.lib is looking at, so storage.lib could be modified to have owncloud as its docroot.


verbalhanglider commented 6 years ago

We can make these modifications to the pi-lib source code, or have the storage.lib.uchicago.edu web server be replaced (or superseded) by the owncloud data storage location. If it comes to that, we can make that change in very little time. There is minimal to zero risk in this option and it would mean continuing with conditions as they stand today.

However, while the digcoll_retriever does have slightly more risk than the option described in the above paragraph, the risk is still minimal. We can have a minimum viable product sharing campus publication files with the basic functionality in the design document defined here (https://github.com/uchicago-library/digcoll_retriever/wiki/Design-document) in 2 weeks of development. In that 2 weeks we can have a working demo that will show:

  1. dropping a file into owncloud
  2. showing the file become immediately available in a demo website hosted on a digital ocean droplet
  3. making a new version of the file in owncloud
  4. showing the file immediately get updated on the demo website

We will demonstrate this effect with two types of files:

  1. descriptive metadata that gets parsed and displayed on the website hosted on digital ocean
  2. an image file that gets rendered on the website hosted on digital ocean

This demonstration can be rapidly expanded to include all of the functionality in the design document, including the endpoints marked as advanced.

If after the demonstration it is agreed to move forward, we can build the advanced endpoints, and once the digcoll_retriever is ready for production:

  • pi-lib can be modified to redirect campub urls to the appropriate digcoll_retriever url

This option would require another 2 weeks of development to finish the remaining endpoints and an unknown amount of sysadmin time (my estimate is that once it was put on the docket it wouldn't take longer than another 2 weeks) to put in place the infrastructure necessary to do this.

If after the demonstration it is agreed not to move forward with the digcoll_retriever, we can modify the pi-lib source to pull files from owncloud storage instead of storage.lib.uchicago.edu storage. This can be done one of two ways:

  • mount owncloud data read-only to the server that pi-lib runs on and modify the request handling for campub urls to extract the resource from the owncloud file location
  • do not mount owncloud data read-only to the server and instead modify pi-lib to return content from owncloud rather than campub

If we decide to go this route, I would recommend just mounting owncloud read-only to pi-lib and modifying pi-lib to retrieve campub files from the owncloud data storage.

This option would probably take sysadmins about a day to mount owncloud read-only to the server that pi-lib is running on, and it would take you however much development time you estimate it would take to modify the code receiving campub urls and retrieving the files.

I would like to go with the former option of building the digcoll_retriever demo/prototype in 2 weeks and the production digcoll_retriever in another 2+ weeks, because if we do go with that route we can do the following:

  • reduce the overhead required of developers to create exhibit and/or collection websites around our content
  • reduce the overhead required of maintainers to ensure that the right file is being displayed the right way in the right interface
  • reduce the overhead required of system administrators to track that storage is being used safely and securely

All of this reduced overhead would allow us to put the freed resources toward solving other problems; in particular, it would allow us to share our digital resources with the campus community and the wider academic community much faster than we currently do.

c-blair commented 6 years ago

Comments interspersed below.

On Tue, Jul 18, 2017 at 8:35 PM, Tyler Danstrom notifications@github.com wrote:

We can make these modifications to the pi-lib source code or make the storage.lib.uchicago.edu web server be replaced (or get superseded by) owncloud data storage location. If it comes to that, we can make that change in very little time. There is minimal to zero risk in this option and it would mean continuing with conditions as they stand today.

However, while the digcoll_retriever does have slightly more risk than the option described in the above paragraph, the risk is still minimal. We can have a minimal viable product sharing campus publication files with the basic functionality in the design document defined here https://github.com/uchicago-library/digcoll_retriever/wiki/Design-document in 2 weeks of development. In that 2 weeks we can have a working demo that will show

  1. dropping a file into owncloud
  2. showing the file become immediately available in a demo website hosted on a digital ocean droplet
  3. making a new version of the file in owncloud
  4. show the file immediately get updated on the demo website

I prefer this latter approach.

We will demonstrate this effect with two types of files:

  1. descriptive metadata that gets parsed and displayed on the website hosted on digital ocean
  2. an image file that gets rendered on the website hosted on digital ocean

2 I can see. 1 I'm not so sure of. What is an example of 1? Normally descriptive metadata are part of the processing for the interface. Other examples of 2 would be links to the OCR (doesn't have to be JEJ OCR for a demo), a tif image, and a PDF for an issue.

This demonstration can be rapidly expanded to include all of the functionality in the design document including the endpoints marked as advanced in the design document

If after the demonstration it is agreed to move forward, we can build the advanced endpoints and once the digcoll_retriever is ready for production then:

  • pi-lib can be modified to redirect campub urls to the appropriate digcoll_retriever url

This option would require another 2 weeks of development to finish the remaining endpoints and an unknown amount of sysadmin time (my estimation is once it was put on the docket it wouldn't take longer than another 2 weeks) to put into place the infrastructure necessary to do this.

If after the demonstration it is agreed not to move forward with the digcoll_retriever, we can modify the pi-lib source to pull files from owncloud storage instead of storage.lib.uchicago.edu storage. This can be done one of two ways

  • mount owncloud data read-only to the server that pi-lib runs on and modify the request for campub urls to extract the resource from the owncloud file location
  • do not mount owncloud data read-only to the server and instead modify pi-lib to return content from owncloud rather than campub

If we decide to go this route, I would recommend just mounting owncloud read-only to pi-lib and you modify pi-lib to retrieve campub files from the owncloud data storage.

This option would probably take sysadmins about a day to mount owncloud read-only to the server that pi-lib is running on, and it would take you however much development time you estimate it would take to modify the code receiving campub urls and retrieving the files.

I would like to go with the former option of building the digcoll_retriever demo/prototype in 2 weeks and the production digcoll_retriever in another 2+ weeks because if we do go with that route we can do the following:

  • reduce the overhead required on developers to create exhibit and/or collection websites around our content
  • reduce the overhead required on maintainers to ensure that the right file is being displayed the right way in the right interface
  • reduce the overhead required on system administrators to track that storage is being used safely and securely

All of this reduced overhead would allow us to put the freed resources toward solving other problems; in particular, it would allow us to share our digital resources with the campus community and the wider academic community much faster than we currently do.

I prefer the second approach as well, because it is more powerful in the long run. It is always nice to have a fallback option that one can describe, but I don't believe in being so safe as to be stagnant. Prudent, achievable, (even exciting) yes. Stagnant (and stifling), no.


verbalhanglider commented 6 years ago

So, to summarize again and get this stated for the record: we have three options going forward.

  1. Rely exclusively on pi-lib and current workflows to get work done. These workflows would be unchanged for maintainers of digital collection files, developers of digital collection or exhibit websites, and for sysadmins.

  2. Rely on the pi-lib program and modify current workflows for maintainers of files for digital collections or digital exhibits by requiring them to use ownCloud, but keep unchanged the workflows for developers of digital collections or exhibit websites and for the sysadmins.

  3. Rely heavily on the new digcoll_retriever and change the workflows for all three roles of staff involved in digital collections or websites. This reliance could mean either that digcoll_retriever is the exclusive means of retrieving files from digital collections or exhibits for use in websites, or a hybrid of digcoll_retriever and the pi-lib program.

If we go with 1, we stop work on digcoll_retriever and abandon ownCloud. Instead, we should use the file system in place for ownCloud as a samba share that anybody who needs to add files to a digital collection would have authority to mount to their workstation and add and delete files to and from. This samba share would be mounted read-write to the server running the pi-lib program and the developer of the pi-lib program would take on the responsibility of adding new PI links to the pi-lib program for every collection added to the samba share.

If we go with 2, we can keep ownCloud and it would become the interface for producers of digital collections or exhibits to do their work. We would mount the ownCloud filesystem read-only to the server running the pi-lib program. The developer of the pi-lib program would modify that program to use the ownCloud file structure to retrieve files and generate PI links for digital collection files. Developers of digital collections or exhibit websites would need to be given read access to the samba share as well, so that they could make copies of the files they need for the website they are tasked with. Sysadmins would simply need to ensure that the samba share is replicated, and optionally that the replication is mirrored to the cloud.

If we go with 3, we keep ownCloud, which becomes the interface for producers of digital collections and exhibits; developers of digital collection or exhibit websites are freed from the responsibility of maintaining the files necessary for their websites in addition to being responsible for writing the code; and we may change the workflows for sysadmins responsible for maintaining the storage appliance and backups.

My advice as stated in the earlier comment is to go with 3 for the reasons I stated in that comment.

If, however, we decide not to go with option 3, then my advice is to go with option 1. That option requires no substantive change to workflows for anyone and can be maintained for the short to medium term with no changes required to infrastructure, whether computer hardware, computer software, or skills required by staff. It has the downside that we will continue to consume lots of overhead on the part of maintainers, developers, and sysadmins responding to every error or adjustment in any workflow, and given the rate of change in the technology industry it can be assumed that these errors and adjustments are likely to increase. An increasing amount of overhead would be consumed by this every year, and eventually it would reach a breaking point where we would have to make some very hard decisions about the future of digital collections and exhibits. This is the reason I recommend going with option 3.

Which option should we move forward with?

c-blair commented 6 years ago

Nice to see the options clearly laid out and argued. As I said earlier, we have decided on option 3 as the preferred option.
