Closed chrisdaaz closed 2 years ago
@chrisdaaz Looking over the XML DTD (available at https://secure.etdadmin.com/dtds/etd.dtd), the element DISS_access_option doesn't specify the options (unlike the embargo code, e.g. embargo_code (0 | 1 | 2 | 3 | 4) "0"):
<!--
This element contains the text of the selected access option.
For example "Open access", "Campus use only", etc.
-->
<!ELEMENT DISS_access_option (#PCDATA)>
Is it possible that there is only a limited number of actual values used for DISS_access_option that we could use to set visibility? If so, a mapping would help, i.e.
Open Access -> open
Campus use only -> authenticated
??? -> restricted
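Since the DTD declares the element as free text, reading it comes down to pulling out the trimmed string. A minimal sketch using REXML, assuming the element can appear anywhere in the record; the `access_option` helper and the `<record>` wrapper in the sample are placeholders, not the real ingest code or the full ProQuest structure:

```ruby
require "rexml/document"

# Returns the trimmed text of the first DISS_access_option element,
# or nil when the element is missing or empty.
# (Hypothetical helper, not taken from the Arch codebase.)
def access_option(xml)
  doc = REXML::Document.new(xml)
  element = REXML::XPath.first(doc, "//DISS_access_option")
  text = element&.text.to_s.strip
  text.empty? ? nil : text
end

# Placeholder fragment; a real ProQuest record has many more elements.
sample = <<~XML
  <record>
    <DISS_access_option>Campus use only</DISS_access_option>
  </record>
XML

puts access_option(sample).inspect
```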
Hey @chrisdaaz We have code written for the change for new dissertations. We're trying to figure out how many existing ones this may have affected.
We don't have the source files. They lifecycle out. Can you download them? We think we may be able to do some fancy grepping to figure out what we're dealing with.
@bmquinn @davidschober
I ran a report and found 771 "Campus use only" dissertations that are likely in Arch. Here's the report.
When you look at the report, the first column, ID, refers to a value we put into "Alternative Identifier". For example, a dissertation with an ID of 15484 would map to http://dissertations.umi.com/northwestern:15594 in the Arch record.
From the report, I can tell that there are two options available for DISS_access_option:
Open Access -> open
Campus use only -> authenticated
There are also blanks which would mean not applicable -- do nothing.
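The mapping above could be expressed as a small lookup. This is only a sketch: the constant and method names are made up, the visibility strings are the labels used in this thread rather than confirmed Arch constants, and the case-insensitive match is an assumption to cover both "Open access" and "Open Access".

```ruby
# Mapping taken from the report; keys are lower-cased for lookup.
# The values are the labels used in this thread (assumption: not
# necessarily the constants Arch uses internally).
ACCESS_OPTION_VISIBILITY = {
  "open access" => "open",
  "campus use only" => "authenticated"
}.freeze

# Returns the visibility for an access option, or nil for blank or
# unrecognised values so the caller can leave the work untouched.
def visibility_for(access_option)
  ACCESS_OPTION_VISIBILITY[access_option.to_s.strip.downcase]
end

puts visibility_for("Campus use only").inspect
puts visibility_for("").inspect
```

Returning nil for blanks keeps the "not applicable, do nothing" case explicit instead of falling back to a default visibility.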
Another thing: all dissertations added before the batch ingest feature was available will not have that "Alternative Identifier", so the ID field in the report won't help us. Can we match by Title?
Ok @chrisdaaz I wrote a script to generate a new CSV to determine if we can use titles to find all the dissertations. Here's the script I ran (for future reference if needed):
require "aws-sdk-s3"
require "csv"

# Run from the Rails console so that GenericWork is available.
s3 = Aws::S3::Client.new
resp = s3.get_object(bucket: "stack-p-arch-dropbox", key: "titles_names.csv")
csv = CSV.parse(resp.body.string, headers: true, header_converters: :symbol, liberal_parsing: true)

csv_string = CSV.generate do |new_csv|
  csv.each.with_index(1) do |row, index|
    # Take the first work whose title matches the ProQuest title.
    gw = GenericWork.where(title: Array(row[:title])).first
    # True when the ProQuest last name appears in any creator of the matched work.
    match = gw&.creator&.present? ? gw.creator.any? { |c| c.include?(row[:student_last_name]) } : false
    new_csv << [index, gw&.id, row[:title], row[:student_last_name], match]
  end
end

s3.put_object(acl: "authenticated-read", body: csv_string, bucket: "stack-p-arch-dropbox", key: "title_matches.csv")
The output CSV is at s3://stack-p-arch-dropbox/title_matches.csv if you want to download it and take a look. If there is a title match, the second column should contain the Arch ID for the dissertation. A blank means no match, which could be for a number of reasons, including funky character encodings; 45 in total didn't match the title query. The last column is a boolean check of whether the last name in the ProQuest spreadsheet is part of any of the creators' names in the record found by title (I hope that sentence is understandable).
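For anyone reviewing the output, a quick tally can separate the rows that need manual follow-up. A rough sketch, assuming the file has no header row and the five columns are in the order the script writes them (index, Arch ID, title, last name, match); `summarize_matches` and the sample rows are invented for illustration:

```ruby
require "csv"

# Splits the rows into title misses and title hits whose creator check
# failed. Column positions follow the script above:
# 0 index, 1 Arch ID, 2 title, 3 last name, 4 match flag.
def summarize_matches(csv_text)
  rows = CSV.parse(csv_text, liberal_parsing: true)
  unmatched, matched = rows.partition { |row| row[1].to_s.strip.empty? }
  name_mismatch = matched.reject { |row| row[4] == "true" }
  {
    total: rows.size,
    unmatched_titles: unmatched.map { |row| row[2] },
    name_mismatch_ids: name_mismatch.map { |row| row[1] }
  }
end

# Made-up rows, not real data from the report.
sample = <<~CSV
  1,abc123,A Study of Things,Smith,true
  2,,Another Study,Jones,false
  3,def456,Third Study,Lee,false
CSV

summary = summarize_matches(sample)
puts summary.inspect
```

Rows listed under `name_mismatch_ids` matched on title but not on creator, so they are probably worth checking by hand before any visibility change.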
@bmquinn moving into in progress. Toss points on it at some point.
@bmquinn wondering what your thoughts are about this idea: what if we applied authenticated access restrictions on filesets while keeping the works public?
authenticated works in Arch are not discoverable from Google, NUsearch, or Arch's browse/search features. They require the user to log in before they can find and access a work and its files. Users must somehow know a work exists in Arch before they can access it.
Users who can access works via NetID authentication currently have no way of discovering dissertations via Google or NUsearch. I wonder if the following scenario could be done programmatically:
This would signal to campus (via Google/NUsearch indexing) that Arch has dissertations that may be relevant to their research. When they visit the public Work record in Arch and attempt to download the dissertation PDF, they will be prompted to Login with NetID. Does this make sense?
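The idea of public works with restricted files can be pictured with stand-in objects. This is only an illustration: `WorkStub` and `FileSetStub` are plain Structs, not the real Arch or Hyrax models, and the visibility strings are the labels used in this thread. Whether search indexing and the download prompt behave as hoped would still need to be confirmed in Arch itself.

```ruby
# Stand-ins for the real models (hypothetical, for illustration only).
WorkStub = Struct.new(:id, :visibility, :file_sets)
FileSetStub = Struct.new(:label, :visibility)

# Keeps the work record public so it stays discoverable, while
# requiring login for every attached file.
def restrict_files_only(work)
  work.visibility = "open"
  work.file_sets.each { |fs| fs.visibility = "authenticated" }
  work
end

work = WorkStub.new(
  "abc123",
  "authenticated",
  [FileSetStub.new("dissertation.pdf", "authenticated")]
)
restrict_files_only(work)
puts work.visibility
puts work.file_sets.map(&:visibility).inspect
```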
As we discussed, you might not be able to find every dissertation in Arch via the script, so I can check on those remaining dissertations manually.
Please add your planning poker estimate with ZenHub @bmquinn
Hi @chrisdaaz I've been doing some dry-run testing of the script I've written to fix these, but I have a quick question before I hit "go". There are 17 works in the batch of 770 that have FileSets in addition to the PDF with the ProQuest id, e.g. XXXX_1234.pdf (a range of types including video, documents, images, etc.). Should I set the visibility on those the same as the "main" one or leave their visibility as-is? Thanks!
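Whichever way the answer goes, the script needs some way to tell the main PDF from the extra files. One possible approach is to split on the file name, assuming the main PDF always ends in an underscore, digits, and ".pdf" as in the XXXX_1234.pdf example; that pattern and the `split_file_sets` helper are guesses for illustration, not something confirmed about the ProQuest deliveries.

```ruby
# Hypothetical pattern: names ending in "_<digits>.pdf", as in XXXX_1234.pdf.
MAIN_PDF_PATTERN = /_\d+\.pdf\z/i

# Separates file names into the ProQuest PDF and everything else.
def split_file_sets(labels)
  main, supplementary = labels.partition { |label| label.match?(MAIN_PDF_PATTERN) }
  { main: main, supplementary: supplementary }
end

# Made-up file names for the example.
result = split_file_sets(["Smith_1234.pdf", "interview.mp4", "appendix.pdf"])
puts result.inspect
```

With the two groups separated, the dry run can report what would change for each group before any visibility is written.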
Severity
Is the production site running?
Are staff blocked from performing their work?
Descriptive summary
This happened recently, and may have happened before, so I'm wondering how we might investigate it in case it's an issue. A graduate of Northwestern notified the library that their dissertation was publicly available in Arch. We got their dissertation from ProQuest, which sends us a .zip containing the PDF and an XML record for the dissertation. The batch ingest process should be able to read all of the metadata from the XML record and make appropriate adjustments to the Work record in Arch. However, in this case (and possibly others), it seems that <DISS_access_option>Campus use only</DISS_access_option> results in the Work becoming public.
Expected behavior
<DISS_access_option>Campus use only</DISS_access_option> in Dissertation XML = Visibility: Northwestern in Arch Work record
Actual behavior
<DISS_access_option>Campus use only</DISS_access_option> in Dissertation XML = Visibility: Public in Arch Work record
@bmquinn what do you think we should do?
Here's the full XML record for the recent incident: