"Campus use only" releases dissertation to the public

chrisdaaz commented 2 years ago

Severity

Is the production site running?

[x] yes
[ ] no

Are staff blocked from performing their work?

[ ] yes
[x] no

Descriptive summary

This happened recently, and may have happened before, so I'm wondering how we might investigate it in case it's a wider issue. A graduate of Northwestern notified the library that their dissertation was publicly available in Arch. We got their dissertation from ProQuest, who sends us a .zip containing the PDF and an XML record for the dissertation. The batch ingest process should be able to read all of the metadata from the XML record and make appropriate adjustments to the Work record in Arch. However, in this case (and possibly others), it seems that <DISS_access_option>Campus use only</DISS_access_option> results in the Work becoming public.

Expected behavior

<DISS_access_option>Campus use only</DISS_access_option> in Dissertation XML = Visibility: Northwestern in Arch Work record

Actual behavior

<DISS_access_option>Campus use only</DISS_access_option> in Dissertation XML = Visibility: Public in Arch Work record

@bmquinn what do you think we should do?

Here's the full XML record for the recent incident:

<?xml version="1.0" encoding="UTF-8"?>
<DISS_submission embargo_code="0" publishing_option="0" third_party_search="N">
    <DISS_authorship>
        <DISS_author type="primary">
            <DISS_name>
                <DISS_surname>Warfel</DISS_surname>
                <DISS_fname>Joseph</DISS_fname>
                <DISS_middle/>
                <DISS_suffix/>
            </DISS_name>
            <DISS_contact type="current">
                <DISS_contact_effdt>12/01/2017</DISS_contact_effdt>
                <DISS_address>
                    <DISS_addrline>PO Box 4802</DISS_addrline>
                    <DISS_city>Skokie</DISS_city>
                    <DISS_st>IL</DISS_st>
                    <DISS_pcode>60076</DISS_pcode>
                    <DISS_country>US</DISS_country>
                </DISS_address>
                <DISS_email>joseph.warfel@u.northwestern.edu</DISS_email>
                <DISS_school_email>joseph.warfel@u.northwestern.edu</DISS_school_email>
            </DISS_contact>
            <DISS_contact type="future">
                <DISS_contact_effdt>12/01/2017</DISS_contact_effdt>
                <DISS_address>
                    <DISS_addrline>201 Oberreich St</DISS_addrline>
                    <DISS_city>LaPorte</DISS_city>
                    <DISS_st>IN</DISS_st>
                    <DISS_pcode>46350</DISS_pcode>
                    <DISS_country>US</DISS_country>
                </DISS_address>
                <DISS_email>joseph.warfel@u.northwestern.edu</DISS_email>
                <DISS_school_email>joseph.warfel@u.northwestern.edu</DISS_school_email>
            </DISS_contact>
            <DISS_citizenship/>
            <DISS_orcid/>
        </DISS_author>
    </DISS_authorship>
    <DISS_description apply_for_copyright="no" external_id="http://dissertations.umi.com/northwestern:14002" page_count="" type="doctoral">
        <DISS_title>Operations Management of Food Recovery Programs</DISS_title>
        <DISS_dates>
            <DISS_comp_date>2017</DISS_comp_date>
            <DISS_accept_date>01/01/2017</DISS_accept_date>
        </DISS_dates>
        <DISS_degree>Ph.D.</DISS_degree>
        <DISS_institution>
            <DISS_inst_code>0163</DISS_inst_code>
            <DISS_inst_name>Northwestern University</DISS_inst_name>
            <DISS_inst_contact>Industrial Engineering and Management Sciences</DISS_inst_contact>
            <DISS_processing_code>D</DISS_processing_code>
        </DISS_institution>
        <DISS_advisor>
            <DISS_name>
                <DISS_surname>Smilowitz</DISS_surname>
                <DISS_fname>Karen</DISS_fname>
                <DISS_middle>R.</DISS_middle>
            </DISS_name>
        </DISS_advisor>
        <DISS_advisor>
            <DISS_name>
                <DISS_surname>Iravani</DISS_surname>
                <DISS_fname>Seyed</DISS_fname>
                <DISS_middle>M. R.</DISS_middle>
            </DISS_name>
        </DISS_advisor>
        <DISS_cmte_member>
            <DISS_name>
                <DISS_surname>Balçık</DISS_surname>
                <DISS_fname>Burcu</DISS_fname>
                <DISS_middle/>
                <DISS_suffix/>
            </DISS_name>
        </DISS_cmte_member>
        <DISS_categorization>
            <DISS_category>
                <DISS_cat_code>0796</DISS_cat_code>
                <DISS_cat_desc>Operations research</DISS_cat_desc>
            </DISS_category>
            <DISS_keyword/>
            <DISS_language>en</DISS_language>
        </DISS_categorization>
    </DISS_description>
    <DISS_content>
        <DISS_abstract>
            <DISS_para>Food recovery programs (FRPs) divert potential waste at grocery stores so that it can be distributed to people who do not have enough food. FRPs are administered by food banks, nonprofit organizations dedicated to the alleviation of hunger. The primary purpose of FRP is to collect donations. Eventually, the food is distributed to other nonprofit organizations (referred to as “agencies”) which in turn provide it to families and individuals. A few food banks include agencies on FRP routes, a practice that is becoming more common. This innovation presents opportunities and challenges: the presence of agencies allows the food bank to reduce the required vehicle capacity and more quickly distribute perishable food, but donations are random, so it is difficult to provide consistent service to the agencies.</DISS_para>
            <DISS_para>In this dissertation, we study three closely related models of FRP operations.</DISS_para>
            <DISS_para>The one-commodity pickup and delivery allocation problem (1-PDA) models allocation decisions for a given FRP route. The objective of the 1-PDA is to minimize the required vehicle capacity. We develop a simple three-step algorithm, the MILB algorithm, that obtains an optimal solution to the 1-PDA.</DISS_para>
            <DISS_para>We augment the 1-PDA with agency selection and node sequencing decisions to formulate the selective 1-PDTSP with stochastic supply as a mixed-integer linear program (MILP). It is possible to solve the problem with a MILP solver, but the solution time is prohibitive for many realistic instances. Therefore, we propose a heuristic procedure, the capacity reuse insertion heuristic (CRIH), based on inserting agencies into existing FRP routes. In a case study based on data provided by Northern Illinois Food Bank, we obtain insights regarding agency selection and node sequencing for FRP. We also demonstrate that CRIH provides near-optimal solutions.</DISS_para>
            <DISS_para>To model FRP operations at food banks where routing is inflexible and the food obtained from FRP is crucial to agency operations, we generalize the 1-PDA to model the one-commodity pickup and delivery allocation problem for agency-supporting FRP (the 1-PDA-as). The 1-PDA-as differs from the 1-PDA by including parameters that specify additional service requirements at donors and agencies. The objective of the 1-PDA-as is to maximize total donations collected for a given route. By applying several reformulations, we develop an optimal solution procedure for the 1-PDA-as that relies on solving a series of linear programs; however, this solution procedure cannot be applied to many realistic instances due to issues of numerical precision. Therefore, we propose a heuristic solution procedure based on the MILB algorithm. In a case study, we obtain insights about node parameters and node sequencing. We also demonstrate that the</DISS_para>
            <DISS_para>heuristic generates near-optimal solutions.</DISS_para>
        </DISS_abstract>
        <DISS_binary type="PDF">Warfel_northwestern_0163D_14002.pdf</DISS_binary>
    </DISS_content>
    <DISS_restriction/>
    <DISS_repository>
        <DISS_version>2017-01-23 16:32:28</DISS_version>
        <DISS_agreement_decision_date>2017-12-01 04:53:36</DISS_agreement_decision_date>
        <DISS_acceptance>1</DISS_acceptance>
        <DISS_delayed_release/>
        <DISS_access_option>Campus use only</DISS_access_option>
    </DISS_repository>
    <DISS_creative_commons_license>
        <DISS_abbreviation/>
    </DISS_creative_commons_license>
</DISS_submission>

bmquinn commented 2 years ago

@chrisdaaz Looking over the XML DTD (can be accessed at https://secure.etdadmin.com/dtds/etd.dtd) the element DISS_access_option doesn't specify the options (unlike embargo code, e.g. embargo_code (0 | 1 | 2 | 3 | 4) "0"):

<!--
  This element contains the text of the selected access option.
  For example "Open access", "Campus use only", etc.
-->
<!ELEMENT DISS_access_option (#PCDATA)>

Is it possible that there is only a limited number of actual values used for DISS_access_option that we could use to set visibility? If so, a mapping would help, i.e.

Open Access -> open
Campus use only -> authenticated
??? -> restricted
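
One way to check that empirically, sketched on the assumption that the ProQuest XML records are available as strings: pull the DISS_access_option text out of each record with Ruby's REXML and tally the distinct values. The helper names (`access_option`, `tally_access_options`) and the sample records are illustrative only and are not part of Arch.

```ruby
require "rexml/document"

# Returns the stripped text of DISS_access_option, or nil when the element
# is missing or empty. (Hypothetical helper, not part of Arch.)
def access_option(xml)
  doc = REXML::Document.new(xml)
  element = REXML::XPath.first(doc, "//DISS_access_option")
  text = element ? element.text.to_s.strip : ""
  text.empty? ? nil : text
end

# Counts how often each distinct access option appears across the records.
def tally_access_options(xml_records)
  xml_records.map { |xml| access_option(xml) }.tally
end

# Made-up minimal records, trimmed to the one element of interest.
SAMPLE_RECORDS = [
  "<DISS_submission><DISS_repository><DISS_access_option>Campus use only</DISS_access_option></DISS_repository></DISS_submission>",
  "<DISS_submission><DISS_repository><DISS_access_option>Open access</DISS_access_option></DISS_repository></DISS_submission>",
  "<DISS_submission><DISS_repository><DISS_access_option>Campus use only</DISS_access_option></DISS_repository></DISS_submission>",
  "<DISS_submission><DISS_repository><DISS_access_option/></DISS_repository></DISS_submission>"
].freeze

puts tally_access_options(SAMPLE_RECORDS).inspect
```

Any key in the tally other than the expected handful would be a value the mapping still needs to cover.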

davidschober commented 2 years ago

Hey @chrisdaaz We have code written for the change for new dissertations. We're trying to figure out how many of these it may have affected.

We don't have the source files; they lifecycle out. Can you download them? We think we may be able to do some fancy grepping to figure out what we're dealing with.

chrisdaaz commented 2 years ago

@bmquinn @davidschober

I ran a report and found 771 "Campus use only" dissertations that are likely in Arch. Here's the report.

When you look at the report, the first column ID refers to a value we put into "Alternative Identifier". For example, a dissertation with an ID of 15484 would map to http://dissertations.umi.com/northwestern:15594 in the Arch record.

From the report, I can tell that there are two options available for DISS_access_option:

Open Access -> open
Campus use only -> authenticated

There are also blanks which would mean not applicable -- do nothing.
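
A minimal sketch of that mapping as a lookup, assuming the visibility strings `open` and `authenticated` are what Arch expects (they are taken from this thread, not checked against the code). The constant and method names are made up for illustration. Matching is case-insensitive because the DTD comment spells it "Open access" while the report shows "Open Access".

```ruby
# Access option text (lowercased) => visibility string, per the thread's mapping.
ACCESS_OPTION_VISIBILITY = {
  "open access" => "open",
  "campus use only" => "authenticated"
}.freeze

# Returns the visibility for an access option, or nil when the option is
# blank or unrecognised, meaning "leave the work's visibility alone".
def visibility_for(access_option)
  key = access_option.to_s.strip.downcase
  ACCESS_OPTION_VISIBILITY[key]
end

puts visibility_for("Campus use only")
```

Returning nil for unrecognised values mirrors the "do nothing" rule for blanks; a caller that prefers to fail closed could treat an unrecognised non-blank value as restricted instead.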

Another thing: all dissertations added before the batch ingest feature was available will not have that "Alternative Identifier", so the ID field in the report won't help us. Can we match by Title?

bmquinn commented 2 years ago

Ok @chrisdaaz I wrote a script to generate a new CSV to determine if we can use titles to find all the dissertations. Here's the script I ran (for future reference if needed):

# Assumes a Rails console session in the Arch app, where GenericWork, CSV and the AWS SDK are already loaded.
s3 = Aws::S3::Client.new
resp = s3.get_object(bucket: "stack-p-arch-dropbox", key: "titles_names.csv")
csv = CSV.parse(resp.body.string, headers: true, header_converters: :symbol, liberal_parsing: true)

csv_string = CSV.generate do |new_csv|
  csv.each.with_index(1) do |row, index|
    # First work whose title matches the spreadsheet row (nil when there is none).
    gw = GenericWork.where(title: Array(row[:title])).first
    # True when the student's last name appears in any creator value on that work.
    match = gw&.creator&.present? ? gw.creator.any? { |c| c.include?(row[:student_last_name]) } : false
    new_csv << [index, gw&.id, row[:title], row[:student_last_name], match]
  end
end

# Upload the results next to the input file.
s3.put_object({acl: "authenticated-read", body: csv_string, bucket: "stack-p-arch-dropbox", key: "title_matches.csv"})

The output CSV is at s3://stack-p-arch-dropbox/title_matches.csv if you want to download it and take a look. If there is a title match, the second column should contain the Arch ID for the dissertation. A blank means no match, which could be for a number of reasons, including funky character encodings; 45 in total didn't match the title query. The last column is a boolean: it is true when the last name in the ProQuest spreadsheet appears in any of the creators' names on the record found by title.
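
For anyone reviewing that file, here is a rough sketch of how the rows could be summarised once downloaded. It assumes the five headerless columns written by the script above (index, Arch ID, title, last name, match flag); the method name and the sample rows, including the IDs, are made up.

```ruby
require "csv"

# Summarises title_matches.csv content: how many rows found no work by title,
# and how many found a work whose creators do not include the last name.
def summarize_title_matches(csv_text)
  rows = CSV.parse(csv_text)
  matched, unmatched = rows.partition { |r| !r[1].to_s.strip.empty? }
  name_mismatch = matched.reject { |r| r[4] == "true" }
  {
    total: rows.size,
    no_title_match: unmatched.size,
    name_mismatch: name_mismatch.size
  }
end

# Made-up sample in the same column layout.
SAMPLE_MATCHES = <<~CSV
  1,abc123,Operations Management of Food Recovery Programs,Warfel,true
  2,,An Unmatched Title,Doe,false
  3,def456,"A Title, With a Comma",Roe,false
CSV

puts summarize_title_matches(SAMPLE_MATCHES).inspect
```

Rows in the `name_mismatch` bucket are probably the ones most worth a manual look, since a title match with a different author could be a different work.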

davidschober commented 2 years ago

@bmquinn moving into in progress. Toss points on it at some point.

chrisdaaz commented 2 years ago

@bmquinn wondering what your thoughts are about this idea: what if we applied authenticated access restrictions on filesets while keeping the works public?

Authenticated works in Arch are not discoverable from Google, NUsearch, or Arch's browse/search features. They require the user to log in before they can find and access a work and its files, so users must somehow know a work exists in Arch before they can access it.

Users who can access works via NetID authentication currently have no way of discovering dissertations via Google or NUsearch. I wonder if the following scenario could be done programmatically:

Find dissertations that have "Campus use only" values in their ProQuest XML metadata
Change the visibility of those Works to Public
Change the visibility of those Works' FileSets to Northwestern

This would signal to campus (via Google/ NUsearch indexing) that Arch has dissertations that may be relevant to their research. When they visit the public Work record in Arch and attempt to download the dissertation PDF, they will be prompted to Login with NetID. Does this make sense?
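
To make the proposal concrete, here is a toy sketch of the visibility change using plain Ruby Structs as stand-ins for Arch's work and file set models. The Struct names, the method name, and the `open` / `authenticated` strings (borrowed from the mapping earlier in this thread) are all assumptions; the real change would go through the repository's own models.

```ruby
# Stand-ins for the real models, for illustration only.
FileSetStub = Struct.new(:label, :visibility)
WorkStub = Struct.new(:title, :visibility, :file_sets)

# Makes the work record public while limiting every attached file to
# authenticated (Northwestern) users.
def apply_campus_only_policy(work)
  work.visibility = "open"
  work.file_sets.each { |fs| fs.visibility = "authenticated" }
  work
end

work = WorkStub.new(
  "Operations Management of Food Recovery Programs",
  "authenticated",
  [FileSetStub.new("Warfel_northwestern_0163D_14002.pdf", "open")]
)
apply_campus_only_policy(work)
puts "#{work.visibility} / #{work.file_sets.map(&:visibility).join(', ')}"
```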

As we discussed, you might not be able to find every dissertation in Arch via the script, so I can check on those remaining dissertations manually.

kdid commented 2 years ago

Please add your planning poker estimate with ZenHub @bmquinn

bmquinn commented 2 years ago

Hi @chrisdaaz I've been doing some dry-run testing of the script I've written to fix these, but I have a quick question before I hit "go". There are 17 works in the batch of 770 that have FileSets in addition to the PDF named with the ProQuest ID, e.g. XXXX_1234.pdf. The extra files cover a range of types, including video, documents, and images. Should I set the visibility on those the same as the "main" one, or leave their visibility as-is? Thanks!
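
For reference, a sketch of how the "main" PDF could be told apart from the extra files by name. The pattern is a guess based on names like Warfel_northwestern_0163D_14002.pdf (ending in an underscore, digits, then .pdf); it is not taken from the fix script, and the method name is illustrative.

```ruby
# Guessed pattern for the ProQuest-named PDF: "..._<digits>.pdf".
PROQUEST_PDF = /_\d+\.pdf\z/i

# Splits file names into the ProQuest PDF(s) and everything else.
def split_file_names(file_names)
  main, supplemental = file_names.partition { |name| name.match?(PROQUEST_PDF) }
  { main: main, supplemental: supplemental }
end

names = ["Warfel_northwestern_0163D_14002.pdf", "interview.mp4", "appendix.pdf"]
puts split_file_names(names).inspect
```

A supplemental file that happens to end in an underscore and digits (say, notes_2.pdf) would be misclassified as the main PDF, so the 17 works are worth checking by hand either way.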
