sul-dlss / FOLIO-Project-Stanford

Task management for Stanford’s analysis of FOLIO.

Data cleanup - remove ezproxy prefixes #405

Closed ahafele closed 1 year ago

ahafele commented 1 year ago

In the last SW work cycle we changed the way ezproxy prefixes are generated and it now happens dynamically in SearchWorks. Acq would like to have the ezproxy prefix removed from all existing records. Needs coordination with Jeanette on preprocessing script updates.

dlrueda commented 1 year ago

Is it possible that we only have ~500 records to fix for this? I did a search in all marc fields for “ezproxy” and only got 483 keys (all but 1 or 2 were in 856).

If that is correct we probably should just have Datacontrol fix these with SDC? And fix the preprocessing scripts at the same time/beforehand of course.

ahafele commented 1 year ago

Hmm, and these 483 look like law records. Maybe Jeanette has already done this work with SDC? @jlkalchik ?

trapido commented 1 year ago

@dlrueda The proxy prefix that we would like to remove is “https://stanford.idm.oclc.org/login?url=”. Based on a quick search for “stanford.idm.oclc.org” in 856, there are about 544K records. Thanks so much for your help with this!

dlrueda commented 1 year ago

I find the same number of records searching for stanford.idm.oclc.org: 544,031 records.

@jlkalchik and @trapido Please confirm that I don’t need to evaluate the data beyond doing a search for “stanford.idm.oclc.org” and then substituting “” for the entire string “https://stanford.idm.oclc.org/login?url=” in only 856 or 956 tags. (I can also search for that whole string to be more precise.)

jlkalchik commented 1 year ago

stanford.idm.oclc.org.xlsx

I believe it will need to be more precise than just stanford.idm.oclc.org. A search of BCA for that and not https://stanford.idm.oclc.org/login?url= retrieves 5130 records, most of which seem to be records from the DRAM package which requires a special proxy prefix.

trapido commented 1 year ago

@dlrueda To exclude records that have the proxy stem in the middle of the URL, it is best to select based on the entire proxy prefix, https://stanford.idm.oclc.org/login?url=. Yes, that is correct: we want to replace https://stanford.idm.oclc.org/login?url= with “”. Also, I’ve checked with Law and they are asking us to postpone this work until next week. Would that be okay? Thanks so much!
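
To illustrate the selection rule above, here is a minimal sketch. It is not the script used for this work; the function names and the sample URLs are hypothetical, and the second sample only assumes what a URL with the proxy stem "in the middle" might look like.

```python
# Full prefix to be removed, and the bare proxy host, as quoted in this thread.
PROXY_PREFIX = "https://stanford.idm.oclc.org/login?url="
PROXY_HOST = "stanford.idm.oclc.org"


def has_full_prefix(url: str) -> bool:
    """True only when the entire proxy prefix appears in the URL."""
    return PROXY_PREFIX in url


def has_host_only(url: str) -> bool:
    """True when the proxy host appears but the full prefix does not.

    These are the URLs a search on the host alone would pick up by mistake.
    """
    return PROXY_HOST in url and PROXY_PREFIX not in url


# Hypothetical sample URLs, for illustration only.
samples = [
    "https://stanford.idm.oclc.org/login?url=https://example.org/book",
    "https://vendor-example-org.stanford.idm.oclc.org/item/123",
]

for url in samples:
    print(url, has_full_prefix(url), has_host_only(url))
```

Selecting on `has_full_prefix` leaves the second kind of URL untouched, which is the behavior asked for here.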

dlrueda commented 1 year ago

OK, as a final confirmation, @trapido @jlkalchik: I should look for https://stanford.idm.oclc.org/login?url= in ANY part of the 856 or 956 and replace it with “”.

Fine to wait until next week, it will take me a bit to get this ready to go anyway. I’ll proceed to write it up and do some tests on Morison in the meantime.
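
As a rough sketch of the find-and-replace described above (not the actual script, which is not shown in this thread): a record is modeled here as a plain list of (tag, value) pairs, which is an assumption made only to keep the example self-contained. The function name and sample record are hypothetical.

```python
PROXY_PREFIX = "https://stanford.idm.oclc.org/login?url="
TARGET_TAGS = {"856", "956"}  # only these tags are touched


def strip_proxy_prefix(fields):
    """Return a new list of (tag, value) pairs.

    The prefix is removed wherever it occurs in 856 and 956 values;
    every other tag passes through unchanged.
    """
    cleaned = []
    for tag, value in fields:
        if tag in TARGET_TAGS:
            value = value.replace(PROXY_PREFIX, "")
        cleaned.append((tag, value))
    return cleaned


# Hypothetical record, for illustration only.
record = [
    ("245", "Example title"),
    ("500", "Note mentioning https://stanford.idm.oclc.org/login?url=x"),
    ("856", "$uhttps://stanford.idm.oclc.org/login?url=https://example.org/a"),
    ("956", "$uhttps://example.org/b"),
]

for tag, value in strip_proxy_prefix(record):
    print(tag, value)
```

Restricting the change to a fixed set of tags means a prefix that happens to appear in a note field is left alone.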

ahafele commented 1 year ago

@jlkalchik can you confirm that all the preprocessing scripts have been updated to stop adding this? We should wait on that if not. Do we need to let Law know we are doing this?

jlkalchik commented 1 year ago

@ahafele Not all have been updated yet, so I will try to get to them by the end of next week. That fits well with Law, since they wanted to wait until next week.

trapido commented 1 year ago

@dlrueda Yes, that is correct: we need to look for https://stanford.idm.oclc.org/login?url= anywhere in 856 or 956 and replace it with “”.

jlkalchik commented 1 year ago

I know there are some records where the proxy prefix is duplicated: https://searchworks.stanford.edu/?search_field=search&q=%22https%3A%2F%2Fstanford.idm.oclc.org%2Flogin%26url%3Dhttps%3A%2F%2Fstanford.idm.oclc.org%2Flogin%26url%3D%22

Will these be updated as part of this project, or should they be fixed afterward using SDC?
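
For what it's worth, whether duplicated prefixes get cleaned up depends on how the replacement is done. The sketch below contrasts two approaches; it is an illustration only, the function names and sample URL are hypothetical, and nothing in this thread says which approach the cleanup script takes.

```python
PROXY_PREFIX = "https://stanford.idm.oclc.org/login?url="


def strip_once(url: str) -> str:
    """Remove a single leading prefix; a second copy would survive."""
    if url.startswith(PROXY_PREFIX):
        return url[len(PROXY_PREFIX):]
    return url


def strip_all(url: str) -> str:
    """Remove every occurrence of the prefix, wherever it appears."""
    return url.replace(PROXY_PREFIX, "")


# Hypothetical URL with the prefix duplicated, for illustration only.
doubled = PROXY_PREFIX + PROXY_PREFIX + "https://example.org/thesis"

print(strip_once(doubled))
print(strip_all(doubled))
```

A replace-everywhere approach would handle the duplicated records in the same pass; a strip-once approach would leave them for a follow-up fix.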

jlkalchik commented 1 year ago

One other place that seems to get the EZproxy prefix automatically is ProQuest Dissertations & Theses URLs.

dlrueda commented 1 year ago

> One other place that seems to get the EZproxy prefix automatically is ProQuest Dissertations & Theses URLs.

Thanks! I’ve updated that script to no longer add the ezproxy prefix to the 856$u

dlrueda commented 1 year ago

Script created /s/SUL/OneTime/Ezproxy-prefix-remove/find_remove_ezproxy_prefix.ksh

2 hour run time on Morison for 465,047 keys.

Bodoni prepped with 520,359 non-LAW ckeys. Will run Saturday 5/20 after adutext has finished processing for the day.

Before running, add “-s” flag to hourly SearchWorks export in cron and leave on for 5/20. On 5/21, remove “-s” flag. Leave on for nightly incremental export (that runs on 5/21)

dlrueda commented 1 year ago

Work complete. Records are still being metered through adutext, but the update itself is done.