CIL metadata transformation QA

arwenhutt commented 5 years ago

Descriptive summary

Review transformation output from CIL harvest process and update mapping/script.

part of #316

lsitu commented 4 years ago

@arwenhutt / @abbypenn93 I think the new version of the CIL metadata transform is ready for review now. The output is in the same location in rdcp-staging: /rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_object_input.csv

/rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_subject_headings.csv

Thanks.

abbypenn93 commented 4 years ago

I should be able to check it out tomorrow.

lsitu commented 4 years ago

@abbypenn93 I've run the CIL process on staging to update the harvest for August 1 and fix the missing content files issue you mentioned in https://github.com/ucsdlib/damsmanager/issues/317#issuecomment-523546257 above. Could you review it again? Thanks.

abbypenn93 commented 4 years ago

@arwenhutt @lsitu @hjsyoo Results from a check for missing files:

1) Objects 50582 and 50583 are missing their .jpg file (in both cases the other files referenced in the json are 3view-stack-final-bin1.mrc and 3view-stack-final-bin10.mrc). The spreadsheet correctly lists related image files.
2) No related image files: the json metadata for 50601, 50602, and 50603 points to 50600.zip and 50600.jpg. This is weird because when performing an image data download for these objects on the CIL data website, the correct file name is listed in the download window (e.g. http://www.cellimagelibrary.org/images/50601).

abbypenn93 commented 4 years ago

@lsitu When you feel like all the issues have been resolved, we'll need a full dump of non-copyrighted records. This will enable us to check that issues related to specific objects have been addressed and that all CIL metadata fields are being transformed properly. Objects in the latest harvests only utilize a subset of the full CIL metadata schema and so cannot be used to verify software fixes.

This request is also listed in Ho Jung's email from today (2019/09/03)

lsitu commented 4 years ago

@abbypenn93 I think the output file /rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_object_input.csv (see https://github.com/ucsdlib/damsmanager/issues/317#issuecomment-526228753) includes all records, and we will know whether any other issues exist once you review it. But if you would like a full dump of all records now, please delete the folders generated by the CIL harvest under /rdcp-0126-cil/ and I can run the CIL process for a new, clean dump. Thanks.

abbypenn93 commented 4 years ago

@lsitu Why do you want to delete the old harvest files?

jessicahilt commented 4 years ago

@abbypenn93 Just an FYI that @lsitu is on vacation today.

hjsyoo commented 4 years ago

Hi all, I'd like to describe here some terminology and versioning challenges that seem to be causing some confusion. First, let's standardize the terms we use for the CIL workflow (feel free to suggest better ones or add something I've missed):

source metadata harvest = the JSON files at https://cilia.crbs.ucds.edu/rest/public_ids?* are copied into /rdcp-0126-cil/cil_harvest_YYYY-MM-DD/metadata_source.
data harvest = the data (image and video) files at https://cildata.crbs.ucsd.edu/media are copied into \rdcp-0126-cil\cil_harvest_YYYY-MM-DD\content_files.
metadata transformation = the metadata in the harvested JSON files is transformed into our OLR metadata template according to CIL Processing and Mapping Instructions, and saved as \rdcp-0126-cil\cil_harvest_YYYY-MM-DD\metadata_processed\cil_excel_object_input.csv.

I'm pretty certain step 1 has to happen first, but not sure about steps 2 and 3. Is there a fixed sequence, or are these steps performed independently of one another?

@arwenhutt @lsitu @abbypenn93 Correct me if I'm wrong, but the date in the \cil_harvest_YYYY-MM-DD folder name is meant to reflect the date of the source metadata harvest? And even if data harvest or metadata transformation occurs after this date, the results of running the script on source metadata harvested on this date should go into this same folder? This further means that we could run the metadata transformation at any future date, on source metadata that we harvest today, and the only way to know which version of the script was run is to look at Date modified in the properties of \metadata_processed\cil_excel_object_input.csv file. If all of this is true, this means that currently, \rdcp-0126-cil\cil_harvest_2019-03-07\metadata_processed\cil_excel_object_input.csv is the result of running the transformation script that was current on 2019-08-21, on the JSON files that were harvested on 2019-03-07. Unfortunately, this last bit doesn't seem to be the case - the JSON files are dated 2019-08-20.

As the next step in QA, I agree we should do another complete metadata & data harvest followed by metadata transformation, so Abby is certain to be QAing the latest transformation (OLR) on the full set of metadata from CIL? Then, going forward, can we standardize the way we generate harvest folders in rdcp-staging? If anyone has a suggestion for revising the suggested workflow above, please comment. @abbypenn93 has a suggestion for renaming the existing folder rdcp-0126-cil and creating a new, empty rdcp-0126-cil so @lsitu can start the next harvest with a fresh folder. Does this sound agreeable to all?

lsitu commented 4 years ago

@abbypenn93 Is there anything in those old harvest files that you want to keep? I think you can rename the folder to something with a different naming pattern if you need it for future reference. Once you give the go-ahead, we can start a new full dump as a clean harvest initiation. Thanks.

@hjsyoo Regarding the CIL workflow sequence, step #1 and step #2 happen at the same time. That is, the content files of an object are downloaded as soon as its JSON source file is retrieved. The JSON files can therefore carry different timestamps, since it may take several days to download all the CIL objects. After all JSON and content files are downloaded, step #3 (metadata transformation) starts. For the initiation step this also takes several days, since we need to run SPARQL to check for existing objects based on the samplenumber identifier.

We have no control over the CIL public API to download JSON files within a specific time range, so if for any reason we need to redo a harvest, it will include all objects that were modified since that specific lastModified date.

As we move forward to prod, we may need to either disable the automatic monthly harvest process on QA and Staging, or configure a different staging area/folder for QA and Staging. What do you want to do?

hjsyoo commented 4 years ago

@lsitu Thanks for clarifying the workflow sequence. Does the SPARQL script verify that all the data files listed in the JSON records exist at cildata.crbs.ucsd.edu, or does it check that they were successfully downloaded to \content_files?

It makes sense that in the ongoing harvests, we will only grab objects that were modified since the lastModified date. I'll convey that workflow to Willy so he knows that modified objects will get duplicated in the DAMS. From what he's told us previously, this should be fine since they don't normally modify their objects. But if he suddenly does a batch edit on all the JSON files, I'm assuming we'll end up harvesting everything again?

Having a separate top level folder in rdcp-staging for the development harvests makes sense. We should also disable automatic harvesting on QA and Staging once we move to prod. I think @abbypenn93 already has a plan for folder organization, and will let us know when it's ready!

abbypenn93 commented 4 years ago

There are now two CIL folders on staging:

rdcp-0126-cil (folder for new harvests)
rdcp-0126-cil-development (previous harvests used during early stages of QA)

abbypenn93 commented 4 years ago

Please note that I'll be attending the RDC retreat today and tomorrow and as a result will be slow to respond to messages. Also, I'm currently revising the "CIL Processing and Mapping" document so it will be a bit messy for a few days.

lsitu commented 4 years ago

@hjsyoo The SPARQL query checks for CIL objects already ingested in DAMS, based on the samplenumber identifier, so that no duplicate objects are ingested. So with the current process, all previously ingested objects that were modified will be ignored.

lsitu commented 4 years ago

@abbypenn93 I've initiated a full CIL harvest dump to the empty directory rdcp-0126-cil, which will take several days to be finished.

As we move forward with the CIL harvest on prod, let's use rdcp-0126-cil-development (previous harvests used during early stages of QA) for the QA and staging environments. The current location rdcp-0126-cil (folder for new harvests) will be cleaned up for prod once QA is done. Does that sound good?

@mcritchlow I've created a PR for the change: https://github.com/ucsdlib/private_config/pull/20. It's ready for review now. Thanks.

abbypenn93 commented 4 years ago

@lsitu In the process of updating the CIL documentation, I realized that there's a problem with the CIL_CCDB.CIL.CORE.ATTRIBUTION.PUBMED field. Entries in the spreadsheet for PubMed related publications appear like this one: PubMed ID: https://www.ncbi.nlm.nih.gov/pubmed/?term=PMID:15036382

However, this link doesn't fully resolve. Instead, it should read as follows (without the "PMID:" string): PubMed ID: https://www.ncbi.nlm.nih.gov/pubmed/?term=15036382

Pre-processed string: "PUBMED":["PMID:15036382"]
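The requested link format can be illustrated with a minimal sketch. It assumes the raw value always looks like the pre-processed string above; the function name `pubmed_link` and the constant are hypothetical and not taken from damsmanager.

```python
# Hypothetical helper: builds the PubMed link described above.
# The base URL is copied from the example link in this comment.
PUBMED_SEARCH_URL = "https://www.ncbi.nlm.nih.gov/pubmed/?term="


def pubmed_link(raw_id: str) -> str:
    """Build a PubMed link from a raw value such as 'PMID:15036382'."""
    value = raw_id.strip()
    # Drop the "PMID:" prefix (case-insensitive) when it is present.
    if value.upper().startswith("PMID:"):
        value = value[len("PMID:"):].strip()
    return PUBMED_SEARCH_URL + value


record = {"PUBMED": ["PMID:15036382"]}
links = [pubmed_link(v) for v in record["PUBMED"]]
print(links[0])  # https://www.ncbi.nlm.nih.gov/pubmed/?term=15036382
```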

lsitu commented 4 years ago

@abbypenn93 I think I can make all the changes you need together. So just update the instructions in the documentation and let me know what needs to change. It would be nice if you could add examples for the mapping. Thank you.

hjsyoo commented 4 years ago

@abbypenn93 I've made a minor revision to the processing instructions in https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit#gid=1321122337 for the following mappings. Please check that they look consistent with your instructions.

CIL_CCDB.Citation.DOI --> Related resource:related
CIL_CCDB.CIL.Citation.Title --> Note:preferred citation

abbypenn93 commented 4 years ago

Great, thanks.

lsitu commented 4 years ago

@abbypenn93 Can you list the details of all the mapping changes we need so that I can update damsmanager accordingly? One issue I see is that the JSON source path CIL_CCDB.CIL.Citation.Title in the spreadsheet has the extra key .CIL; it should be CIL_CCDB.Citation.Title, right? Thanks.

abbypenn93 commented 4 years ago

I think the documentation revision for the CIL_CCDB.Citation.Title field was made a while back--anyway, it looks okay now.

I'm updating the documentation to reflect changes based on our discussions in this ticket and the resulting code changes. I'll be done later this morning and will let you know when the document is ready for review.

The report from Friday, related to CIL_CCDB.CIL.CORE.ATTRIBUTION.PUBMED, is not a change to the transformation script functionality (though I did make a slight change to the way the processing is described for clarification).

Please let me know if you need anything else.

lsitu commented 4 years ago

@abbypenn93 On the coding and damsmanager side, we need all paths in the documentation (https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit#gid=1321122337) to be correct, and any changes made since the first round of QA need to be listed so that I can update the code accordingly. Otherwise, the changes you require won't be applied during CIL transformation at all. Thank you.

abbypenn93 commented 4 years ago

Please review the revised mapping document "CIL Ongoing Harvesting Process" (https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit#gid=1321122337).

Revisions include:

a. Items discussed in this ticket, especially the items from Aug. 1 (discussed in question-and-answer fashion in the Aug. 2 comment).
b. The addition of raw (unprocessed) metadata examples. Most cells in the document are affected.
c. In some cases, more extensive processing instructions.

As far as I am aware, items a-c match the functionality of the current metadata transformation script.

There are three new issues that have not yet been addressed in the metadata transformation script:

d. From Ho Jung's comment a few days ago (https://github.com/ucsdlib/damsmanager/issues/317#issuecomment-529037386)

CIL_CCDB.Citation.DOI --> Related resource:related -- The Related Resource should not include "doi:" in the URL.
CIL_CCDB.CIL.Citation.Title --> Note:preferred citation -- The Preferred Citation doesn't need "DOI placeholder"

e. From my comment from Friday (https://github.com/ucsdlib/damsmanager/issues/317#issuecomment-529029547) regarding PubMed link formatting. There is no change to the specification, just a clarification.

I'm using letters to itemize here since we use numbers for each CIL issue reported.

Longshou, there is a question for you in cell D4 of the mapping spreadsheet.

Please let me know if I missed anything! Thanks, Abby

lsitu commented 4 years ago

@abbypenn93 Thanks. This is really helpful for tracking what to update in the damsmanager code. For your question in cell D4, the Object Unique ID is just the JSON source filename in the metadata_source folder. It comes from the public CIL API https://cilia.crbs.ucds.edu/rest/public_ids?* when looking up the object IDs, and the file is named at the time the source JSON is downloaded.

lsitu commented 4 years ago

@abbypenn93 Could you review cell D84 for the path CIL_CCDB.CIL.Citation.Title again? I believe it should be CIL_CCDB.Citation.Title (without the CIL key). Thanks.

abbypenn93 commented 4 years ago

Yes! I think you've pointed this out before but it didn't quite sink in, thanks! Set to CIL_CCDB.Citation.Title now.

lsitu commented 4 years ago

@abbypenn93 I see the CIL harvest process is broken with the latest version of the CIL Processing and Mapping Instructions:

The first row is now Terminology, whereas in the old version it held the column headers.
The values in cell D5 have changed to expressions, whereas they were in the multi-value format data|still image.

Just to let you know, DAMS Manager parses and applies the original CIL Processing and Mapping Instructions for the mapping dynamically. So any changes in the columns Source / CCDB field, Ingest File Header, and Processing and Fixed Values Instructions may break the CIL harvest process if we fail to adjust the code accordingly. Are there any other changes to the columns above that I should pay attention to? Thanks.

lsitu commented 4 years ago

For the change CIL_CCDB.CIL.Citation.Title --> Note:preferred citation -- The Preferred Citation doesn't need "DOI placeholder" in https://github.com/ucsdlib/damsmanager/issues/317#issuecomment-529667020, do you have an example for The Preferred Citation doesn't need "DOI placeholder"?

hjsyoo commented 4 years ago

I was just thinking that the text, "DOI placeholder", in the citation is unnecessary, since the next step will be to ask you to do a batch DOI minting for all the records? Is this correct? I've requested batch minting after ingest for other collections, so I assumed we would do the same for this collection. That is, minting will be done after the records are ingested into prod. So, the processed citation we would need for the OLR is, for example: Richard Wheeler (2019). CIL:10603, Leishmania mexicana, parasite. In Cell Image Library. UC San Diego Library Digital Collections. Dataset. If this is incorrect, then that particular processing workflow should be revised. Would it be easier/better to mint DOIs and insert them during the processing step? I wouldn't think so, since we won't have a target location until ingest to prod is done.

lsitu commented 4 years ago

@hjsyoo Got it. Thanks. I'll simply trim the text "DOI_placeholder" from the end of the Note:preferred citation then.
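A minimal sketch of that trimming step follows. It is an illustration only: the thread spells the placeholder both as "DOI placeholder" and "DOI_placeholder", so the pattern accepts either, and the name `trim_doi_placeholder` is hypothetical rather than the actual damsmanager code.

```python
import re

# Assumption: the placeholder sits at the very end of the citation and is
# spelled either "DOI placeholder" or "DOI_placeholder".
_PLACEHOLDER = re.compile(r"\s*DOI[ _]placeholder\s*$", re.IGNORECASE)


def trim_doi_placeholder(citation: str) -> str:
    """Remove a trailing DOI placeholder; leave other citations untouched."""
    return _PLACEHOLDER.sub("", citation)


expected = (
    "Richard Wheeler (2019). CIL:10603, Leishmania mexicana, parasite. "
    "In Cell Image Library. UC San Diego Library Digital Collections. Dataset."
)
trimmed = trim_doi_placeholder(expected + " DOI_placeholder")
print(trimmed == expected)  # True
```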

arwenhutt commented 4 years ago

@lsitu I have a question about this:

Just to let you know, DAMS Manager parses and applies the original CIL Processing and Mapping Instructions for the mapping dynamically. So any changes in the columns Source / CCDB field, Ingest File Header, and Processing and Fixed Values Instructions may break the CIL harvest process if we fail to adjust the code accordingly. Are there any other changes to the columns above that I should pay attention to? Thanks.

It sounds like you are saying that the mapping document we created (CIL Processing and Mapping Instructions) is actually driving the code. So if we modify anything in that spreadsheet - it impacts the transformation process itself.

Is that correct?

lsitu commented 4 years ago

@arwenhutt Yes. The spreadsheet is downloaded and applied for the CIL transformation process in damsmanager. The columns that may impact the transformation process are Source / CCDB field, Ingest File Header, and Processing and Fixed Values Instructions. Any value or format change in those columns may impact the CIL transformation process. So far I have found that the following changes broke the CIL transformation process:

The new Terminology row, which precedes what was the column header row in the old version.
The values in cell D5, which changed to expressions instead of the multi-value format data|still image used in the old version.
The format change in Row#58 - Row#75 for note technical details.

I think it would be good to highlight all such changes so that we can pay attention to them. Thanks.
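One way to make a spreadsheet-driven parser less sensitive to an extra leading row is to search for the header row instead of assuming it is the first one. The sketch below is a hedged illustration, not the damsmanager implementation: the three column names are taken from this comment, while `find_header_row` and the sample rows are hypothetical.

```python
import csv
import io

# Column names as listed in this thread; treating them as exact header
# strings is an assumption.
REQUIRED = (
    "Source / CCDB field",
    "Ingest File Header",
    "Processing and Fixed Values Instructions",
)


def find_header_row(rows):
    """Return (row index, {column name: position}) for the first row
    that contains every required header."""
    for i, row in enumerate(rows):
        cells = [c.strip() for c in row]
        if all(name in cells for name in REQUIRED):
            return i, {name: cells.index(name) for name in REQUIRED}
    raise ValueError("mapping sheet has no row with the required column headers")


# Hypothetical export of the mapping sheet with a leading Terminology row.
sample = """Terminology,,,
Object,Source / CCDB field,Ingest File Header,Processing and Fixed Values Instructions
1,CIL_CCDB.Citation.Title,Note:preferred citation,example instruction
"""
rows = list(csv.reader(io.StringIO(sample)))
index, columns = find_header_row(rows)
print(index, columns["Ingest File Header"])  # 1 2
```

Failing loudly when the headers are missing, rather than silently mapping the wrong cells, would also surface format changes like the ones listed above as soon as the sheet is loaded.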

lsitu commented 4 years ago

@mcritchlow I've added PR https://github.com/ucsdlib/damsmanager/pull/365 to apply the new mapping instructions template and fix the mapping issues that were raised above. It's ready for review now. Thanks.

arwenhutt commented 4 years ago

@lsitu Oh, wow, we had no idea...we wrote and have been working with the document as a blueprint or set of instructions rather than as a piece of the code itself. Is this a new approach to metadata transformation? or one that has been used for other collections?

lsitu commented 4 years ago

@arwenhutt I think CIL is the first collection for which we are pursuing dynamic transformation in damsmanager. The idea is simply to follow the syntax and format in the spreadsheet, which gives some flexibility over the mappings. A similar approach is applied for the Excel Input Stream Template for header and value validation. This approach could also be applied to other collections with a JSON source, as long as the mapping follows the same pattern.

lsitu commented 4 years ago

@hjsyoo @arwenhutt @abbypenn93 With the updates in the new version of the mapping document (CIL Processing and Mapping Instructions), should we stop the current full CIL harvest dump on staging and start a new process with the new mapping document instead? I just checked the full CIL harvest dump that is running on staging, and it's still downloading JSON with content files, so it will take days to finish.

abbypenn93 commented 4 years ago

@lsitu @arwenhutt I'm not so much concerned about the harvest itself (though I'm hoping to do QA on all available objects), as the metadata transformation process. If it is all done in one step then, yes please stop the harvest and restart it using the latest script based on the new mapping document.

lsitu commented 4 years ago

@arwenhutt / @abbypenn93 Have you created the folder rdcp-0126-cil-staging-qa for staging and QA? I think we can deploy damsmanager and private-config with the new mapping instructions to staging, which will stop the current CIL harvest process. And I can start a new round of full CIL dump in the folder rdcp-0126-cil-staging-qa on staging.

arwenhutt commented 4 years ago

@lsitu yep rdcp-staging\rdcp-0126-cil-staging-qa

arwenhutt commented 4 years ago

@lsitu it looks like file download to rdcp-0126-cil-staging-qa stopped yesterday morning ~7:30am, but the processed metadata files haven't shown up on rdcp-staging yet. Can you tell where it's at/what's happening?

@jessicahilt mentioned that you'll be working on Chronopolis for the next few weeks, which would be a good time for @abbypenn93 to be reviewing the updated metadata output.

lsitu commented 4 years ago

@arwenhutt It looks like the lib-hydratail-staging server was rebooted yesterday. Would you like to delete the source JSON and content files that were downloaded for a fresh restart, or just overwrite the downloaded JSON source while keeping the content files?

arwenhutt commented 4 years ago

@lsitu we aren't sure if all the metadata files downloaded or if it was interrupted. Can you re-harvest the json source metadata files and start the metadata transformation process?

lsitu commented 4 years ago

@arwenhutt I think it was interrupted by the server reboot, and I restarted the CIL harvest process last night. But I don't know how long it'll take since it's still downloading the JSON source files at this time.

abbypenn93 commented 4 years ago

I'll start QAing the new harvest and metadata transformation this week.

lsitu commented 4 years ago

Thanks @abbypenn93. It's finally ready for you. Just to note, we started the CIL harvest initiation process on Friday, Sep. 13, and it was interrupted a couple of times on staging. It looks like it takes around two weeks to complete the CIL harvest initiation step.

abbypenn93 commented 4 years ago

@lsitu Hopefully future harvests won't take as long since there will be fewer objects.

lsitu commented 4 years ago

@abbypenn93 Agreed. Only the initiation step takes this long. It also looks like it may have hit a stalled connection issue on the other side while downloading content files on staging.

abbypenn93 commented 4 years ago

QA of cil_harvest_2019-09-13 metadata transformation

This metadata transformation is really looking good (especially Note:technical details)!

Remaining issues:

Please note: the item numbers below do not refer to items reported in previous QA reports.

Item 1: Ingest File Header: person:researcher

a. Names are present but are listed in separate columns (should be located in one column and separated by a " | "). There are over 35 columns due to objects like 45701 which have a long list of researchers.

b. All the researcher names should appear in the heading ingest file (cil_excel_subject_headings.csv):

subject type: person:researcher
subject term: CIL_CCDB.CIL.CORE.ATTRIBUTION.Contributors

Item 2: Ingest File Header: subject:topic

Topics are present but are listed in separate columns (should be in one column, separated by a " | ").

Item 3: Ingest File Header: subject:anatomy

Topics in this section are present but are listed in separate columns (should be in one column, separated by a " | ").
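The single-column layout requested in Items 1 to 3 could be produced from the current flat layout along these lines. This is a sketch under assumptions: it supposes the repeated columns share the same header name, and `collapse_repeated_columns` and the sample values are hypothetical placeholders.

```python
def collapse_repeated_columns(header, row, sep=" | "):
    """Merge columns that share a header name into one cell,
    joining the non-empty values with `sep`."""
    merged = {}  # dicts keep insertion order, so column order is preserved
    for name, value in zip(header, row):
        merged.setdefault(name, [])
        if value.strip():
            merged[name].append(value.strip())
    return list(merged), [sep.join(values) for values in merged.values()]


# Placeholder data; real rows would come from cil_excel_object_input.csv.
header = ["Object Unique ID", "person:researcher", "person:researcher", "subject:topic"]
row = ["45701", "Researcher A", "Researcher B", "Topic X"]
new_header, new_row = collapse_repeated_columns(header, row)
print(new_header)  # ['Object Unique ID', 'person:researcher', 'subject:topic']
print(new_row)     # ['45701', 'Researcher A | Researcher B', 'Topic X']
```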

Item 4: Ingest File Header: Note:description

Encoding (from source json file) handling issue affecting 1136 objects. This content is not needed. Please delete the tags and values within the tags.

Examples start at object 32167.

Item 5: Ingest File Header: Related resource:related

Encoding error (96 instances). Registered trademark symbol (®) appears at the end of the string: “2004 Olympus BioScapes Digital Imaging Competition¬Æ @ …” Symbol is correct in the json file.

Example object: 42516

Did a test ingest using ¬Æ symbol in DAMS staging. DAMS does not convert this into a registered trademark symbol: http://www.olympusbioscapes.com/staticgallery/2004/hm19.html

Item 6: Ingest File Header: copyright status

“copyright status” column not present. Return (non-copyright) value that appears after "TERMSANDCONDITIONS": {"free_text":

Please note that I'm away at a conference this week so I may be slow to respond to messages. Thanks, Abby

lsitu commented 4 years ago

Thanks @abbypenn93. Following are my comments for the issues you reported:

Remaining issues:

Please note: the item numbers below do not refer to items reported in previous QA reports.

Item 1: Ingest File Header: person:researcher

a. Names are present but are listed in separate columns (should be located in one column and separated by a " | "). There are over 35 columns due to objects like 45701 which have a long list of researchers.

A: As we discussed for the Language tag last time, we use the same flat, spread-out columns for all our Excel export/import tools, such as the mods/marc import tool, Batch Export, and the Batch Import tool. So I think we should use the same flat column pattern across all our tools that support the Excel format.

b. All the researcher names should appear in the heading ingest file (cil_excel_subject_headings.csv):

subject type: person:researcher
subject term: CIL_CCDB.CIL.CORE.ATTRIBUTION.Contributors

A: Yes, we currently include only columns starting with subject: in cil_excel_subject_headings.csv. I'll add all columns starting with person: then. But what about columns starting with corporate:, if there are any?
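As a rough illustration of that change, the headings could be gathered by header prefix as sketched below. The prefixes come from this discussion; the function name, the de-duplication of repeated terms, and the sample rows are assumptions rather than the actual damsmanager behavior.

```python
# Prefixes discussed above; "corporate:" could be appended if needed.
HEADING_PREFIXES = ("subject:", "person:")


def collect_headings(header, rows, prefixes=HEADING_PREFIXES):
    """Return unique (subject type, subject term) pairs, in first-seen order,
    from every column whose header starts with one of `prefixes`."""
    seen = set()
    headings = []
    for row in rows:
        for name, value in zip(header, row):
            term = value.strip()
            if term and name.startswith(prefixes) and (name, term) not in seen:
                seen.add((name, term))
                headings.append((name, term))
    return headings


# Placeholder rows standing in for cil_excel_object_input.csv.
header = ["Object Unique ID", "person:researcher", "subject:topic", "Note:description"]
rows = [
    ["1", "Researcher A", "Topic 1", "text"],
    ["2", "Researcher A", "Topic 2", "text"],
]
for subject_type, subject_term in collect_headings(header, rows):
    print(subject_type, "|", subject_term)
```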

Item 2: Ingest File Header: subject:topic

Topics are present but are listed in separate columns (should be in one column, separated by a " | ").

A: The same. Please refer to my comment in Item 1 above.

Item 3: Ingest File Header: subject:anatomy

Topics in this section are present but are listed in separate columns (should be in one column, separated by a " | ").

A: The same. Please refer to my comment in Item 1 above.

Item 4: Ingest File Header: Note:description

Encoding (from source json file) handling issue affecting 1136 objects. This content is not needed. Please delete the tags and values within the tags.

Examples start at object 32167.

A: What content is not needed, Note:description? Could you give me an example of which tags and values need to be deleted ("Please delete the tags and values within the tags")?

Item 5: Ingest File Header: Related resource:related

Encoding error (96 instances). Registered trademark symbol (®) appears at the end of the string: “2004 Olympus BioScapes Digital Imaging Competition¬Æ @ …” Symbol is correct in the json file.

Example object: 42516

Did a test ingest using ¬Æ symbol in DAMS staging. DAMS does not convert this into a registered trademark symbol: http://www.olympusbioscapes.com/staticgallery/2004/hm19.html

A: Sure. I'll see how to fix the corrupted trademark symbol.

Item 6: Ingest File Header: copyright status

“copyright status” column not present. Return (non-copyright) value that appears after "TERMSANDCONDITIONS": {"free_text":

A: Hmm, this was fixed, and the transformation tests passed. I'll see why it's missing from the CSV output.

lsitu commented 4 years ago

@abbypenn93 Have you had a chance to take a look at my comments above? I can now provide an update on Item 5 and Item 6:

Item 5: Ingest File Header: Related resource:related

Encoding error (96 instances). Registered trademark symbol (®) appears at the end of the string: “2004 Olympus BioScapes Digital Imaging Competition¬Æ @ …” Symbol is correct in the json file.

Example object: 42516

Did a test ingest using ¬Æ symbol in DAMS staging. DAMS does not convert this into a registered trademark symbol: http://www.olympusbioscapes.com/staticgallery/2004/hm19.html

A: Sure. I'll see how to fix the corrupted trademark symbol.

Update: When I open the file in a text editor, the trademark symbol looks correct. So I guess you just double-clicked to open the CSV in Excel. I would suggest importing the CSV into Excel as UTF-8 instead: Mark Ross, 2004 Olympus BioScapes Digital Imaging Competition® (2019) CIL:42516, Rattus. In Cell Image Library.
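The garbled characters are consistent with UTF-8 bytes being read under a different encoding. The sketch below reproduces them by decoding the UTF-8 bytes of ® as Mac Roman; that Excel used this particular encoding when opening the file is an assumption that merely matches the observed "¬Æ".

```python
# The string as stored in the source JSON (® is U+00AE).
original = "2004 Olympus BioScapes Digital Imaging Competition\u00ae"

# In UTF-8, ® is the two bytes 0xC2 0xAE.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Mac Roman turns each byte into its own character.
misread = utf8_bytes.decode("mac_roman")
print(misread)  # ends with the two characters ¬Æ

# Reversing the wrong decoding recovers the original text, which shows
# that the CSV bytes themselves are intact.
repaired = misread.encode("mac_roman").decode("utf-8")
print(repaired == original)  # True
```

When reading the file programmatically, opening it with an explicit `encoding="utf-8"` avoids the problem in the same way the Excel import option does.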

Item 6: Ingest File Header: copyright status

“copyright status” column not present. Return (non-copyright) value that appears after "TERMSANDCONDITIONS": {"free_text":

A: Hmm, this was fixed, and the transformation tests passed. I'll see why it's missing from the CSV output.

Update: I think it's expecting the column title to be copyrightStatus instead of copyright status, as in the Batch Export tool. I can update the CIL harvesting to support mapping for the header copyright status if this is just for internal reference. But let me know if you want to keep it consistent with the headers used in the Batch Export tool.
