@lsitu I think we can go ahead with reviewing the metadata transformation before the updated process for harvesting files is figured out. It looks like the json files were downloaded to staging, can we get the metadata ingest files from step 4? Thanks!
@arwenhutt Yes. I think we can start to verify it once Release 2.71 is done.
@lsitu great, thanks!
@rstanonik It looks like damsmanager doesn't yet have write access to the directory /pub/data2/damsmanager/dams_staging/rdcp-staging/rdcp-0126-cil/ for the CIL metadata transform on staging. It failed with the error Read-only file system when damsmanager tried to create the output file cil_excel_headings.csv. Could you check whether the tomcat user for damsmanager on staging can write to the directory /pub/data2/damsmanager/dams_staging/rdcp-staging/rdcp-0126-cil/? Thanks.
Here is the error I got from the tomcat log on staging:
java.io.FileNotFoundException: /pub/data2/damsmanager/dams_staging/rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_headings.csv (Read-only file system)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
at edu.ucsd.library.xdre.web.CILHarvestingTaskController.writeContent(CILHarvestingTaskController.java:208)
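One way to confirm the problem independently of damsmanager is to attempt a throwaway write in the target directory. The following is a minimal Python sketch, not part of damsmanager: the function name `can_write_to` is hypothetical, and it only gives a meaningful answer when run as the same user that tomcat runs as.

```python
import os
import tempfile


def can_write_to(directory):
    """Return (True, None) if a temporary file can be created in `directory`,
    otherwise (False, <error message>)."""
    try:
        # Creating and removing a real file also catches read-only mounts,
        # which a check of the permission bits alone can miss.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True, None
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
```

Calling it on the metadata_processed directory from the stack trace above would be expected to return a failure tuple for as long as the file system is mounted read-only for that user.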
@rstanonik Have you had a chance to look into the write permission on staging for damsmanager? The tomcat user needs write access to dams_staging (/pub/data2/damsmanager/dams_staging/rdcp-staging/rdcp-0126-cil/) for the CIL ingest. Thanks.
I'm giving tomcat user rw access now, but it will take a while, there are over 1 million files. In which environments? prod, staging, qa?
@rstanonik Thanks. Yes, going forward I think we need that to be set up for prod and QA as well.
@lsitu Try now, tomcat user should have rw access in prod, staging, and qa.
Thanks @rstanonik. I'll run a test for it.
@arwenhutt The CIL metadata transformation process finished over the weekend, and I think the transformed CSV output is ready for you to review now. Thanks.
Here is the location of the output: /rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_headings.csv
@lsitu great! @abbypenn93 I don't think I'll be able to look at this till Thursday, you don't need to wait for me if you have time before then, but we can schedule some time Thursday to look at it together. Sound good?
Sounds good.
When reviewing the transformed metadata, we found the following issues. Please let me know if you have any questions.
Records missing from output file:
Copyright items are included in the output file, these should be excluded as part of Step 2.b. of the process (https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit?pli=1#gid=1321122337)
Component format needs to be updated (includes backslashes):
Content missing from output file for CELLTYPE, CELLULARCOMPONENT, HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION.
For instance, object 39793:
CELLTYPE: subject:anatomy field should have: cell by organism, eukaryotic cell, Eukaryotic Protist, Ciliated Protist
CELLULARCOMPONENT: subject:anatomy field should have contractile vacuole, contractile vacuole pore, contractile vacuolar membrane, cell cortex
For HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION: have not found any data for these fields in the output file.
End date - column missing from output file.
Related resource:related field - components reversed:
TERMSANDCONDITIONS - no copyright note column in output file.
Note:preferred citation - no column in output file.
Thanks @abbypenn93. I've added my questions on your review comments below, each starting with >:
> Do you have a couple of examples of missing records so that I can inspect them specifically to see why they are missing?
> What are the rules for identifying copyright items so that they can be excluded from the CSV output?
> It seems like there's a gap there. I was thinking that the heading ingest file in [4a] is the CSV heading output itself that you are reviewing. Now I see you mean the Subject heading ingest file. I'll see how to produce it in the next step.
> Sure. This new syntax will be applied next.
For instance, object 39793:
CELLTYPE: subject:anatomy field should have: cell by organism, eukaryotic cell, Eukaryotic Protist, Ciliated Protist
CELLULARCOMPONENT: subject:anatomy field should have contractile vacuole, contractile vacuole pore, contractile vacuolar membrane, cell cortex
> I'll look into object 39793 for the missing subject:anatomy fields.
> Could you give me an example that contains the fields for HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION? I don't see these fields in object 39793.
> Could you give me more instructions, with examples, on how multiple values in CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE should be mapped to the date:creation value, beginDate and endDate? I think we may need examples that only have the beginDate but no endDate, if any such item exists.
> Do you have an example that contains the above Related resource:related field?
> I don't think we have the copyright note header/field in our current Excel Standard InputStream. What's the instruction to convert the TERMSANDCONDITIONS field into the copyright element?
> Do you have an example that contains the Note:preferred citation field so that I can take a look?
Thanks.
Responses to 5/6/19 post (organized by original item number):
Item 1. Records missing
Do you have a couple of examples that are missing so that I can inspect them specifically to see why they are missing?
Examples of missing json files include: 2, 111, 120, 122-126, 130
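A quick way to list every missing record, rather than spot-checking, is to compare the IDs of the harvested .json files against the IDs in the output CSV. This is only a sketch under stated assumptions: it assumes each source file is named `<id>.json`, that the CSV has a column called "Object Unique ID" (the name used later in this thread) holding that same ID, and the function name is hypothetical.

```python
import csv
from pathlib import Path


def missing_from_output(json_dir, csv_path, id_column="Object Unique ID"):
    """Return the IDs that have a <id>.json source file but no row in the CSV."""
    source_ids = {path.stem for path in Path(json_dir).glob("*.json")}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        output_ids = {
            row[id_column].strip()
            for row in csv.DictReader(handle)
            if row.get(id_column)
        }
    # Sort shorter IDs first so purely numeric IDs come out in numeric order.
    return sorted(source_ids - output_ids, key=lambda value: (len(value), value))
```

If the CSV stores the ID in a different form (for example with a prefix), the comparison would need a normalization step first.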
Item 2. Copyright
What's the rules for copyright items that can be applied to exclude them from the CSV output?
From document: https://docs.google.com/document/d/1Eg2024XATxuwdzFtoLTNujalK2SDR_s6-Sg6zyUC9wE/edit
Content for harvesting identified. Conditions:
Not already harvested: see OLR file, but essentially loop through the directory in the git repo and see if IDs exist in the dams already.
Not under copyright: CIL_CCDB.CIL.CORE.TERMSANDCONDITIONS.free_text != copyright
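The "not under copyright" condition can be sketched as a small filter over the parsed .json records. This is illustrative Python, not the damsmanager implementation: it assumes the dotted path CIL_CCDB.CIL.CORE.TERMSANDCONDITIONS.free_text corresponds to nested JSON objects, the helper names are hypothetical, and the case-insensitive comparison is an assumption.

```python
def get_path(record, dotted_path):
    """Walk a dotted path through nested dicts; return None if any step is missing."""
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def is_under_copyright(record):
    """True when TERMSANDCONDITIONS.free_text is 'copyright' (case-insensitive, assumed)."""
    value = get_path(record, "CIL_CCDB.CIL.CORE.TERMSANDCONDITIONS.free_text")
    return isinstance(value, str) and value.strip().lower() == "copyright"


def harvestable(records):
    """Keep only the records that are not marked as under copyright."""
    return [record for record in records if not is_under_copyright(record)]
```

Records with no TERMSANDCONDITIONS section are kept by this sketch; whether that is the desired behavior is not stated in the mapping document quoted above.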
Item 5: Missing content
For HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION: have not found any data for these fields in the output file.
Could you give me an example that contains the fields for HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION?
HUMAN_DEV_ANATOMY (maps to subject:anatomy):
Example 1: appears in 34598.json but not in output file:
"HUMAN_DEV_ANATOMY": [ { "onto_name": "liver",
Example 2: appears in 37223.json but not in output file:
"HUMAN_DEV_ANATOMY": [ { "onto_name": "superior cervical ganglion",
HUMAN_DISEASE (maps to subject:topic):
Example 1: appears in 10457.json but not in output file:
"HUMAN_DISEASE": [ { "onto_name": "toxoplasmosis",
Example 2: appears in 32212.json but not in output file:
"HUMAN_DISEASE": [ { "free_text": "prostate adenocarcinoma"
MOLECULARFUNCTION (maps to subject:topic field):
Correction: Please note that this field is present in some records. For instance object 10465 in the output file does contain the correct value for MOLECULARFUNCTION.
Example 1: appears in 12300.json but not in output file:
"MOLECULARFUNCTION": [ { "onto_name": "structural constituent of cytoskeleton", "onto_id": "GO:0005200" }, { "onto_name": "structural molecule activity", "onto_id": "GO:0005198"
Item 6: End date - column missing from output file.
Could you give me more instructions regarding how multiple values in CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE should be mapped to date:creation value, beginDate and endDate with examples? I think we may need examples that only have the beginDate but no endDate, if any such item exists.
CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE = Date:creation = Begin date = End date. All these dates, when available, will be the same.
Item 7. Related resource:related field - components reversed:
Do you have an example that contains the above Related resource:related field?
Example: object 37065
https://doi.org/doi:10.7295/W9CIL37065 @ Source Record in the Cell Image Library
Item 8. TERMSANDCONDITIONS - no copyright note column in output file.
I don't think we have the copyright note header/field in our current Excel Standard InputStream. What's the instruction to convert the TERMSANDCONDITIONS field into the copyright element?
There is no header/field in our Excel Standard InputStream, but DOMM will use the information contained in the CIL_CCDB.CIL.CORE.TERMSANDCONDITIONS.free_text field to assign copyright information during ingest.
Item 9. Note:preferred citation - no column in output file.
Do you have an example that contains the Note:preferred citation field so that I can take a look?
Instructions for preferred citation from the “CIL Processing and Mapping Instructions” document:
CIL_CCDB.CIL.Citation.Title:
Replace the YYYY value in "(YYYY)" with the current year.
Replace the text "CIL. Dataset" with "In Cell Image Library. UC San Diego Library Digital Collections. Dataset. DOI_placeholder".
E.g. "Sanford Palay (2011) CIL:10790, Rattus, brush border epithelial cell. CIL. Dataset" becomes "Sanford Palay (2018) CIL:10790, Rattus, brush border epithelial cell. In Cell Image Library. UC San Diego Library Digital Collections. Dataset. DOI_placeholder"
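The two replacement rules for the preferred citation can be sketched as follows. This is a hedged Python illustration rather than the damsmanager code: the function name is hypothetical, the year is passed in explicitly instead of being read from the clock, and the input title used in the usage note is an assumed pre-transformation form of the Sanford Palay example.

```python
import re

OLD_SUFFIX = "CIL. Dataset"
NEW_SUFFIX = (
    "In Cell Image Library. UC San Diego Library Digital Collections. "
    "Dataset. DOI_placeholder"
)


def preferred_citation(citation_title, year):
    """Apply the two replacement rules from the mapping instructions."""
    # Rule 1: replace the first "(YYYY)" with the given year.
    updated = re.sub(r"\(\d{4}\)", f"({year})", citation_title, count=1)
    # Rule 2: replace the last occurrence of "CIL. Dataset" with the new text.
    head, found, tail = updated.rpartition(OLD_SUFFIX)
    if not found:
        return updated
    return head + NEW_SUFFIX + tail
```

For instance, `preferred_citation("Sanford Palay (2011) CIL:10790, Rattus, brush border epithelial cell. CIL. Dataset", 2018)` yields the 2018 citation ending in "Dataset. DOI_placeholder". Titles that lack either pattern pass through unchanged for that rule.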
Please let me know if you have any other questions, Abby
Additional item: The dates in the Date:creation and Begin date (and ultimately End date) fields need to follow YYYY-MM-DD format but currently do not. For instance, for Date:creation, object 37225 shows 10/27/54 in the output file.
@abbypenn93 Thank you so much. I'll go over all of these and correct them.
Just to clarify, regarding the Additional item above: the Begin date (and ultimately End date) fields in object 37225 already follow the YYYY-MM-DD format. If you open the file with a text editor, you will see the date value 1954-10-27. For the Date:creation value, I think we can make it the same if all three values are the same and the date value can be parsed.
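A rough Python sketch of that idea, normalizing one source date into the three columns, might look like the following. The accepted input formats, the century rule for two-digit years, and the choice to leave Begin date and End date empty when parsing fails are all assumptions made for illustration; the function names are hypothetical.

```python
from datetime import date, datetime

# Input formats tried in order (assumed; the real source data may use others).
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y")


def normalize_date(raw, today=None):
    """Return the date as YYYY-MM-DD, or None if it cannot be parsed."""
    today = today or date.today()
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
        # A two-digit year that lands in the future is assumed to be last century.
        if fmt == "%m/%d/%y" and parsed > today:
            parsed = parsed.replace(year=parsed.year - 100)
        return parsed.isoformat()
    return None


def date_columns(raw, today=None):
    """Map one ATTRIBUTION.DATE value to Date:creation, Begin date and End date."""
    iso = normalize_date(raw, today)
    if iso is None:
        # Unparseable: keep the original text and leave the range empty (assumed).
        return {"Date:creation": raw, "Begin date": "", "End date": ""}
    return {"Date:creation": iso, "Begin date": iso, "End date": iso}
```

With this sketch, both 10/27/54 and 1954-10-27 normalize to 1954-10-27, so all three columns carry the same value.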
@mcritchlow Based on the QA comments from @abbypenn93 and our discussions, I've created PR https://github.com/ucsdlib/damsmanager/pull/328 to fix the CIL mapping issues. It's ready for review now. Thanks.
@mcritchlow if the PR automatically closes this ticket, can you reopen it for the next round of output QA?
Hi All, We could use an update on this project--are you ready for DOMM to do another round of metadata transformation QA? Thanks, Abby
Thanks for the ping, @abbypenn93. I think we're ready for another round of QA, unless @lsitu has a new update since May 24? I believe the work he's been doing with Willy only has to do with harvesting the data files themselves. The metadata harvesting work should be independent of that.
@hjsyoo / @abbypenn93 We've got the code for the QA work and Willy's REST API update ready, and we can deploy it to staging for review early next week. Both require deploying damsmanager to staging for testing, and the CIL metadata on GitHub is outdated, with lots of missing videos, so I think we had better review them all together. I'll initiate another round of CIL harvesting once damsmanager is deployed to staging next week.
@arwenhutt / @abbypenn93 The new round of the CIL metadata transformation process has finished, and the transformed CSV outputs are ready for you to review now:
/rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_object_input.csv
/rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_subject_headings.csv
That's good news, thanks.
I'm just about to report my QA findings. Looking better, but more work to do. Thanks, Abby
Status of issues reported in May (Ticket: https://github.com/ucsdlib/damsmanager/issues/317):
Item 1. Records missing
Status: Fixed: found 10081 unique objects in the spreadsheet = 10081 .json files in /rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_source;
However, please see notes below (item 2) about the records with the "copyright" attribute.
Item 2. Copyright
Status: Open
There are 400 .json files with "TERMSANDCONDITIONS": {"free_text": "copyright"}.
Essentially, only harvest .json files from Cell Image Library that DO NOT contain "TERMSANDCONDITIONS": {"free_text": "copyright"}
A few examples from the current harvest that contain "TERMSANDCONDITIONS": {"free_text": "copyright"}: 7596, 35148, 35589, 7752, 12818, 36412, 22705
Item 3. Subject heading ingest file
Status: Fixed - Subject file is present.
Open - in addition to the term (from onto_name), the onto_id should be placed in the "closeMatch" column of the subject heading document.
Item 4. Component format
Status: Fixed
Item 5. Missing content
Status: Fixed
Item 6. End date
Status: Fixed
Item 7. Related resource:related field
Status: Fixed
Example: https://doi.org/doi:10.7295/W9CIL40901 @ Source Record in the Cell Image Library
Item 8. TERMSANDCONDITIONS - no copyright note column in output file.
Status: Open
Item 9. Note:preferred citation - no column in output file (Note that the content is the same as in the Title column [CIL_CCDB.Citation.Title])
Status: Open
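For the open request to carry the onto_id into the "closeMatch" column, the row-building step could be sketched like this. It is an illustration only: "closeMatch" is the column named in this thread, while the other column names ("subject type", "subject term") and the function name are hypothetical, and falling back to free_text when onto_name is absent is an assumption based on the HUMAN_DISEASE example earlier in the thread.

```python
def subject_heading_rows(entries, subject_type):
    """Build one subject heading row per ontology entry in a source field."""
    rows = []
    for entry in entries:
        # Prefer the ontology term; fall back to free text (assumed behavior).
        term = entry.get("onto_name") or entry.get("free_text")
        if not term:
            continue
        rows.append(
            {
                "subject type": subject_type,
                "subject term": term,
                # Empty when the source entry carries no ontology identifier.
                "closeMatch": entry.get("onto_id", ""),
            }
        )
    return rows
```

Applied to the MOLECULARFUNCTION entries quoted for 12300.json, this would produce two rows whose closeMatch values are GO:0005200 and GO:0005198.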
New issues (based on 2019 July 3 harvest)
Can the file /rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_headings.csv be moved or deleted? It appears to be from the last harvest.
Please confirm that the harvester is looping through the directory in git repo to see if IDs already exist in the dams.
Source data issue: some .json files (and corresponding CCDB records) do not contain a CIL_CCDB.Citation.Title section.
Compare 50351 (http://www.cellimagelibrary.org/images/50351) with 243 (http://www.cellimagelibrary.org/images/243) (which has a citation).
Some of the source files that don't have CIL_CCDB.Citation.Title section: 50351 50352 50353 50354 50401 50451 50452 50453 50454 50512 50513 50514 50515 50516 50517 50518 50519 50520 50521
Solution: where CIL_CCDB_Citation.Title does not exist, set Title = Object Unique ID and Note:preferred citation = Object Unique ID
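The proposed fallback is simple enough to state as code. This is a sketch only: it assumes the citation lives at the nested path CIL_CCDB.Citation.Title in the parsed .json (the path as written in this comment), and the function name is hypothetical.

```python
def citation_title_or_id(record, object_id):
    """Return the citation title, or the object's unique ID when no title exists."""
    ccdb = record.get("CIL_CCDB")
    citation = ccdb.get("Citation") if isinstance(ccdb, dict) else None
    title = citation.get("Title") if isinstance(citation, dict) else None
    if isinstance(title, str) and title.strip():
        return title.strip()
    # No usable citation: the same ID would fill both Title and
    # Note:preferred citation, per the proposed solution.
    return str(object_id)
```

For a record such as 50351 with no citation section, the call returns "50351", which would then be written to both columns.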
Note that this is different from Item 12. For these objects:
-- zips, tifs, and jpg files are present
-- CIL_CCDB.Citation.Title (citations) are present in the json
Examples: 49451 49453 49651 49701 49751 49752 49753 49754 49755 49756 49757 49758 49759
There are 2 Date:creation columns (one is empty)
Person:researcher - name missing where > 1 name listed in the source file
-- when this occurs, only the last name on the list is returned
-- some objects are correct, for example 7105-7108, where 2 of 2 researchers are returned
Example objects with missing researcher names: 2, 2592, 1030, 1031, 1032, 1033
Object 2: One of two researchers missing (transformation returned Trudy Aebig; json reads "ATTRIBUTION": { "Contributors": [ "Linda Parysek", "Trudy Aebig"
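The symptom (only the last name returned) is typical of overwriting a single value inside a loop instead of accumulating all of them. A minimal Python sketch of collecting every contributor is shown below; the function name is hypothetical, and joining the names with " | " is an assumption about how multiple values are written to one Person:researcher cell.

```python
def researcher_names(attribution, delimiter=" | "):
    """Join every non-empty name in ATTRIBUTION.Contributors into one cell value."""
    contributors = attribution.get("Contributors") or []
    names = [name.strip() for name in contributors if isinstance(name, str) and name.strip()]
    return delimiter.join(names)
```

For object 2, `researcher_names({"Contributors": ["Linda Parysek", "Trudy Aebig"]})` keeps both researchers rather than only the last one.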
See "CIL Processing and Mapping Instructions" document (https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit#gid=1321122337), cell 51D.
Examples: 12577 - @ depts.washington.edu/fishscop/
140 - @ http://images.nigms.nih.gov/index.cfm?event=viewDetail&imageID=2456
36424 - @ http://ccdb.ucsd.edu/sand/main?event=displayAll&mpid=67
25609 - @ http://intl.pnas.org/content/105/29/10017.full
Type of Resource appears in two columns (has been split on pipe) but should appear in one. All object rows should read "data | still image"
This note is for future reference in the event that the situation arises more frequently with future harvests.
No action necessary at this time.
Note:description - for object 10016: split on pipe in text so that the first Note:description field contains "Issue 2", the second description field contains the first part of the description, and the third description field contains "Volume 7"
"IMAGEDESCRIPTION": {"free_text": "NIH 3T3 cell (mouse embryonic fibroblast line)\nstained for Actin (green) and DNA (blue).\n\nPLoS Biology February 2009|Volume 7|Issue 2| e1000038 \n\nActive-Site Inhibitors of mTOR Target Rapamycin-Resistant Outputs of mTORC1 and mTORC2 \n\nMorris E. Feldman, Beth Apsel, Aino Uotila, Robbie Loewith, Zachary A. Knight, Davide Ruggero, Kevan M. Shokat"},
10016 is the only object that uses these two extra description fields.
For example, in object 3216, IMAGEDESCRIPTION free_text (Note:description):
Tissue section of human prostate containing adenocarcinoma that has been immunostained for the cell-surface antigen BXP34. Nuclei are stained in blue. This image is part of a large collection of images generated from numerous specimens to characterize the distribution of BXP34 in human prostate tissue. A summary of the entire data set is provided below. No summary is available for BXP34 immunostain of human prostate.
\
\
This image is part of a large collection of immunohistochemistry images of cell-surface antigens generated by the SCGAP Urologic Epithelial Stem Cells (UESC) Project. The overall goal of the project is to characterize and isolate epithelial stem cell populations from two urologic organs, the prostate and bladder. Links are provided below for the UESC Project database, the entire human prostate immunostain summary, the BXP34 immunostain summary, and information on the specimen that this image is from. Other images of BXP34 human prostate immunostains are accessible following the group link.
Another example is 32178:
\
\
Cell Type | % intense | % equivocal | % none | # assays |
---|
Cell Type | % intense | % equivocal | % none | # assays |
---|
Descriptive summary
Review transformation output from CIL harvest process and update mapping/script.
Processing & mapping instructions
part of #316