ucsdlib / damsmanager

DAMS Manager

CIL metadata transformation QA #317

Closed arwenhutt closed 4 years ago

arwenhutt commented 5 years ago

Descriptive summary

Review transformation output from CIL harvest process and update mapping/script.

Processing & mapping instructions

part of #316

arwenhutt commented 5 years ago

@lsitu I think we can go ahead with reviewing the metadata transformation before the updated process for harvesting files is figured out. It looks like the json files were downloaded to staging, can we get the metadata ingest files from step 4? Thanks!

lsitu commented 5 years ago

@arwenhutt Yes. I think we can start to verify it once Release 2.71 is done.

arwenhutt commented 5 years ago

@lsitu great, thanks!

lsitu commented 5 years ago

@rstanonik It looks like damsmanager doesn't have write access to the directory /pub/data2/damsmanager/dams_staging/rdcp-staging/rdcp-0126-cil/ for the CIL metadata transform on staging yet. It failed with the error Read-only file system when damsmanager tried to create the output file cil_excel_headings.csv. Could you check whether the tomcat user for damsmanager on staging can write to that directory? Thanks.

Here is the error I got from the tomcat log on staging:

java.io.FileNotFoundException: /pub/data2/damsmanager/dams_staging/rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_headings.csv (Read-only file system)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at edu.ucsd.library.xdre.web.CILHarvestingTaskController.writeContent(CILHarvestingTaskController.java:208)

lsitu commented 5 years ago

@rstanonik Have you had a chance to look into the write access permission on staging for damsmanager? The tomcat user needs write access to the dams_staging directory /pub/data2/damsmanager/dams_staging/rdcp-staging/rdcp-0126-cil/ for CIL ingest. Thanks.

rstanonik commented 5 years ago

I'm giving tomcat user rw access now, but it will take a while, there are over 1 million files. In which environments? prod, staging, qa?

lsitu commented 5 years ago

@rstanonik Thanks. Yes, moving forward, I think we need that to be set up for prod and QA as well.

rstanonik commented 5 years ago

@lsitu Try now, tomcat user should have rw access in prod, staging, and qa.

lsitu commented 5 years ago

Thanks @rstanonik. I'll run a test for it.

lsitu commented 5 years ago

@arwenhutt The CIL metadata transformation process finished over the weekend, and I think the transformed CSV output is ready for you to review now. Thanks.

Here is the location of the output: /rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_headings.csv

arwenhutt commented 5 years ago

@lsitu great! @abbypenn93 I don't think I'll be able to look at this till Thursday, you don't need to wait for me if you have time before then, but we can schedule some time Thursday to look at it together. Sound good?

abbypenn93 commented 5 years ago

Sounds good.

abbypenn93 commented 5 years ago

When reviewing the transformed metadata, we found the following issues. Please let me know if you have any questions.

  1. Records missing from output file:

    • 10,023 json files were downloaded, 400 are under copyright, so there should be 9,623 objects represented in the processed metadata file (cil_excel_headings.csv).
    • cil_excel_headings.csv has 7,979 objects, so 1,644 items are missing.
  2. Copyright items are included in the output file; these should be excluded as part of Step 2.b. of the process (https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit?pli=1#gid=1321122337)

  3. Subject heading ingest file not generated (step 4.a. of the process):
  4. Component format needs to be updated (includes backslashes):

    • \Component
  5. Content missing from output file for CELLTYPE, CELLULARCOMPONENT, HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION.

  6. End date - column missing from output file.

  7. Related resource:related field - components reversed:

  8. TERMSANDCONDITIONS - no copyright note column in output file.

  9. Note:preferred citation - no column in output file.

lsitu commented 5 years ago

Thanks @abbypenn93. I've added my questions to your review comments below, on the lines starting with >:

When reviewing the transformed metadata, we found the following issues. Please let me know if you have any questions.

  1. Records missing from output file:
    • 10,023 json files were downloaded, 400 are under copyright, so there should be 9623 objects represented in the processed metadata file (cil_excel_headings.csv).
    • cil_excel_headings.csv has 7979 objects, so 1,644 items are missing.

> Do you have a couple of examples that are missing so that I can inspect them specifically to see why they are missing?

  2. Copyright items are included in the output file; these should be excluded as part of Step 2.b. of the process (https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit?pli=1#gid=1321122337)

> What are the rules for identifying copyright items so that they can be excluded from the CSV output?

  3. Subject heading ingest file not generated (step 4.a. of the process):

> It seems like there's a gap there. I had assumed that the heading ingest file in [4a] was the CSV heading output itself that you are reviewing. Now I see you mean a Subject heading ingest file. I'll see how to produce it in the next step.

  4. Component format needs to be updated (includes backslashes):
    • \Component

> Sure. This new syntax will be applied next.

  5. Content missing from output file for CELLTYPE, CELLULARCOMPONENT, HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION.

> I'll look into object 39793 for the missing subject:anatomy fields.

> Could you give me an example that contains the fields for HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION? I don't see these fields in object 39793.

  6. End date - column missing from output file.

> Could you give me more instructions on how multiple values in CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE should be mapped to the date:creation value, beginDate and endDate, with examples? I think we may need examples that have only a beginDate but no endDate, if any such item exists.

  7. Related resource:related field - components reversed:

> Do you have an example that contains the above Related resource:related field?

  8. TERMSANDCONDITIONS - no copyright note column in output file.

> I don't think we have the copyright note header/field in our current Excel Standard InputStream. What's the instruction to convert the TERMSANDCONDITIONS field into the copyright element?

  9. Note:preferred citation - no column in output file.

> Do you have an example that contains the Note:preferred citation field so that I can take a look?

Thanks.

abbypenn93 commented 5 years ago

Responses to 5/6/19 post (organized by original item number):

Item 1. Records missing

> Do you have a couple of examples that are missing so that I can inspect them specifically to see why they are missing?

Examples of missing json files include: 2, 111, 120, 122-126, 130
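A spot check like this can be scripted by comparing the IDs of the downloaded JSON files with the object IDs in the CSV. The following is only a minimal sketch: it assumes the source files are named `<id>.json` (as in the examples in this thread) and that the CSV has an ID column; the column name `Object Unique ID` and the helper name `missing_ids` are assumptions for illustration, not taken from damsmanager.

```python
import csv
import io
from pathlib import Path


def missing_ids(json_names, csv_text, id_column="Object Unique ID"):
    """Return source IDs (taken from <id>.json names) that have no row in the CSV."""
    source_ids = {Path(name).stem for name in json_names if name.endswith(".json")}
    reader = csv.DictReader(io.StringIO(csv_text))
    output_ids = {row[id_column].strip() for row in reader if row.get(id_column)}
    missing = source_ids - output_ids
    # Sort numeric IDs numerically so runs such as 122-126 are easy to spot.
    return sorted(missing, key=lambda s: (not s.isdigit(), int(s) if s.isdigit() else 0, s))


# Tiny illustrative input; the titles are placeholders, not real CIL data.
sample_csv = "Object Unique ID,Title\n111,Example A\n130,Example B\n"
print(missing_ids(["2.json", "111.json", "120.json", "130.json"], sample_csv))
```

In practice the file names would come from the metadata_source directory and the CSV text from the processed output file.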

Item 2. Copyright

> What's the rules for copyright items that can be applied to exclude them from the CSV output?

From document: https://docs.google.com/document/d/1Eg2024XATxuwdzFtoLTNujalK2SDR_s6-Sg6zyUC9wE/edit

Content for harvesting identified. Conditions:

    • Not already harvested: see OLR file, but essentially loop through the directory in the git repo and see if IDs exist in the dams already.
    • Not under copyright: CIL_CCDB.CIL.CORE.TERMSANDCONDITIONS.free_text != copyright

Item 5: Missing content

For HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION: have not found any data for these fields in the output file.

> Could you give me an examples that contains the fields for HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION?

HUMAN_DEV_ANATOMY (maps to subject:anatomy):

Example 1: appears in 34598.json but not in output file: "HUMAN_DEV_ANATOMY": [ { "onto_name": "liver",

Example 2: appears in 37223.json but not in output file: "HUMAN_DEV_ANATOMY": [ { "onto_name": "superior cervical ganglion",

HUMAN_DISEASE (maps to subject:topic):

Example 1: appears in 10457.json but not in output file: "HUMAN_DISEASE": [ { "onto_name": "toxoplasmosis",

Example 2: appears in 32212.json but not in output file: "HUMAN_DISEASE": [ { "free_text": "prostate adenocarcinoma"

MOLECULARFUNCTION (maps to subject:topic field):

Correction: Please note that this field is present in some records. For instance object 10465 in the output file does contain the correct value for MOLECULARFUNCTION.

Example 1: appears in 12300.json but not in output file:

"MOLECULARFUNCTION": [ { "onto_name": "structural constituent of cytoskeleton", "onto_id": "GO:0005200" }, { "onto_name": "structural molecule activity", "onto_id": "GO:0005198"

Item 6: End date - column missing from output file.

> Could you give me more instructions regarding how multiple values in CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE should be mapped to date:creation value, beginDate and endDate with examples? I think we may need examples that only has the beginDate but no endDate if any such item exists

CIL_CCDB.CIL.CORE.ATTRIBUTION.DATE = Date:created = Begin date = End date. All these dates, when available, will be the same.

Item 7. Related resource:related field - components reversed:

> Do you have an example that contains the above Related resource:related field?

Example: object 37065 https://doi.org/doi:10.7295/W9CIL37065 @ Source Record in the Cell Image Library

Item 8. TERMSANDCONDITIONS - no copyright note column in output file.

> I don't think we have the copyright note header/field in our current Excel Standard InputStream. What's the instruction to convert the TERMSANDCONDITIONS field into the copyright element?

There is no header/field in our Excel Standard InputStream, but DOMM will use the information contained in the CIL_CCDB.CIL.CORE.TERMSANDCONDITIONS.free_text field to assign copyright information during ingest.

Item 9. Note:preferred citation - no column in output file.

> Do you have an example that contains the Note:preferred citation field so that I can take a look?

Instructions for preferred citation from “CIL Processing and Mapping Instructions” document, for CIL_CCDB.CIL.Citation.Title:

    • Replace YYYY value in "(YYYY)" with current year.
    • Replace text "CIL. Dataset" with "In Cell Image Library. UC San Diego Library Digital Collections. Dataset. DOI_placeholder"

E.g. "Sanford Palay (2011) CIL:10790, Rattus, brush border epithelial cell. CIL. Dataset" becomes "Sanford Palay (2018) CIL:10790, Rattus, brush border epithelial cell. In Cell Image Library. UC San Diego Library Digital Collections. Dataset. DOI_placeholder"
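The two replacement rules for the preferred citation could be sketched roughly as follows. This is only an illustration of the instructions quoted in this comment: the function name `preferred_citation` is hypothetical, and the sketch assumes the citation title contains a four-digit year in parentheses and the literal text "CIL. Dataset".

```python
import re

# Replacement text taken from the mapping instructions quoted in this thread.
REPLACEMENT = (
    "In Cell Image Library. UC San Diego Library Digital Collections. "
    "Dataset. DOI_placeholder"
)


def preferred_citation(citation_title, year):
    """Apply the two replacement rules to a Citation.Title string."""
    # Rule 1: replace the first "(YYYY)" with the given year.
    updated = re.sub(r"\(\d{4}\)", f"({year})", citation_title, count=1)
    # Rule 2: replace the source statement "CIL. Dataset".
    return updated.replace("CIL. Dataset", REPLACEMENT)


title = "Sanford Palay (2011) CIL:10790, Rattus, brush border epithelial cell. CIL. Dataset"
print(preferred_citation(title, 2018))
```

The year is passed in explicitly rather than read from the clock so that reruns of the transformation give the same output.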

Please let me know if you have any other questions, Abby

abbypenn93 commented 5 years ago

Additional item: The dates in the Date:creation and Begin date (and ultimately End date) fields need to follow YYYY-MM-DD format but currently do not. For instance, for Date:creation, object 37225 shows 10/27/54 in the output file.

lsitu commented 5 years ago

@abbypenn93 Thank you so much. I'll go over all of these and correct them. Just to clarify, for the additional item above, the Begin date (and ultimately End date) fields in object 37225 already follow the YYYY-MM-DD format. If you open the file with a text editor, you will see the date value 1954-10-27. For the Date:creation value, I think we can make it the same if all three values are the same and the data value can be parsed.
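For reference, normalizing a value such as 10/27/54 to YYYY-MM-DD might look like the sketch below. It is not the damsmanager implementation. The list of accepted input formats and the rule for two-digit years (any year later than `latest_year` is moved back a century) are assumptions made for illustration; values that cannot be parsed are returned as None so they can be reviewed by hand.

```python
from datetime import datetime


def to_iso_date(value, latest_year=2019):
    """Return value as YYYY-MM-DD, or None if it matches no known format."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y"):
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
        # Assumption: a two-digit year that lands in the future belongs to
        # the previous century (54 is read as 1954, not 2054).
        if fmt.endswith("%y") and parsed.year > latest_year:
            parsed = parsed.replace(year=parsed.year - 100)
        return parsed.date().isoformat()
    return None


print(to_iso_date("10/27/54"))
print(to_iso_date("1954-10-27"))
```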

lsitu commented 5 years ago

@mcritchlow Based on the QA comments from @abbypenn93 and our discussions, I've created PR https://github.com/ucsdlib/damsmanager/pull/328 to fix the CIL mapping issues. It's ready for review now. Thanks.

arwenhutt commented 5 years ago

@mcritchlow if the PR automatically closes this ticket, can you reopen it for the next round of output QA?

abbypenn93 commented 5 years ago

Hi All, We could use an update on this project--are you ready for DOMM to do another round of metadata transformation QA? Thanks, Abby

hjsyoo commented 5 years ago

Thanks for the ping, @abbypenn93. I think we're ready for another round of QA, unless @lsitu has a new update since May 24? I believe the work he's been doing with Willy only has to do with harvesting the data files themselves. The metadata harvesting work should be independent of that.

lsitu commented 5 years ago

@hjsyoo / @abbypenn93 We have the code for the QA fixes and Willy's REST API update ready, and we can deploy it to staging for review early next week. Both require deploying damsmanager to staging for testing, and the CIL metadata on GitHub is outdated, with lots of missing videos, so I think we had better review them all together. I'll initiate another round of CIL harvesting once damsmanager is deployed to staging next week.

lsitu commented 4 years ago

@arwenhutt / @abbypenn93 The new round of the CIL metadata transformation process has finished and the transformed CSV outputs are ready for you to review now:

/rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_object_input.csv

/rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_subject_headings.csv

abbypenn93 commented 4 years ago

That's good news, thanks.

abbypenn93 commented 4 years ago

I'm just about to report my QA findings. Looking better, but more work to do. Thanks, Abby

abbypenn93 commented 4 years ago

Status of issues reported in May (Ticket: https://github.com/ucsdlib/damsmanager/issues/317):

  1. Records missing from output file: Found 10081 unique objects in the spreadsheet = 10081 .json files in /rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_source; however, please see notes below about the records with the "copyright" attribute (Item 8).

Status: Fixed: found 10081 unique objects in the spreadsheet = 10081 .json files in /rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_source;

However, please see notes below (item 2) about the records with the "copyright" attribute.

  2. Copyright items are included in the harvest; these should be excluded as part of Step 2.b. of the harvesting process (https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit?pli=1#gid=1321122337)

Status: Open

There are 400 .json files with "TERMSANDCONDITIONS": {"free_text": "copyright"}.

Essentially, only harvest .json files from Cell Image Library that DO NOT contain "TERMSANDCONDITIONS": {"free_text": "copyright"}

A few examples from the current harvest that contain "TERMSANDCONDITIONS": {"free_text": "copyright"}: 7596, 35148, 35589, 7752, 12818, 36412, 22705

  3. Subject heading ingest file not generated (step 4.a. of the harvest process):

Status: Fixed - Subject file is present.

Open - in addition to the term (from onto_name), the onto_id should be placed in the "closeMatch" column of the subject heading document.

  4. Component format needs to be updated (includes backslashes):
    \Component

Status: Fixed

  5. Content missing from output file for CELLTYPE, CELLULARCOMPONENT, HUMAN_DEV_ANATOMY, HUMAN_DISEASE and MOLECULARFUNCTION.

Status: Fixed

  6. End date - column missing from output file.

Status: Fixed

  7. Related resource:related field - components reversed:

Status: Fixed

Example: https://doi.org/doi:10.7295/W9CIL40901 @ Source Record in the Cell Image Library

  8. TERMSANDCONDITIONS - no copyright note column in output file.

Status: Open

  9. Note:preferred citation - no column in output file (Note that the content is the same as in the Title column [CIL_CCDB.Citation.Title])

Status: Open

New issues (based on 2019 July 3 harvest)

  10. Can the file /rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_headings.csv be moved or deleted? It appears to be from the last harvest.

  11. Please confirm that the harvester is looping through the directory in git repo to see if IDs already exist in the dams.

  12. Source data issue: some .json files (and corresponding CCDB records) do not contain a CIL_CCDB.Citation.Title section.

Compare 50351 (http://www.cellimagelibrary.org/images/50351) with 243 (http://www.cellimagelibrary.org/images/243) (which has a citation).

Some of the source files that don't have CIL_CCDB.Citation.Title section: 50351 50352 50353 50354 50401 50451 50452 50453 50454 50512 50513 50514 50515 50516 50517 50518 50519 50520 50521

Solution: where CIL_CCDB.Citation.Title does not exist, set Title = Object Unique ID and Note:preferred citation = Object Unique ID
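The proposed fallback is simple to express in code. A minimal sketch, assuming the parsed JSON nests the title under CIL_CCDB then Citation then Title as the field name suggests; the function name is hypothetical, and the same fallback value would be used for Note:preferred citation.

```python
def title_or_object_id(record, object_id):
    """Return Citation.Title when present, otherwise the object's unique ID."""
    citation = (record.get("CIL_CCDB") or {}).get("Citation") or {}
    title = str(citation.get("Title") or "").strip()
    return title if title else str(object_id)


# 50351 is one of the source files reported above as lacking a citation.
print(title_or_object_id({"CIL_CCDB": {}}, "50351"))
```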

  13. Some objects are missing titles and components.

Note that this is different from Item 12. For these objects:
-- zips, tifs, and jpg files are present
-- CIL_CCDB.Citation.Title (citations) are present in the json

Examples: 49451 49453 49651 49701 49751 49752 49753 49754 49755 49756 49757 49758 49759

  14. There are 2 Date:creation columns (one is empty)

  15. Person:researcher - name missing where > 1 name listed in the source file
--when this occurs, only the last name on the list is returned
--some objects are correct, for example 7105-7108, where 2 of 2 researchers are returned

Example objects with missing researcher names: 2, 2592, 1030, 1031, 1032, 1033

Object 2: One of two researchers missing (transformation returned Trudy Aebig; json reads "ATTRIBUTION": { "Contributors": [ "Linda Parysek", "Trudy Aebig"

  16. Related resource: related - where there is no label for a web link in the source file, the text "Related resource @ Href" should be added.

See "CIL Processing and Mapping Instructions" document (https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit#gid=1321122337), cell 51D.

Examples: 12577 - @ depts.washington.edu/fishscop/

140 - @ http://images.nigms.nih.gov/index.cfm?event=viewDetail&imageID=2456

36424 - @ http://ccdb.ucsd.edu/sand/main?event=displayAll&mpid=67

25609 - @ http://intl.pnas.org/content/105/29/10017.full

  17. Type of Resource appears in two columns (has been split on pipe) but should appear in one. All object rows should read "data | still image"

  18. This note is for future reference in the event that the situation arises more frequently with future harvests.

No action necessary at this time.

Note:description - for object 10016: split on pipe in text so that the first Note:description field contains "Issue 2", the second description field contains the first part of the description, and the third description field contains "Volume 7"

"IMAGEDESCRIPTION": {"free_text": "NIH 3T3 cell (mouse embryonic fibroblast line)\nstained for Actin (green) and DNA (blue).\n\nPLoS Biology February 2009|Volume 7|Issue 2| e1000038 \n\nActive-Site Inhibitors of mTOR Target Rapamycin-Resistant Outputs of mTORC1 and mTORC2 \n\nMorris E. Feldman, Beth Apsel, Aino Uotila, Robbie Loewith, Zachary A. Knight, Davide Ruggero, Kevan M. Shokat"},

10016 is the only object that uses these two extra description fields.

  19. HTML elements are not being processed. Will these records display properly in the DAMS?

For example, in object 3216, IMAGEDESCRIPTION free_text (Note:description):

Tissue section of human prostate containing adenocarcinoma that has been immunostained for the cell-surface antigen BXP34. Nuclei are stained in blue. This image is part of a large collection of images generated from numerous specimens to characterize the distribution of BXP34 in human prostate tissue. A summary of the entire data set is provided below. No summary is available for BXP34 immunostain of human prostate.

\
\
This image is part of a large collection of immunohistochemistry images of cell-surface antigens generated by the SCGAP Urologic Epithelial Stem Cells (UESC) Project. The overall goal of the project is to characterize and isolate epithelial stem cell populations from two urologic organs, the prostate and bladder. Links are provided below for the UESC Project database, the entire human prostate immunostain summary, the BXP34 immunostain summary, and information on the specimen that this image is from. Other images of BXP34 human prostate immunostains are accessible following the group link.

Another example is 32178: \
\

\ \

\

[backslashes were added to the text to preserve how the text appears in the spreadsheet]

There are a bunch of Note:descriptions in the 32167-33147 range of objects that have this issue.

Note:technical details Section

  20. All of the fields below should map to a single Note:technical details column, but currently only PREPARATION and RELATIONTOINTACTCELL are mapped. Please see "CIL Processing and Mapping Instructions" document, cell 45D.

    • CIL_CCDB.CIL.CORE.PREPARATION.onto_name
    • CIL_CCDB.CIL.CORE.PREPARATION.free_text
    • CIL_CCDB.CIL.CORE.RELATIONTOINTACTCELL.onto_name
    • CIL_CCDB.CIL.CORE.RELATIONTOINTACTCELL.free_text
    • CIL_CCDB.CIL.CORE.ITEMTYPE.onto_name
    • CIL_CCDB.CIL.CORE.ITEMTYPE.free_text
    • CIL_CCDB.CIL.CORE.IMAGINGMODE.onto_name
    • CIL_CCDB.CIL.CORE.IMAGINGMODE.free_text
    • CIL_CCDB.CIL.CORE.PARAMETERIMAGED.onto_name
    • CIL_CCDB.CIL.CORE.PARAMETERIMAGED.free_text
    • CIL_CCDB.CIL.CORE.SOURCEOFCONTRAST.onto_name
    • CIL_CCDB.CIL.CORE.SOURCEOFCONTRAST.free_text
    • CIL_CCDB.CIL.CORE.VISUALIZATIONMETHODS.onto_name
    • CIL_CCDB.CIL.CORE.VISUALIZATIONMETHODS.free_text
    • CIL_CCDB.CIL.CORE.PROCESSINGHISTORY.onto_name
    • CIL_CCDB.CIL.CORE.PROCESSINGHISTORY.free_text
    • CIL_CCDB.CIL.CORE.DATAQUALIFICATION.onto_name
    • CIL_CCDB.CIL.CORE.DATAQUALIFICATION.free_text

  21. Note:technical details, Relation to intact cell - where PREPARATION field contains more than one term, RELATIONTOINTACTCELL is repeated.

Example objects: 742, 743, 745

"Preparation: glutaraldehyde fixed tissue
Relation to intact cell: isolated subcellular component"
"Preparation: unembedded tissue
Relation to intact cell: isolated subcellular component"

  22. Whenever terms are mapped to Note:technical details and there is > 1 term, separate each term with a semicolon.

Examples: 33147, 36274, 7825

For instance, 33147 returns: Imaging mode: bright-field microscopy Imaging mode: widefield illumination

---Should read "Imaging mode: bright-field microscopy; widefield illumination"

33147 returns: Source of contrast: differences in adsorption or binding of stain Source of contrast: differences in intrinsic optical density Source of contrast: distribution of a specific protein

--Should read: "Source of contrast: differences in adsorption or binding of stain; differences in intrinsic optical density; distribution of a specific protein"
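The expected formatting amounts to grouping terms by their label and joining each group with semicolons. The sketch below illustrates that idea using the 33147 values quoted above. The function name, the one-label-per-line layout, and the removal of repeated terms within a label are assumptions, not requirements stated in the mapping document.

```python
def technical_details(pairs):
    """Build one note from (label, term) pairs, joining terms with semicolons."""
    grouped = {}
    for label, term in pairs:
        terms = grouped.setdefault(label, [])
        # Assumption: a term repeated under the same label is listed once.
        if term not in terms:
            terms.append(term)
    # Dicts keep insertion order, so labels appear in first-seen order.
    return "\n".join(f"{label}: {'; '.join(terms)}" for label, terms in grouped.items())


pairs_33147 = [
    ("Imaging mode", "bright-field microscopy"),
    ("Imaging mode", "widefield illumination"),
    ("Source of contrast", "differences in adsorption or binding of stain"),
    ("Source of contrast", "differences in intrinsic optical density"),
    ("Source of contrast", "distribution of a specific protein"),
]
print(technical_details(pairs_33147))
```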

Note:technical details issues related to the "free_text" tag

  23. Note:technical details, SOURCEOFCONTRAST missing from some objects where the content appears after a "free_text" tag.

Examples: object 2, 194, 238, 239, 7118

9067 total SOURCEOFCONTRAST elements in .json files; Excel found 3146 instances

  24. Note:technical details, PREPARATION missing from some objects where the content appears after a "free_text" tag.

Examples (only found 3 cases):
12620 {"free_text": "dehydrated in ethanol"}
140 {"free_text": "in vitro assembly"}
36812 {"free_text": "in vitro assembly"}

  25. Note:technical details, PROCESSINGHISTORY missing from some objects where the content appears after a "free_text" tag.

Examples:
120 - 125: {"free_text": "unprocessed raw image"}
10008 {"free_text": "Print from negative scanned for Photoshop."}

  26. Note:technical details, PARAMETERIMAGED missing content ("PARAMETERIMAGED": {"free_text": "specimen height"},) -- notice that the term appears in "free_text" in this instance, whereas in object 40596 the term appears under "onto_name" with an "onto_id": "PARAMETERIMAGED": {"onto_name": "elastic scattering of electrons","onto_id": "FBbi:00000586"}

Examples where the term is not returned: Objects 7101 - 7123

  27. Note:technical details - DATAQUALIFICATION - data not present in spreadsheet; all of the entries in this field are associated with a "free_text" tag.

Oxygen shows 5306 instances of this field.
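The free_text cases above suggest reading a term from either key. Here is a minimal sketch of that lookup, assuming (from the JSON fragments quoted in this thread) that a field may hold a single object or a list of objects. The function name and the choice to prefer onto_name when both keys are present are assumptions.

```python
def field_terms(entry):
    """Collect terms from a CIL field that is a dict or a list of dicts."""
    if entry is None:
        return []
    items = entry if isinstance(entry, list) else [entry]
    found = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # Assumption: prefer onto_name, fall back to free_text.
        value = item.get("onto_name") or item.get("free_text")
        if value:
            found.append(str(value).strip())
    return found


print(field_terms({"free_text": "specimen height"}))
print(field_terms({"onto_name": "elastic scattering of electrons", "onto_id": "FBbi:00000586"}))
```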

Please let me know if you have any questions. Thanks, Abby

lsitu commented 4 years ago

@abbypenn93 Thank you very much for the hard work. I think we may need to clarify the open issues and new issues (see section Q: for my comments):

  3. Subject heading ingest file not generated (step 4.a. of the harvest process):

Status: Fixed - Subject file is present.

Open - in addition to the term (from onto_name), the onto_id should be placed in the "closeMatch" column of subject heading document.

Q: The term closeMatch doesn't have a corresponding mapping in the Excel Import Stream tool. I think it will be a problem if subjects have the same subject term but different closeMatch values. We can try exporting it to the subject headings output and see how it goes.
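Whether the same term ever arrives with different closeMatch values can be checked before export. The sketch below is illustrative only: the function name is hypothetical, the input is assumed to be (onto_name, onto_id) pairs collected from the JSON sources, and the conflicting ID in the demo is an invented placeholder.

```python
from collections import defaultdict


def close_match_conflicts(subjects):
    """Return terms that occur with more than one distinct onto_id."""
    ids_by_term = defaultdict(set)
    for term, onto_id in subjects:
        if onto_id:
            ids_by_term[term].add(onto_id)
    return {term: sorted(ids) for term, ids in ids_by_term.items() if len(ids) > 1}


subjects = [
    ("structural constituent of cytoskeleton", "GO:0005200"),
    ("structural molecule activity", "GO:0005198"),
    # Placeholder ID, invented here only to demonstrate a conflict.
    ("structural molecule activity", "EXAMPLE:0000001"),
]
print(close_match_conflicts(subjects))
```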

  8. TERMSANDCONDITIONS - no copyright note column in output file.

Status: Open

Q: The mapping of the TERMSANDCONDITIONS to copyright note is not a valid header in Excel Import Stream tool, nor a valid header in the new Batch Import/Export tools. I think that's why it's missing. Should we use the term copyrightNote as what's used in Batch Import/Export tools?

New issues (based on 2019 July 3 harvest)

  10. Can the file /rdcp-staging/rdcp-0126-cil/cil_harvest_2019-03-07/metadata_processed/cil_excel_headings.csv be moved or deleted? It appears to be from the last harvest.

Q: Yes. I think you can simply delete it if you no longer need it.

  11. Please confirm that the harvester is looping through the directory in git repo to see if IDs already exist in the dams.

Q: We no longer use the git repo; we use the CIL public API to download the JSON sources. All JSON files downloaded should be processed. If you see that anything is missing, please bring it up and we can see what the issue is.

  12. Source data issue: some .json files (and corresponding CCDB records) do not contain a CIL_CCDB.Citation.Title section.

Compare 50351 (http://www.cellimagelibrary.org/images/50351) with 243 (http://www.cellimagelibrary.org/images/243) (which has a citation).

Some of the source files that don't have CIL_CCDB.Citation.Title section: 50351 50352 50353 50354 50401 50451 50452 50453 50454 50512 50513 50514 50515 50516 50517 50518 50519 50520 50521

Solution: where CIL_CCDB.Citation.Title does not exist, set Title = Object Unique ID and Note:preferred citation = Object Unique ID

Q: Will do.

  13. Some objects are missing titles and components.

Note that this is different from Item 12. For these objects:
-- zips, tifs, and jpg files are present
-- CIL_CCDB.Citation.Title (citations) are present in the json

Examples: 49451 49453 49651 49701 49751 49752 49753 49754 49755 49756 49757 49758 49759

Q: Hmm, something weird that we may need to inspect. Not sure whether it's related to the format of the JSON source or not.

  14. There are 2 Date:creation columns (one is empty)

Q: Maybe this is caused by inconsistency in the JSON source for the Date:creation mapping. I see some records have one date value, while others have several date values, which may cause the problem. We'll see how to eliminate the empty column.

  15. Person:researcher - name missing where > 1 name listed in the source file
--when this occurs, only the last name on the list is returned
--some objects are correct, for example 7105-7108, where 2 of 2 researchers are returned

Q: It looks like the source format is different here. Object 7105 actually has only one value for two researchers. How can we deal with it?

"ATTRIBUTION": {
                    "Contributors": [
                        "Luda Shlyaktenko, Chris Woodcock"
                    ],

Example objects with missing researcher names: 2, 2592, 1030, 1031, 1032, 1033

Object 2: One of two researchers missing (transformation returned Trudy Aebig; json reads "ATTRIBUTION": { "Contributors": [ "Linda Parysek", "Trudy Aebig"

Q: Will fix it.

  16. Related resource: related - where there is no label for a web link in the source file, the text "Related resource @ Href" should be added.

See "CIL Processing and Mapping Instructions" document (https://docs.google.com/spreadsheets/d/1NnYw3bgNraJ9hZCH1SOKijh_JvJh3So1LpDC24vu7h8/edit#gid=1321122337), cell 51D.

Examples: 12577 - @ depts.washington.edu/fishscop/

140 - @ http://images.nigms.nih.gov/index.cfm?event=viewDetail&imageID=2456

36424 - @ http://ccdb.ucsd.edu/sand/main?event=displayAll&mpid=67

25609 - @ http://intl.pnas.org/content/105/29/10017.full

Q: Okay. Will append text Related resource to make it Related resource @ Href.
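The default label could be applied with a helper along these lines. It is a sketch under the assumption that the output uses the `label @ href` order implied by the "Related resource @ Href" instruction; the function and parameter names are hypothetical.

```python
def related_resource(href, label=None, default_label="Related resource"):
    """Format a related resource, using a default label when none is given."""
    text = (label or "").strip() or default_label
    return f"{text} @ {href.strip()}"


# Object 12577 has no label in the source file.
print(related_resource("depts.washington.edu/fishscop/"))
print(related_resource("https://doi.org/doi:10.7295/W9CIL37065",
                       "Source Record in the Cell Image Library"))
```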

  17. Type of Resource appears in two columns (has been split on pipe) but should appear in one. All object rows should read "data | still image"

Q: I think this is correct, since all values are split on pipe into single ones. I don't think it's a good idea to add custom handling for Type of Resource to put it into one column unless we see the benefit of doing that.

  18. This note is for future reference in the event that the situation arises more frequently with future harvests.

No action necessary at this time.

Note:description - for object 10016: split on pipe in text so that the first Note:description field contains "Issue 2", the second description field contains the first part of the description, and the third description field contains "Volume 7"

"IMAGEDESCRIPTION": {"free_text": "NIH 3T3 cell (mouse embryonic fibroblast line)\nstained for Actin (green) and DNA (blue).\n\nPLoS Biology February 2009|Volume 7|Issue 2| e1000038 \n\nActive-Site Inhibitors of mTOR Target Rapamycin-Resistant Outputs of mTORC1 and mTORC2 \n\nMorris E. Feldman, Beth Apsel, Aino Uotila, Robbie Loewith, Zachary A. Knight, Davide Ruggero, Kevan M. Shokat"},

10016 is the only object that uses these two extra description fields.

  19. HTML elements are not being processed. Will these records display properly in the DAMS?

For example, in object 3216, IMAGEDESCRIPTION free_text (Note:description):

Tissue section of human prostate containing adenocarcinoma that has been immunostained for the cell-surface antigen BXP34. Nuclei are stained in blue. This image is part of a large collection of images generated from numerous specimens to characterize the distribution of BXP34 in human prostate tissue. A summary of the entire data set is provided below. No summary is available for BXP34 immunostain of human prostate.

\
\
This image is part of a large collection of immunohistochemistry images of cell-surface antigens generated by the SCGAP Urologic Epithelial Stem Cells (UESC) Project. The overall goal of the project is to characterize and isolate epithelial stem cell populations from two urologic organs, the prostate and bladder. Links are provided below for the UESC Project database, the entire human prostate immunostain summary, the BXP34 immunostain summary, and information on the specimen that this image is from. Other images of BXP34 human prostate immunostains are accessible following the group link.

Another example is 32178: \
\

Cell Type % intense % equivocal % none # assays
\ \

\

[backslashes were added to the text to preserve how the text appears in the spreadsheet]

There are a bunch of Note:descriptions in the 32167-33147 range of objects that have this issue.

Q: I think we just preserve the format of the text in JSON source. Please give mapping instructions if special handling is needed.

Note:technical details Section

Q: Could we reorganize issues in this section regarding how Note:technical details should be mapped in format like:

Source / CCDB field | example object | problem | expected result | mapping instructions

There are too many fields listed for Note:technical details, which is confusing. With the mapping for Note:technical details in row#45 and row#46, how one Note:technical details is constructed with so many Source / CCDB fields listed below? That is, what Source / CCDB fields should be mapped to one Note:technical details element and how the note value is constructed.

  20. All of the fields below should map to a single Note:technical details column, but currently only PREPARATION and RELATIONTOINTACTCELL are mapped. Please see "CIL Processing and Mapping Instructions" document, cell 45D.

CIL_CCDB.CIL.CORE.PREPARATION.onto_name
CIL_CCDB.CIL.CORE.PREPARATION.free_text
CIL_CCDB.CIL.CORE.RELATIONTOINTACTCELL.onto_name
CIL_CCDB.CIL.CORE.RELATIONTOINTACTCELL.free_text
CIL_CCDB.CIL.CORE.ITEMTYPE.onto_name
CIL_CCDB.CIL.CORE.ITEMTYPE.free_text
CIL_CCDB.CIL.CORE.IMAGINGMODE.onto_name
CIL_CCDB.CIL.CORE.IMAGINGMODE.free_text
CIL_CCDB.CIL.CORE.PARAMETERIMAGED.onto_name
CIL_CCDB.CIL.CORE.PARAMETERIMAGED.free_text
CIL_CCDB.CIL.CORE.SOURCEOFCONTRAST.onto_name
CIL_CCDB.CIL.CORE.SOURCEOFCONTRAST.free_text
CIL_CCDB.CIL.CORE.VISUALIZATIONMETHODS.onto_name
CIL_CCDB.CIL.CORE.VISUALIZATIONMETHODS.free_text
CIL_CCDB.CIL.CORE.PROCESSINGHISTORY.onto_name
CIL_CCDB.CIL.CORE.PROCESSINGHISTORY.free_text
CIL_CCDB.CIL.CORE.DATAQUALIFICATION.onto_name
CIL_CCDB.CIL.CORE.DATAQUALIFICATION.free_text

  21. Note:technical details, Relation to intact cell - where PREPARATION field contains more than one term, RELATIONTOINTACTCELL is repeated.

Example objects: 742, 743, 745
"Preparation: glutaraldehyde fixed tissue
Relation to intact cell: isolated subcellular component"
"Preparation: unembedded tissue
Relation to intact cell: isolated subcellular component"

  22. Whenever terms are mapped to Note:technical details and there is > 1 term, separate each term with a semicolon.

Examples: 33147, 36274, 7825

For instance, 33147 returns:
Imaging mode: bright-field microscopy
Imaging mode: widefield illumination

---Should read "Imaging mode: bright-field microscopy; widefield illumination"

33147 returns:
Source of contrast: differences in adsorption or binding of stain
Source of contrast: differences in intrinsic optical density
Source of contrast: distribution of a specific protein

--Should read: "Source of contrast: differences in adsorption or binding of stain; differences in intrinsic optical density; distribution of a specific protein"

Note: technical details issues related to the "free_text" tag

  23. Note: technical details, SOURCEOFCONTRAST missing from some objects where the content appears after a "free_text" tag.

Examples: object 2, 194, 238, 239, 7118

9067 total SOURCEOFCONTRAST elements in .json files; Excel found 3146 instances

  24. Note:technical details, PREPARATION missing from some objects where the content appears after a "free_text" tag.

Examples (only found 3 cases):
12620 {"free_text": "dehydrated in ethanol"}
140 {"free_text": "in vitro assembly"}
36812 {"free_text": "in vitro assembly"}

  25. Note:technical details, PROCESSINGHISTORY missing from some objects where the content appears after a "free_text" tag.

Examples:
120 - 125: {"free_text": "unprocessed raw image"}
10008 {"free_text": "Print from negative scanned for Photoshop."}

  26. Note:technical details, PARAMETERIMAGED missing content ("PARAMETERIMAGED": {"free_text": "specimen height"},) -- notice that the term appears in "free_text" in this instance, whereas in object 40596, the term appears under "onto_name" and with an "onto_id":
"PARAMETERIMAGED": {"onto_name": "elastic scattering of electrons","onto_id": "FBbi:00000586"}

Examples where the term is not returned: Objects 7101 - 7123

  27. Note:technical details - DATAQUALIFICATION - data not present in spreadsheet; all of the entries in this field are associated with a "free_text" tag.

lsitu commented 4 years ago

@abbypenn93 I see that special cases and inconsistent data patterns in the JSON source are the root cause of the incomplete metadata records in item#13: for the same field, some values are a JSONObject while others are a JSONArray, which triggers a parsing exception that interrupts the conversion process for those records. I am moving forward to fix it now. Also, item#9 has the wrong path CIL_CCCDB.CIL.Citation.Title in row#52 of the "CIL Processing and Mapping Instructions" document, which should be CIL_CCCDB.Citation.Title:

9. Note:preferred citation - no column in output file (Note that the content is the same as in the Title column [CIL_CCCDB.Citation.Title])

Could you clarify the mappings in the Note:technical details Section (item#20 - item#27)? I find the mapping instructions in cell 45D and cell 46D confusing, and that may be causing the conversion problems for the Note:technical details Section, as in item#20:

20. All of the fields below should map to a single Note:technical details column, but currently only PREPARATION and RELATIONTOINTACTCELL are mapped. Please see "CIL Processing and Mapping Instructions" document, cell 45D.
CIL_CCDB.CIL.CORE.PREPARATION.onto_name
CIL_CCDB.CIL.CORE.PREPARATION.free_text
CIL_CCDB.CIL.CORE.RELATIONTOINTACTCELL.onto_name
CIL_CCDB.CIL.CORE.RELATIONTOINTACTCELL.free_text
CIL_CCDB.CIL.CORE.ITEMTYPE.onto_name
CIL_CCDB.CIL.CORE.ITEMTYPE.free_text
CIL_CCDB.CIL.CORE.IMAGINGMODE.onto_name
CIL_CCDB.CIL.CORE.IMAGINGMODE.free_text
CIL_CCDB.CIL.CORE.PARAMETERIMAGED.onto_name
CIL_CCDB.CIL.CORE.PARAMETERIMAGED.free_text
CIL_CCDB.CIL.CORE.SOURCEOFCONTRAST.onto_name
CIL_CCDB.CIL.CORE.SOURCEOFCONTRAST.free_text
CIL_CCDB.CIL.CORE.VISUALIZATIONMETHODS.onto_name
CIL_CCDB.CIL.CORE.VISUALIZATIONMETHODS.free_text
CIL_CCDB.CIL.CORE.PROCESSINGHISTORY.onto_name
CIL_CCDB.CIL.CORE.PROCESSINGHISTORY.free_text
CIL_CCDB.CIL.CORE.DATAQUALIFICATION.onto_name
CIL_CCDB.CIL.CORE.DATAQUALIFICATION.free_text

Thank you.
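
The JSONObject-versus-JSONArray inconsistency described above can be handled by normalizing every CORE field value to a list before reading its terms. The following is only a minimal sketch, not the damsmanager code: it assumes the JSON has already been parsed into plain `java.util.Map` and `java.util.List` values, and the class and method names (`CoreFieldValues`, `asEntries`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; assumes JSON objects are parsed to Map and JSON arrays to List.
public class CoreFieldValues {

    /**
     * Returns the entries of a CORE field as a list, whether the parsed value
     * is a single object (Map) or an array of objects (List).
     */
    public static List<Map<String, Object>> asEntries(Object value) {
        List<Map<String, Object>> entries = new ArrayList<>();
        if (value instanceof Map) {
            entries.add(castEntry(value));
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                if (item instanceof Map) {
                    entries.add(castEntry(item));
                }
            }
        }
        // A missing field (null) or any other shape yields an empty list
        // rather than an exception that would interrupt the record.
        return entries;
    }

    @SuppressWarnings("unchecked")
    private static Map<String, Object> castEntry(Object o) {
        return (Map<String, Object>) o;
    }
}
```

With this shape, a field such as SOURCEOFCONTRAST can be read the same way whether the source record holds one term or several.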

abbypenn93 commented 4 years ago

Hi Longshou. Thanks for all your hard work. I'm currently wrapping up some other projects, but I should be able to respond to your questions later today.

lsitu commented 4 years ago

@abbypenn93 Could you give me more instructions for Note:technical details section mapping? Thanks.

abbypenn93 commented 4 years ago

Hi Longshou. Below is my response to your questions. Please let me know if anything needs clarification.

(Only includes questions that required a response)

  1. Subject heading ingest file not generated (step 4.a. of the harvest process):

Status: Fixed - Subject file is present.

Open - in addition to the term (from onto_name), the onto_id should be placed in the "closeMatch" column of subject heading document.

Q: The term closeMatch doesn't have a corresponding mapping in the Excel Import Stream tool. I think it will be a problem if subjects have the same subject term but different closeMatch values. We can try exporting it to the subject headings output and see how it goes.

A: I was just trying to simplify the process, but no matter :) There’s a sheet in the CIL Processing and Mapping document called “(Subject) Heading ingest file.” This sheet shows what should appear in each column of the Subject head ingest file. There’s also an example of expected result on row 31 of this sheet.

  1. TERMSANDCONDITIONS - no copyright note column in output file.

Status: Open

Q: The copyright note header that TERMSANDCONDITIONS is mapped to is not a valid header in the Excel Import Stream tool, nor in the new Batch Import/Export tools. I think that's why it's missing. Should we use the term copyrightNote, as used in the Batch Import/Export tools?

A: Good point. The column heading is named “Copyright status” in the ingest template, let’s go with that.

New issues (based on 2019 July 3 harvest)

  1. Please confirm that the harvester is looping through the directory in git repo to see if IDs already exist in the dams.

Q: We no longer use the git repo; we use the CIL public API to download the JSON sources. All JSON files downloaded should be processed. If you see that anything is missing, please bring it up and we can see what the issue is.

A: That’s great. How can I check to see if anything is missing?

  1. Some objects are missing titles and components.

Note that this is different from Item 12. For these objects:
-- zips, tifs, and jpg files are present
-- CIL_CCDB.Citation.Title (citations) are present in the json

Examples: 49451 49453 49651 49701 49751 49752 49753 49754 49755 49756 49757 49758 49759

Q: Hmm, something weird that we may need to inspect. Not sure whether it's related to the format of the JSON source or not.

A: Ok - See Longshou’s Aug. 5 comment

  1. There are 2 Date:creation columns (one is empty)

Q: Maybe this is caused by inconsistency in the JSON source for the Date:creation mapping. We'll see how to eliminate the empty column. I see some records have one date value, while others have several date values, which may cause the problem.

A: Ok

  1. Person:researcher - name missing where > 1 name listed in the source file
--when this occurs, only the last name on the list is returned
--some objects are correct, for example 7105-7108, where 2 of 2 researchers are returned

Q: It looks like the source formats are different here. Object 7105 actually has only one value for two researchers. How can we deal with it?

A: General Processing for the Contributor Field All: return name or names within brackets (see instructions for special cases, below)

Replace semicolons and commas with pipes. This step will work except in the rare case where the last name is presented first (DOMM will process these separately).

Special Cases

Please refer to the new sheet in the CIL Processing and Mapping Instructions document called “Contributor name processing.” This sheet provides instructions on handling the various contributor name patterns.
--See sheet rows: 2, 4, 7, 17, 23, 53, 56, 464, 622, 738 for examples of the various patterns
--Please let me know if you discover additional unique patterns

Where contributor name is coupled with the text “YYYY Olympus BioScapes Digital Imaging Competition®” (e.g. 2009 Olympus BioScapes Digital Imaging Competition®) remove “YYYY Olympus BioScapes Digital Imaging Competition®”.
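
As a rough illustration of the general contributor processing described above (split on semicolons and commas, join with pipes, drop the Olympus BioScapes competition text), here is a minimal sketch. It is not the damsmanager implementation: the class and method names are hypothetical, the `" | "` output delimiter (pipe with surrounding spaces) is an assumption, and the special patterns from the “Contributor name processing” sheet are not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the general rule only.
public class ContributorNames {

    // "YYYY Olympus BioScapes Digital Imaging Competition®" with any four-digit year.
    private static final String COMPETITION =
        "\\d{4} Olympus BioScapes Digital Imaging Competition\u00AE";

    /** Flattens the Contributors array into one pipe-delimited string. */
    public static String toPipeDelimited(List<String> contributors) {
        List<String> names = new ArrayList<>();
        for (String raw : contributors) {
            String cleaned = raw.replaceAll(COMPETITION, "");
            // One array element may hold several names separated by ";" or ",".
            for (String part : cleaned.split("[;,]")) {
                String name = part.trim();
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return String.join(" | ", names);
    }
}
```

Under these assumptions, both `["Linda Parysek", "Trudy Aebig"]` and `["Luda Shlyaktenko, Chris Woodcock"]` yield two names. A name presented last name first, such as 'woodcock, christopher', would also be split in two, which is the rare case left for DOMM to handle separately.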

Issue:

"ATTRIBUTION": { "Contributors": [ "Luda Shlyaktenko, Chris Woodcock" ],

Example objects with missing researcher names: 2, 2592, 1030, 1031, 1032, 1033

Object 2: One of two researchers missing (transformation returned Trudy Aebig; json reads "ATTRIBUTION": {
                    "Contributors": [
                        "Linda Parysek",
                        "Trudy Aebig"

**Q:** _Will fix it._

17. Type of Resource appears in two columns (has been split on pipe) but should appear in one. All object rows should read "data | still image"     

**Q:** _I think this is correct since all values are split on pipe into single ones. I think it's not a good idea to do custom handling of `Type of Resource` to make it into one column unless we see the benefits of doing that._

**A:** Makes sense. No action required.

19. HTML elements are not being processed. Will these records display properly in the DAMS?

For example, in object 3216, IMAGEDESCRIPTION free_text (Note:description):

Tissue section of human prostate containing adenocarcinoma that has been immunostained for the cell-surface antigen BXP34. Nuclei are stained in blue. This image is part of a large collection of images generated from numerous specimens to characterize the distribution of BXP34 in human prostate tissue. A summary of the entire data set is provided below. No summary is available for BXP34 immunostain of human prostate.

\<br />
\<br />
This image is part of a large collection of immunohistochemistry images of cell-surface antigens generated by the SCGAP Urologic Epithelial Stem Cells (UESC) Project. The overall goal of the project is to characterize and isolate epithelial stem cell populations from two urologic organs, the prostate and bladder. Links are provided below for the UESC Project database, the entire human prostate immunostain summary, the BXP34 immunostain summary, and information on the specimen that this image is from. Other images of BXP34 human prostate immunostains are accessible following the group link.

Another example is 32178:
\<br />
\<table border="1px">
\<tr>
\<th>Cell Type</th>
<th>% intense</th>
<th>% equivocal</th>
<th>% none</th>
<th># assays</th>
\</tr>

[_backslashes were added to the text to preserve how the text appears in the spreadsheet_]

There are a bunch of Note:descriptions in the 32167-33147 range of objects that have this issue.    

**Q:** _I think we just preserve the format of the text in JSON source. Please give mapping instructions if special handling is needed._

**A:** Is this because the HTML will be processed correctly by the DAMS UI?

_**Note:technical details Section**_

**Q:** _Could we reorganize issues in this section regarding how `Note:technical details` should be mapped in format like:_

Source / CCDB field | example object | problem | expected result | mapping instructions



_There are too many fields listed for `Note:technical details`, which is confusing. Given the mapping for `Note:technical details` in row#45 and row#46, how is one `Note:technical details` value constructed from the many `Source / CCDB field`s listed below? That is, which `Source / CCDB field`s should be mapped to one `Note:technical details` element, and how is the note value constructed?_

Yes, the notes can be confusing! 

See the new sheet in the CIL Processing and Mapping Instructions doc labeled “Technical details processing” with examples and processing instructions.

Hopefully this will help. If not, please let me know.

20. All of the fields below should map to a single Note:technical details column, but currently only PREPARATION and RELATIONTOINTACTCELL are mapped. Please see "CIL Processing and Mapping Instructions" document, cell 45D.

CIL_CCDB.CIL.CORE.PREPARATION.onto_name
CIL_CCDB.CIL.CORE.PREPARATION.free_text
CIL_CCDB.CIL.CORE.RELATIONTOINTACTCELL.onto_name
CIL_CCDB.CIL.CORE.RELATIONTOINTACTCELL.free_text
CIL_CCDB.CIL.CORE.ITEMTYPE.onto_name
CIL_CCDB.CIL.CORE.ITEMTYPE.free_text
CIL_CCDB.CIL.CORE.IMAGINGMODE.onto_name
CIL_CCDB.CIL.CORE.IMAGINGMODE.free_text 
CIL_CCDB.CIL.CORE.PARAMETERIMAGED.onto_name
CIL_CCDB.CIL.CORE.PARAMETERIMAGED.free_text
CIL_CCDB.CIL.CORE.SOURCEOFCONTRAST.onto_name
CIL_CCDB.CIL.CORE.SOURCEOFCONTRAST.free_text
CIL_CCDB.CIL.CORE.VISUALIZATIONMETHODS.onto_name
CIL_CCDB.CIL.CORE.VISUALIZATIONMETHODS.free_text
CIL_CCDB.CIL.CORE.PROCESSINGHISTORY.onto_name
CIL_CCDB.CIL.CORE.PROCESSINGHISTORY.free_text
CIL_CCDB.CIL.CORE.DATAQUALIFICATION.onto_name
CIL_CCDB.CIL.CORE.DATAQUALIFICATION.free_text

**Q:** Could you clarify the mappings in the Note:technical details Section (item#20 - item#27)? I find the mapping instructions in cell 45D and cell 46D confusing, and that may be causing the conversion problems for the Note:technical details Section, as in item#20.

**A:** This should be addressed in the CIL Processing and Mapping Instructions doc labeled “Technical details processing” sheet

21. Note:technical details, Relation to intact cell - where PREPARATION field contains more than one term, RELATIONTOINTACTCELL is repeated. 

Example objects: 742, 743, 745
"Preparation: glutaraldehyde fixed tissue
Relation to intact cell: isolated subcellular component"        
"Preparation: unembedded tissue    
Relation to intact cell: isolated subcellular component"    

**A:** This should be addressed in the CIL Processing and Mapping Instructions doc labeled “Technical details processing” sheet 

22. Whenever terms are mapped to Note:technical details and there is > 1 term, separate each term with a semicolon.

Examples: 33147, 36274, 7825

For instance, 33147 returns:
Imaging mode: bright-field microscopy 
Imaging mode: widefield illumination

---Should read "Imaging mode: bright-field microscopy; widefield illumination"

33147 returns:
Source of contrast: differences in adsorption or binding of stain
Source of contrast: differences in intrinsic optical density
Source of contrast: distribution of a specific protein

--Should read: "Source of contrast: differences in adsorption or binding of stain; differences in intrinsic optical density; distribution of a specific protein"        
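
The expected output above (one label per field, terms joined with semicolons) could be produced along these lines. This is a minimal sketch only: it assumes the JSON is already parsed into `java.util.Map` and `java.util.List` values, the class and method names are hypothetical, and preferring `onto_name` over `free_text` when an entry has both is an assumption. The authoritative rules are in the “Technical details processing” sheet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not the damsmanager implementation.
public class TechnicalDetailsLine {

    /** Builds e.g. "Imaging mode: term one; term two", or "" when no term is found. */
    public static String build(String label, Object fieldValue) {
        List<String> terms = new ArrayList<>();
        if (fieldValue instanceof List) {
            // Several terms: the field is an array of objects.
            for (Object item : (List<?>) fieldValue) {
                addTerm(terms, item);
            }
        } else {
            // One term: the field is a single object (or is absent).
            addTerm(terms, fieldValue);
        }
        return terms.isEmpty() ? "" : label + ": " + String.join("; ", terms);
    }

    private static void addTerm(List<String> terms, Object entry) {
        if (!(entry instanceof Map)) {
            return;
        }
        Map<?, ?> map = (Map<?, ?>) entry;
        // Assumption: use onto_name when present, otherwise fall back to free_text.
        Object term = map.containsKey("onto_name") ? map.get("onto_name") : map.get("free_text");
        if (term != null && !term.toString().trim().isEmpty()) {
            terms.add(term.toString().trim());
        }
    }
}
```

Emitting the label once per field, rather than once per term, is what avoids the repeated "Imaging mode:" and "Source of contrast:" prefixes, and the `free_text` fallback keeps entries without an `onto_name` from being dropped.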

_Note: technical details issues related to the "free_text" tag_     

**A:** See the CIL Processing and Mapping Instructions doc labeled “Technical details processing” sheet 

23. Note: technical details, SOURCEOFCONTRAST missing from some objects where the content appears after a "free_text" tag.

Examples: object 2, 194, 238, 239, 7118

9067 total SOURCEOFCONTRAST elements in .json files; Excel found 3146 instances     

**A:** See the CIL Processing and Mapping Instructions doc labeled “Technical details processing” sheet 

24. Note:technical details, PREPARATION missing from some objects where the content appears after a "free_text" tag.

Examples (only found 3 cases):
12620  {"free_text": "dehydrated in ethanol"}
140 {"free_text": "in vitro assembly"}
36812 {"free_text": "in vitro assembly"}

**A:** See the CIL Processing and Mapping Instructions doc labeled “Technical details processing” sheet 

25. Note:technical details, PROCESSINGHISTORY missing from some objects where the content appears after a "free_text" tag. 

Examples: 
120 - 125: {"free_text": "unprocessed raw image"} 
10008 {"free_text": "Print from negative scanned for Photoshop."}   

**A:** See the CIL Processing and Mapping Instructions doc labeled “Technical details processing” sheet 

26. Note:technical details, PARAMETERIMAGED missing content ("PARAMETERIMAGED": {"free_text": "specimen height"},) -- notice that the term appears in "free_text" in this instance, whereas in object 40596, the term appears under "onto_name" and with an "onto_id":
"PARAMETERIMAGED": {"onto_name": "elastic scattering of electrons","onto_id": "FBbi:00000586"}

Examples where the term is not returned:
Objects 7101 - 7123             

27. Note:technical details - DATAQUALIFICATION - data not present in spreadsheet; all of the entries in this field are associated with a "free_text" tag.

**A:** See the CIL Processing and Mapping Instructions doc labeled “Technical details processing” sheet 

lsitu commented 4 years ago

@abbypenn93 Thanks. From the “Technical details processing” sheet, I can now see how Note:technical details should be mapped. For item 19 (HTML elements are not being processed): yes, damspas will process it in some way, but I think we need to test it to see how it goes.

lsitu commented 4 years ago

@abbypenn93 In sheet "Contributor name processing", Row#4, I see the value ignore in column ITS Dev action. Do you mean we should just ignore all names with last name first, like ['woodcock, christopher']? That is, return no name value for any name with last name first (a pattern delimited by a comma, like 'woodcock, christopher').

abbypenn93 commented 4 years ago

Please return all names, even if last name is presented first.

I didn't specify this, but if you could detect that a single person's name is enclosed in brackets and separated by a comma in the json, it can be left unprocessed. In this most recent harvest, 'woodcock, christopher' is the only name presented last name first, and most names in the Cell Image Library database are first name last.

If you feel it's not worth the time to implement this check, please let me know and I will clarify my original recommendation.

Thanks, Abby

lsitu commented 4 years ago

@abbypenn93 I see it as more complicated, so I think we should keep this name pattern unless we know there will be no such cases in the future. For example, Row#56 has the following value: ['Buchanan, JoAnn (Stanford) (specimen prep)', 'Genoud, Christel (Gatan) (imaging)']

abbypenn93 commented 4 years ago

Ah, yes! Then I agree, just replace commas and semicolons with pipes and DOMM will take care of the oddballs after the harvest.

lsitu commented 4 years ago

@arwenhutt / @abbypenn93 I am almost done coding the changes we need. But there is something wrong with the public CIL API: it returns JSON source records instead of the CIL IDs when querying CIL IDs with https://cilia.crbs.ucsd.edu/rest/public_ids?from=0&size=10&lastModified=1560150000, so I can't test it. It looks like @hjsyoo is OoO at this time and won't be back by August 20th. Who should we report this kind of issue to for public CIL API support? Thanks.

Here is the result I got:

{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":9688,"max_score":0.4948447,"hits":[{"_index":"ccdbv8","_type":"data","_id":"CIL_1008","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1317009600},"CIL":{"Image_files":[{"Mime_type":"application\/zip","File_type":"Zip","File_path":"1008.zip","Size":2880870},{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"1008.tif","Size":3900000},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"1008.jpg","Size":248011}],"CORE":{"GROUP_ID":"9511","IMAGINGMODE":{"onto_name":"fluorescence microscopy","onto_id":"FBbi:00000246"},"CELLULARCOMPONENT":{"onto_name":"actin filament","onto_id":"GO:0005884"},"BIOLOGICALPROCESS":{"onto_name":"chronological cell aging","onto_id":"GO:0001300"},"SOURCEOFCONTRAST":{"onto_name":"distribution of epitope","onto_id":"FBbi:00000592"},"RELATIONTOINTACTCELL":{"onto_name":"whole mounted tissue","onto_id":"FBbi:00000024"},"NCBIORGANISMALCLASSIFICATION":{"onto_name":"Caenorhabditis elegans","onto_id":"NCBITaxon:6239"},"TECHNICALDETAILS":{"free_text":"C. elegans muscle age is a data set of fluorescence 20X microscopy images of C. elegans nematodes stained with phalloidin to visualize actin in muscles at different ages (1,2,4, and 8).  Note that the data sets for each age can be found by searching for a featured image from each set: CIL 1005, 1057, 1105, and 1265, respectively."},"PARAMETERIMAGED":{"onto_name":"fluorescence emission","onto_id":"FBbi:00000316"},"DIMENSION":[{"Space":{"axis":"X","Image_size":1600}},{"Space":{"axis":"Y","Image_size":1200}}],"VISUALIZATIONMETHODS":{"onto_name":"phalloidin","onto_id":"FBbi:00000100"},"PROCESSINGHISTORY":{"onto_name":"unprocessed raw data","onto_id":"FBbi:00000582"},"CELLTYPE":{"onto_name":"muscle cell","onto_id":"CL:0000187"},"IMAGEDESCRIPTION":{"free_text":"The purpose of the dataset C. 
elegans muscle aging is to deduce the age of the nemathode based on images of muscles.  Images of C. elegans were taken at different chronological ages. This image is part of the day 1 data set. Note that the morphological and chronological ages are not fully correlated due to the variability among individuals even though the individuals are genetically identical. The source for the dataset is Laboratory of Genetics\/NIA\/NIH."},"ATTRIBUTION":{"URLs":[{"Label":"NIH\/NIA Laboratory of Genetics","Href":"http:\/\/ome.grc.nia.nih.gov\/iicbu2008\/celegans\/"}],"Contributors":["Nikita Orlov","Wendy Iser","Cathy Wolkow"]},"TERMSANDCONDITIONS":{"free_text":"public_domain"},"ITEMTYPE":{"onto_name":"recorded image","onto_id":"FBbi:00000265"}}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL1008","ARK":"ark:\/b7295\/w9cil1008","Title":"Nikita Orlov, Wendy Iser, Cathy Wolkow (2011) CIL:1008, Caenorhabditis elegans, muscle cell. CIL. 
Dataset"}}}},{"_index":"ccdbv8","_type":"data","_id":"CIL_10081","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1292562000},"CIL":{"Image_files":[{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"10081.tif","Size":20300000},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"10081.jpg","Size":3033618},{"Mime_type":"application\/zip","File_type":"Zip","File_path":"10081.zip","Size":20497716}],"CORE":{"PREPARATION":[{"onto_name":"formaldehyde fixed tissue","onto_id":"FBbi:00000010"},{"onto_name":"critical_point dried specimen","onto_id":"FBbi:00000581"}],"GROUP_ID":"2858","IMAGINGMODE":{"onto_name":"transmission electron microscopy (TEM)","onto_id":"FBbi:00000258"},"CELLULARCOMPONENT":{"onto_name":"nuclear chromatin","onto_id":"GO:0000790"},"RELATIONTOINTACTCELL":{"free_text":"spread preparation"},"SOURCEOFCONTRAST":{"onto_name":"differences in deposition of metal shadow","onto_id":"FBbi:00000601"},"NCBIORGANISMALCLASSIFICATION":{"free_text":"Notopthalmus viridescence"},"PARAMETERIMAGED":{"onto_name":"electron density","onto_id":"FBbi:00000315"},"DIMENSION":[{"Space":{"axis":"X","Image_size":3696,"Pixel_size":{"unit":"nanometers","value":1}}},{"Space":{"axis":"Y","Image_size":2742,"Pixel_size":{"unit":"nanometers","value":1}}}],"VISUALIZATIONMETHODS":{"onto_name":"shadowing and plating","onto_id":"FBbi:00000398"},"PROCESSINGHISTORY":{"onto_name":"unprocessed raw data","onto_id":"FBbi:00000582"},"CELLTYPE":{"onto_name":"erythrocyte","onto_id":"CL:0000232"},"IMAGEDESCRIPTION":{"free_text":"Nucleated erythrocytes from the newt Notopthahmus viridescens were spread on water, the dispersed chromatin picked up on carbon-formvar grids, fixed with paraformaldehye, critical point dried and shadowed with platinum.  Images were obtained with the Wisconsin high voltage TEM at 1MEV.  For this micrograph, the grid was tilted to 55 degrees. 
A similar micrograph tilted to 45 degrees providing a stereo pair with an oblique 3D view of the tangled and irregular chromatin fibers is grouped with this image."},"ATTRIBUTION":{"Contributors":["Hans Ris"]},"TERMSANDCONDITIONS":{"free_text":"public_domain"},"ITEMTYPE":{"onto_name":"charge coupled device (CCD)","onto_id":"FBbi:00000294"}}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL10081","ARK":"ark:\/b7295\/w9cil10081","Title":"Hans Ris (2010) CIL:10081, Notopthalmus viridescence, erythrocyte. CIL. Dataset"}}}},{"_index":"ccdbv8","_type":"data","_id":"CIL_10094","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1291611600},"CIL":{"Image_files":[{"Mime_type":"application\/zip","File_type":"Zip","File_path":"10094.zip","Size":8039424},{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"10094.tif","Size":8100000},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"10094.jpg","Size":34663}],"CORE":{"PREPARATION":[{"onto_name":"formaldehyde fixed tissue","onto_id":"FBbi:00000010"},{"onto_name":"detergent permeabilized","onto_id":"FBbi:00000262"}],"GROUP_ID":"14451","IMAGINGMODE":{"onto_name":"fluorescence microscopy","onto_id":"FBbi:00000246"},"CELLULARCOMPONENT":[{"onto_name":"cytoskeleton","onto_id":"GO:0005856"},{"onto_name":"microtubule cytoskeleton","onto_id":"GO:0015630"},{"onto_name":"actin cytoskeleton","onto_id":"GO:0015629"},{"onto_name":"axon","onto_id":"GO:0030424"},{"onto_name":"dendrite","onto_id":"GO:0030425"},{"onto_name":"dendritic growth cone","onto_id":"GO:0044294"},{"onto_name":"axonal growth cone","onto_id":"GO:0044295"},{"onto_name":"lamellipodium","onto_id":"GO:0030027"},{"onto_name":"filopodium","onto_id":"GO:0030175"}],"BIOLOGICALPROCESS":[{"onto_name":"developmental process","onto_id":"GO:0032502"},{"onto_name":"dendrite development","onto_id":"GO:0016358"},{"onto_name":"establishment 
or maintenance of cell polarity","onto_id":"GO:0007163"}],"RELATIONTOINTACTCELL":{"onto_name":"dispersed cells in vitro","onto_id":"FBbi:00000611"},"SOURCEOFCONTRAST":[{"onto_name":"distribution of epitope","onto_id":"FBbi:00000592"},{"onto_name":"differences in adsorption or binding of stain","onto_id":"FBbi:00000598"}],"NCBIORGANISMALCLASSIFICATION":{"onto_name":"Rattus","onto_id":"NCBITaxon:10114"},"PARAMETERIMAGED":{"onto_name":"fluorescence emission","onto_id":"FBbi:00000316"},"DIMENSION":[{"Space":{"axis":"X","Image_size":1300,"Pixel_size":{"unit":"microns","value":0.339}}},{"Space":{"axis":"Y","Image_size":1030,"Pixel_size":{"unit":"microns","value":0.339}}}],"VISUALIZATIONMETHODS":[{"onto_name":"phalloidin","onto_id":"FBbi:00000100"},{"onto_name":"primary antibody plus labeled secondary antibody","onto_id":"FBbi:00000156"}],"PROCESSINGHISTORY":{"onto_name":"unprocessed raw data","onto_id":"FBbi:00000582"},"IMAGEDESCRIPTION":{"free_text":"This multi-layer image shows the spatial relationship between filamentous actin (red) and microtubule array (green) in cultured hippocampal neurons, grown for 1 day in vitro.  Actin staining (with rhodamine phalloidin) highlights the growing tips and filopodial extensions along axons and dendrites, while microtubule staining reveals the stable shafts of these processes.  Some nonneuronal cells may also appear in the field.\nDetailed Methods: Embryonic rat hippocampal neurons were prepared as previously described (see Kaech and Banker, 2006, Nat Protoc).  Cells were prepared for fluorescent staining as previously described (Withers and Banker, 1998, in Culturing Nerve Cells, MIT Press).  
Briefly, cells were fixed (4% formaldehyde, 4% sucrose in phosphate buffered saline, pH 7.4,  37\u00b0C, 15 minutes), permeabilized (0.25% Triton, 7 minutes) and immunostained for tubulin (monoclonal DM1A, Sigma, with Alexa 488 conjugated secondary, Molecular Probes, excitation, 494, emission, 519) and rhodamine-conjugated phalloidin (Molecular Probes, excitation, 540, emission, 565).  Fluorescent and phase images were acquired with a Leica DMRA microscope with a mercury arc lamp, a 20X lens (HC PL Fluotar, NA 0.5), Leica GFP filter set (excitation, BP 470\/40; dichromatic mirror, 500, suppression filter, BP 525\/50); Leica N3 filter set (excitation, BP546\/12; dichromatic mirror, 565, suppression filter, BP 600\/40), Photometrics CoolSnap ES CCD camera and MetaMorph software."},"CELLTYPE":{"onto_name":"multipolar neuron","onto_id":"CL:0000104"},"ATTRIBUTION":{"Contributors":["Dieter Brandner; Ginger Withers"]},"TERMSANDCONDITIONS":{"free_text":"attribution_cc_by"},"ITEMTYPE":{"onto_name":"recorded image","onto_id":"FBbi:00000265"}}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL10094","ARK":"ark:\/b7295\/w9cil10094","Title":"Dieter Brandner; Ginger Withers (2010) CIL:10094, Rattus, multipolar neuron. CIL. 
Dataset"}}}},{"_index":"ccdbv8","_type":"data","_id":"CIL_1010","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1317009600},"CIL":{"Image_files":[{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"1010.tif","Size":3900000},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"1010.jpg","Size":123380},{"Mime_type":"application\/zip","File_type":"Zip","File_path":"1010.zip","Size":2880870}],"CORE":{"GROUP_ID":"9511","IMAGINGMODE":{"onto_name":"fluorescence microscopy","onto_id":"FBbi:00000246"},"CELLULARCOMPONENT":{"onto_name":"actin filament","onto_id":"GO:0005884"},"BIOLOGICALPROCESS":{"onto_name":"chronological cell aging","onto_id":"GO:0001300"},"SOURCEOFCONTRAST":{"onto_name":"distribution of epitope","onto_id":"FBbi:00000592"},"RELATIONTOINTACTCELL":{"onto_name":"whole mounted tissue","onto_id":"FBbi:00000024"},"NCBIORGANISMALCLASSIFICATION":{"onto_name":"Caenorhabditis elegans","onto_id":"NCBITaxon:6239"},"TECHNICALDETAILS":{"free_text":"C. elegans muscle age is a data set of fluorescence 20X microscopy images of C. elegans nematodes stained with phalloidin to visualize actin in muscles at different ages (1,2,4, and 8).  Note that the data sets for each age can be found by searching for a featured image from each set: CIL 1005, 1057, 1105, and 1265, respectively."},"PARAMETERIMAGED":{"onto_name":"fluorescence emission","onto_id":"FBbi:00000316"},"DIMENSION":[{"Space":{"axis":"X","Image_size":1600}},{"Space":{"axis":"Y","Image_size":1200}}],"VISUALIZATIONMETHODS":{"onto_name":"phalloidin","onto_id":"FBbi:00000100"},"PROCESSINGHISTORY":{"onto_name":"unprocessed raw data","onto_id":"FBbi:00000582"},"CELLTYPE":{"onto_name":"muscle cell","onto_id":"CL:0000187"},"IMAGEDESCRIPTION":{"free_text":"The purpose of the dataset C. elegans muscle aging is to deduce the age of the nemathode based on images of muscles.  Images of C. elegans were taken at different chronological ages. 
This image is part of the day 1 data set. Note that the morphological and chronological ages are not fully correlated due to the variability among individuals even though the individuals are genetically identical. The source for the dataset is Laboratory of Genetics\/NIA\/NIH."},"ATTRIBUTION":{"URLs":[{"Label":"NIH\/NIA Laboratory of Genetics","Href":"http:\/\/ome.grc.nia.nih.gov\/iicbu2008\/celegans\/"}],"Contributors":["Nikita Orlov","Wendy Iser","Cathy Wolkow"]},"TERMSANDCONDITIONS":{"free_text":"public_domain"},"ITEMTYPE":{"onto_name":"recorded image","onto_id":"FBbi:00000265"}}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL1010","ARK":"ark:\/b7295\/w9cil1010","Title":"Nikita Orlov, Wendy Iser, Cathy Wolkow (2011) CIL:1010, Caenorhabditis elegans, muscle cell. CIL. Dataset"}}}},{"_index":"ccdbv8","_type":"data","_id":"CIL_10102","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1292734800},"CIL":{"Image_files":[{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"10102.tif","Size":4200000},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"10102.jpg","Size":57462},{"Mime_type":"application\/zip","File_type":"Zip","File_path":"10102.zip","Size":4133755}],"CORE":{"PREPARATION":[{"onto_name":"formaldehyde fixed tissue","onto_id":"FBbi:00000010"},{"onto_name":"permeabilized tissue","onto_id":"FBbi:00000093"}],"IMAGINGMODE":{"onto_name":"fluorescence microscopy","onto_id":"FBbi:00000246"},"CELLULARCOMPONENT":[{"onto_name":"clathrin coat","onto_id":"GO:0030118"},{"onto_name":"nucleus","onto_id":"GO:0005634"}],"RELATIONTOINTACTCELL":{"onto_name":"dispersed cells in vitro","onto_id":"FBbi:00000611"},"SOURCEOFCONTRAST":{"onto_name":"distribution of a specific protein","onto_id":"FBbi:00000597"},"CELLLINE":{"onto_name":"ARPE-19","onto_id":"MCC:0000038"},"NCBIORGANISMALCLASSIFICATION":{"onto_name":"Homo 
sapiens","onto_id":"NCBITaxon:9606"},"PARAMETERIMAGED":{"onto_name":"fluorescence emission","onto_id":"FBbi:00000316"},"DIMENSION":[{"Space":{"axis":"X","Image_size":1344,"Pixel_size":{"unit":"nanometers","value":67}}},{"Space":{"axis":"Y","Image_size":1024,"Pixel_size":{"unit":"nanometers","value":67}}},{"Wavelength":{"unit":"nanometers","value":"350, 488"}}],"VISUALIZATIONMETHODS":[{"onto_name":"4',6-diamidino-2-phenylindole (DAPI)","onto_id":"FBbi:00000056"},{"onto_name":"primary antibody plus labeled secondary antibody","onto_id":"FBbi:00000156"},{"onto_name":"Fluorescein (FITC)","onto_id":"FBbi:00000451"}],"PROCESSINGHISTORY":{"onto_name":"unprocessed raw data","onto_id":"FBbi:00000582"},"IMAGEDESCRIPTION":{"free_text":"Cultured retinal pigment epithelial cells immunofluorescently labeled for clathrin (green) and nucleus (blue).  The cells were fixed in 2% PFA and 0.5% Triton X-100 for 2 minutes followed by post-fixation 4% PFA.  Clathrin was detected with X22 primary antibody and secondary FITC antibody.  The nucleus was detected with  DAPI staining. Images were collected on an  Olympus IX-71 epifluorescence microscope using a 100X 1.4 NA objective with 4.500ms exposure for clathrin and 50ms exposure for DAPI (67nm\/pixel)."},"CELLTYPE":{"onto_name":"epithelial cell","onto_id":"CL:0000066"},"ATTRIBUTION":{"Contributors":["Allen Liu","Sandra L. Schmid"]},"TERMSANDCONDITIONS":{"free_text":"public_domain"},"ITEMTYPE":{"onto_name":"recorded image","onto_id":"FBbi:00000265"}}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL10102","ARK":"ark:\/b7295\/w9cil10102","Title":"Allen Liu, Sandra L. Schmid (2010) CIL:10102, Homo sapiens, epithelial cell. CIL. 
Dataset"}}}},{"_index":"ccdbv8","_type":"data","_id":"CIL_10103","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1292734800},"CIL":{"Image_files":[{"Mime_type":"application\/zip","File_type":"Zip","File_path":"10103.zip","Size":4133755},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"10103.jpg","Size":85975},{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"10103.tif","Size":4200000}],"CORE":{"PREPARATION":[{"onto_name":"formaldehyde fixed tissue","onto_id":"FBbi:00000010"},{"onto_name":"permeabilized tissue","onto_id":"FBbi:00000093"}],"IMAGINGMODE":{"onto_name":"fluorescence microscopy","onto_id":"FBbi:00000246"},"CELLULARCOMPONENT":[{"onto_name":"AP-2 adaptor complex","onto_id":"GO:0030122"},{"onto_name":"nucleus","onto_id":"GO:0005634"}],"RELATIONTOINTACTCELL":{"onto_name":"dispersed cells in vitro","onto_id":"FBbi:00000611"},"SOURCEOFCONTRAST":{"onto_name":"distribution of a specific protein","onto_id":"FBbi:00000597"},"CELLLINE":{"onto_name":"ARPE-19","onto_id":"MCC:0000038"},"NCBIORGANISMALCLASSIFICATION":{"onto_name":"Homo sapiens","onto_id":"NCBITaxon:9606"},"PARAMETERIMAGED":{"onto_name":"fluorescence emission","onto_id":"FBbi:00000316"},"DIMENSION":[{"Space":{"axis":"X","Image_size":1344,"Pixel_size":{"unit":"nanometers","value":67}}},{"Space":{"axis":"Y","Image_size":1024,"Pixel_size":{"unit":"nanometers","value":67}}},{"Wavelength":{"unit":"nanometers","value":"350, 488"}}],"VISUALIZATIONMETHODS":[{"onto_name":"4',6-diamidino-2-phenylindole (DAPI)","onto_id":"FBbi:00000056"},{"onto_name":"primary antibody plus labeled secondary antibody","onto_id":"FBbi:00000156"},{"onto_name":"Fluorescein (FITC)","onto_id":"FBbi:00000451"}],"PROCESSINGHISTORY":{"onto_name":"unprocessed raw data","onto_id":"FBbi:00000582"},"IMAGEDESCRIPTION":{"free_text":"Cultured retinal pigment epithelial cells immunofluorescently labeled for adaptor protein-2 (AP2) (green) and nucleus (blue).  
The cells were fixed in 2% PFA and 0.5% Triton X-100 for 2 minutes followed by post-fixation 4% PFA.  AP2 was detected with AP-6 primary antibody and secondary FITC antibody.  The nucleus was detected with  DAPI staining. Images were collected on an  Olympus IX-71 epifluorescence microscope using a 100X 1.4 NA objective with 4.500ms exposure for AP2 and 50ms exposure for DAPI (67nm\/pixel)."},"CELLTYPE":{"onto_name":"epithelial cell","onto_id":"CL:0000066"},"ATTRIBUTION":{"Contributors":["Allen Liu","Sandra L. Schmid"]},"TERMSANDCONDITIONS":{"free_text":"public_domain"},"ITEMTYPE":{"onto_name":"recorded image","onto_id":"FBbi:00000265"}}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL10103","ARK":"ark:\/b7295\/w9cil10103","Title":"Allen Liu, Sandra L. Schmid (2010) CIL:10103, Homo sapiens, epithelial cell. CIL. Dataset"}}}},{"_index":"ccdbv8","_type":"data","_id":"CIL_10110","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1291611600},"CIL":{"Image_files":[{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"10110.tif","Size":8100000},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"10110.jpg","Size":76550},{"Mime_type":"application\/zip","File_type":"Zip","File_path":"10110.zip","Size":8039424}],"CORE":{"PREPARATION":[{"onto_name":"formaldehyde fixed tissue","onto_id":"FBbi:00000010"},{"onto_name":"detergent permeabilized","onto_id":"FBbi:00000262"}],"GROUP_ID":"14451","IMAGINGMODE":{"onto_name":"fluorescence microscopy","onto_id":"FBbi:00000246"},"CELLULARCOMPONENT":[{"onto_name":"cytoskeleton","onto_id":"GO:0005856"},{"onto_name":"microtubule cytoskeleton","onto_id":"GO:0015630"},{"onto_name":"actin cytoskeleton","onto_id":"GO:0015629"},{"onto_name":"axon","onto_id":"GO:0030424"},{"onto_name":"dendrite","onto_id":"GO:0030425"},{"onto_name":"dendritic growth cone","onto_id":"GO:0044294"},{"onto_name":"axonal 
growth cone","onto_id":"GO:0044295"},{"onto_name":"lamellipodium","onto_id":"GO:0030027"},{"onto_name":"filopodium","onto_id":"GO:0030175"}],"BIOLOGICALPROCESS":[{"onto_name":"developmental process","onto_id":"GO:0032502"},{"onto_name":"dendrite development","onto_id":"GO:0016358"},{"onto_name":"establishment or maintenance of cell polarity","onto_id":"GO:0007163"}],"RELATIONTOINTACTCELL":{"onto_name":"dispersed cells in vitro","onto_id":"FBbi:00000611"},"SOURCEOFCONTRAST":[{"onto_name":"distribution of epitope","onto_id":"FBbi:00000592"},{"onto_name":"differences in adsorption or binding of stain","onto_id":"FBbi:00000598"}],"NCBIORGANISMALCLASSIFICATION":{"onto_name":"Rattus","onto_id":"NCBITaxon:10114"},"PARAMETERIMAGED":{"onto_name":"fluorescence emission","onto_id":"FBbi:00000316"},"DIMENSION":[{"Space":{"axis":"X","Image_size":1300,"Pixel_size":{"unit":"microns","value":0.339}}},{"Space":{"axis":"Y","Image_size":1030,"Pixel_size":{"unit":"microns","value":0.339}}}],"VISUALIZATIONMETHODS":[{"onto_name":"phalloidin","onto_id":"FBbi:00000100"},{"onto_name":"primary antibody plus labeled secondary antibody","onto_id":"FBbi:00000156"}],"PROCESSINGHISTORY":{"onto_name":"unprocessed raw data","onto_id":"FBbi:00000582"},"IMAGEDESCRIPTION":{"free_text":"This multi-layer image shows the spatial relationship between filamentous actin (red) and microtubule array (green) in cultured hippocampal neurons, grown for 3 days in vitro.  Actin staining (with rhodamine phalloidin) highlights the growing tips and filopodial extensions along axons and dendrites, while microtubule staining reveals the stable shafts of these processes.  Some nonneuronal cells may also appear in the field.\nDetailed Methods: Embryonic rat hippocampal neurons were prepared as previously described (see Kaech and Banker, 2006, Nat Protoc).  Cells were prepared for fluorescent staining as previously described (Withers and Banker, 1998, in Culturing Nerve Cells, MIT Press).  
Briefly, cells were fixed (4% formaldehyde, 4% sucrose in phosphate buffered saline, pH 7.4,  37\u00b0C, 15 minutes), permeabilized (0.25% Triton, 7 minutes) and immunostained for tubulin (monoclonal DM1A, Sigma, with Alexa 488 conjugated secondary, Molecular Probes, excitation, 494, emission, 519) and rhodamine-conjugated phalloidin (Molecular Probes, excitation, 540, emission, 565).  Fluorescent and phase images were acquired with a Leica DMRA microscope with a mercury arc lamp, a 20X lens (HC PL Fluotar, NA 0.5), Leica GFP filter set (excitation, BP 470\/40; dichromatic mirror, 500, suppression filter, BP 525\/50); Leica N3 filter set (excitation, BP546\/12; dichromatic mirror, 565, suppression filter, BP 600\/40), Photometrics CoolSnap ES CCD camera and MetaMorph software."},"CELLTYPE":{"onto_name":"multipolar neuron","onto_id":"CL:0000104"},"ATTRIBUTION":{"Contributors":["Dieter Brandner; Ginger Withers"]},"TERMSANDCONDITIONS":{"free_text":"attribution_cc_by"},"ITEMTYPE":{"onto_name":"recorded image","onto_id":"FBbi:00000265"}}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL10110","ARK":"ark:\/b7295\/w9cil10110","Title":"Dieter Brandner; Ginger Withers (2010) CIL:10110, Rattus, multipolar neuron. CIL. 
Dataset"}}}},{"_index":"ccdbv8","_type":"data","_id":"CIL_10005","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1291179600},"CIL":{"Image_files":[{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"10005.tif","Size":11300000},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"10005.jpg","Size":3182012},{"Mime_type":"application\/zip","File_type":"Zip","File_path":"10005.zip","Size":22677762}],"CORE":{"PREPARATION":[{"onto_name":"glutaraldehyde fixed tissue","onto_id":"FBbi:00000011"},{"onto_name":"osmium tetroxide fixed tissue","onto_id":"FBbi:00000012"},{"onto_name":"tissue in epoxy resin embedment","onto_id":"FBbi:00000018"},{"onto_name":"microtome-sectioned tissue","onto_id":"FBbi:00000029"}],"GROUP_ID":"3382","IMAGINGMODE":[{"onto_name":"detection of electrons","onto_id":"FBbi:00000375"},{"onto_name":"film","onto_id":"FBbi:00000303"}],"CELLULARCOMPONENT":[{"onto_name":"cell cortex","onto_id":"GO:0005938"},{"onto_name":"cortical microtubule cytoskeleton","onto_id":"GO:0030981"},{"free_text":"extrusomes"}],"BIOLOGICALPROCESS":[{"onto_name":"cortical cytoskeleton organization","onto_id":"GO:0030865"},{"onto_name":"cortical microtubule organization","onto_id":"GO:0043622"},{"onto_name":"plasma membrane organization","onto_id":"GO:0007009"}],"RELATIONTOINTACTCELL":{"onto_name":"microtome-sectioned tissue","onto_id":"FBbi:00000029"},"SOURCEOFCONTRAST":{"onto_name":"stain with broad specificity","onto_id":"FBbi:00000415"},"CELLLINE":{"free_text":"Carolina Biological Supply Company, NC, U.S.A."},"NCBIORGANISMALCLASSIFICATION":{"onto_name":"Didinium nasutum","onto_id":"NCBITaxon:5997"},"PARAMETERIMAGED":{"onto_name":"electron density","onto_id":"FBbi:00000315"},"MOLECULARFUNCTION":[{"onto_name":"structural constituent of cytoskeleton","onto_id":"GO:0005200"},{"onto_name":"structural molecule 
activity","onto_id":"GO:0005198"}],"DIMENSION":[{"Space":{"axis":"X","Image_size":3773}},{"Space":{"axis":"Y","Image_size":2989}}],"VISUALIZATIONMETHODS":[{"onto_name":"stain with broad specificity","onto_id":"FBbi:00000415"},{"onto_name":"osmium tetroxide","onto_id":"FBbi:00000571"},{"onto_name":"uranyl salt","onto_id":"FBbi:00000569"},{"onto_name":"lead salt","onto_id":"FBbi:00000570"}],"PROCESSINGHISTORY":[{"onto_name":"recorded image","onto_id":"FBbi:00000265"},{"onto_name":"film","onto_id":"FBbi:00000303"},{"free_text":"Print from negative scanned to Photoshop."}],"IMAGEDESCRIPTION":{"free_text":"Didinium nasutum. A tangential view of the surface of a non-dividing cell shows several ribbons of microtubules between the alveolar sac and the epiplasm. The thick fibrous layer associated with the epiplasm is evident, and the layer of mitochondria under the epiplasm is wrapped with rough ER. A peroxisome lies nearby and mucocysts and toxicysts are evident. TEM taken on 2\/18\/69 by R. Allen with Philips 300 operating at 60kV. Neg. 20,500X. Bar = 0.5\u00b5m. The negative was printed to paper and the image was scanned to Photoshop. This digitized image is available for qualitative analysis. A raw, unprocessed, high resolution version of this image (CIL:9928) is in the library and available for quantitative analysis. Standard glutaraldehyde fixation followed by osmium tetroxide, dehydrated in alcohol and embedded in an epoxy resin. Microtome sections prepared at approximately 75nm thickness. 
Additional information available at (http:\/\/www5.pbrc.hawaii.edu\/allen\/)."},"CELLTYPE":[{"onto_name":"cell by organism","onto_id":"CL:0000004"},{"onto_name":"eukaryotic cell","onto_id":"CL:0000255"},{"free_text":"Eukaryotic Protist"},{"free_text":"Ciliated Protist"}],"DATAQUALIFICATION":{"free_text":"PROCESSED;spatialmeasurements"},"ATTRIBUTION":{"URLs":[{"Href":"http:\/\/www5.pbrc.hawaii.edu\/allen\/"}],"Contributors":["Richard Allen"]},"TERMSANDCONDITIONS":{"free_text":"public_domain"},"ITEMTYPE":[{"onto_name":"transmission electron microscopy (TEM)","onto_id":"FBbi:00000258"},{"onto_name":"illumination by electrons","onto_id":"FBbi:00000273"}]}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL10005","ARK":"ark:\/b7295\/w9cil10005","Title":"Richard Allen (2010) CIL:10005, Didinium nasutum, cell by organism, eukaryotic cell, Eukaryotic Protist, Ciliated Protist. CIL. Dataset"}}}},{"_index":"ccdbv8","_type":"data","_id":"CIL_10008","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1291179600},"CIL":{"Image_files":[{"Mime_type":"application\/zip","File_type":"Zip","File_path":"10008.zip","Size":23050754},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"10008.jpg","Size":3801859},{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"10008.tif","Size":11400000}],"CORE":{"PREPARATION":[{"onto_name":"glutaraldehyde fixed tissue","onto_id":"FBbi:00000011"},{"onto_name":"osmium tetroxide fixed tissue","onto_id":"FBbi:00000012"},{"onto_name":"tissue in epoxy resin embedment","onto_id":"FBbi:00000018"},{"onto_name":"microtome-sectioned tissue","onto_id":"FBbi:00000029"}],"GROUP_ID":"3411","IMAGINGMODE":[{"onto_name":"detection of electrons","onto_id":"FBbi:00000375"},{"onto_name":"film","onto_id":"FBbi:00000303"}],"CELLULARCOMPONENT":[{"onto_name":"rough endoplasmic 
reticulum","onto_id":"GO:0005791"},{"onto_name":"lipid particle","onto_id":"GO:0005811"},{"free_text":"extrusome"},{"free_text":"toxicyst"}],"BIOLOGICALPROCESS":[{"onto_name":"cytoplasm organization","onto_id":"GO:0007028"},{"onto_name":"organelle organization","onto_id":"GO:0006996"},{"onto_name":"organelle localization","onto_id":"GO:0051640"},{"free_text":"organelle development"}],"RELATIONTOINTACTCELL":{"onto_name":"microtome-sectioned tissue","onto_id":"FBbi:00000029"},"SOURCEOFCONTRAST":{"onto_name":"stain with broad specificity","onto_id":"FBbi:00000415"},"CELLLINE":{"free_text":"Carolina Biological Supply Company, NC, U.S.A."},"NCBIORGANISMALCLASSIFICATION":{"onto_name":"Didinium nasutum","onto_id":"NCBITaxon:5997"},"PARAMETERIMAGED":{"onto_name":"electron density","onto_id":"FBbi:00000315"},"DIMENSION":[{"Space":{"axis":"X","Image_size":3732}},{"Space":{"axis":"Y","Image_size":3032}}],"VISUALIZATIONMETHODS":[{"onto_name":"stain with broad specificity","onto_id":"FBbi:00000415"},{"onto_name":"osmium tetroxide","onto_id":"FBbi:00000571"},{"onto_name":"uranyl salt","onto_id":"FBbi:00000569"},{"onto_name":"lead salt","onto_id":"FBbi:00000570"}],"PROCESSINGHISTORY":[{"onto_name":"film","onto_id":"FBbi:00000303"},{"free_text":"Print from negative scanned for Photoshop."}],"IMAGEDESCRIPTION":{"free_text":"Toxicysts in the cytoplasm of cross sectioned non-dividing Didinium. Toxicysts (extrusomes) are multilayered cylinders that are discharged from the cytopharynx region during capture of prey such as Paramecium. Lipid is stored as large non-membrane enclosed dense bodies in the cytosol. TEM taken on 2\/18\/69 by R. Allen with Philips 300. Neg. 20,500X. Bar = 0.5\u00b5m. The negative was printed to paper and the image was scanned to Photoshop. This digitized image is available for qualitative analysis. A raw, unprocessed, high resolution version of this image (CIL:9929) is in the library and available for quantitative analysis. 
Standard glutaraldehyde fixation followed by osmium tetroxide, dehydrated in alcohol and embedded in an epoxy resin. Microtome sections prepared at approximately 75nm thickness. Additional information available at (http:\/\/www5.pbrc.hawaii.edu\/allen\/)."},"CELLTYPE":[{"onto_name":"eukaryotic cell","onto_id":"CL:0000255"},{"free_text":"Eukaryotic Protist"},{"free_text":"Ciliated Protist"}],"ATTRIBUTION":{"URLs":[{"Href":"http:\/\/www5.pbrc.hawaii.edu\/allen\/"}],"Contributors":["Richard Allen"]},"TERMSANDCONDITIONS":{"free_text":"public_domain"},"ITEMTYPE":[{"onto_name":"transmission electron microscopy (TEM)","onto_id":"FBbi:00000258"},{"onto_name":"illumination by electrons","onto_id":"FBbi:00000273"}]}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL10008","ARK":"ark:\/b7295\/w9cil10008","Title":"Richard Allen (2010) CIL:10008, Didinium nasutum, eukaryotic cell, Eukaryotic Protist, Ciliated Protist. CIL. Dataset"}}}},{"_index":"ccdbv8","_type":"data","_id":"CIL_10013","_score":0.4948447,"_source":{"CIL_CCDB":{"Status":{"Deleted":false,"Is_public":true,"Publish_time":1291266000},"CIL":{"Image_files":[{"Mime_type":"image\/tif","File_type":"OME_tif","File_path":"10013.tif","Size":8700000},{"Mime_type":"image\/jpeg; charset=utf-8","File_type":"Jpeg","File_path":"10013.jpg","Size":1886786},{"Mime_type":"application\/zip","File_type":"Zip","File_path":"10013.zip","Size":17022022}],"CORE":{"PREPARATION":[{"onto_name":"glutaraldehyde fixed tissue","onto_id":"FBbi:00000011"},{"onto_name":"osmium tetroxide fixed tissue","onto_id":"FBbi:00000012"},{"onto_name":"tissue in epoxy resin embedment","onto_id":"FBbi:00000018"},{"onto_name":"microtome-sectioned tissue","onto_id":"FBbi:00000029"}],"GROUP_ID":"3810","IMAGINGMODE":[{"onto_name":"detection of electrons","onto_id":"FBbi:00000375"},{"onto_name":"film","onto_id":"FBbi:00000303"}],"CELLULARCOMPONENT":[{"onto_name":"cell 
cortex","onto_id":"GO:0005938"},{"onto_name":"cilium","onto_id":"GO:0005929"},{"onto_name":"microtubule basal body","onto_id":"GO:0005932"},{"onto_name":"ciliary rootlet","onto_id":"GO:0035253"},{"free_text":"bacterial ectosymbiont"}],"BIOLOGICALPROCESS":[{"onto_name":"ciliary cell motility","onto_id":"GO:0060285"},{"onto_name":"microtubule cytoskeleton organization","onto_id":"GO:0000226"},{"free_text":"cytosketal organization"},{"onto_name":"detection of symbiotic bacterium","onto_id":"GO:0009604"}],"RELATIONTOINTACTCELL":{"onto_name":"microtome-sectioned tissue","onto_id":"FBbi:00000029"},"SOURCEOFCONTRAST":{"onto_name":"stain with broad specificity","onto_id":"FBbi:00000415"},"CELLLINE":{"free_text":"Carolina Biological Supply Company, NC, U.S.A."},"NCBIORGANISMALCLASSIFICATION":{"onto_name":"Didinium nasutum","onto_id":"NCBITaxon:5997"},"PARAMETERIMAGED":{"onto_name":"electron density","onto_id":"FBbi:00000315"},"DIMENSION":[{"Space":{"axis":"X","Image_size":3271}},{"Space":{"axis":"Y","Image_size":2634}}],"VISUALIZATIONMETHODS":[{"onto_name":"stain with broad specificity","onto_id":"FBbi:00000415"},{"onto_name":"osmium tetroxide","onto_id":"FBbi:00000571"},{"onto_name":"uranyl salt","onto_id":"FBbi:00000569"},{"onto_name":"lead salt","onto_id":"FBbi:00000570"}],"PROCESSINGHISTORY":[{"onto_name":"film","onto_id":"FBbi:00000303"},{"free_text":"Print from negative scanned for Photoshop."}],"IMAGEDESCRIPTION":{"free_text":"An oblique section shows the transverse aspect of basal bodies\/cilia within one pectinelle of one of the two ciliary girdles of Didinium. Two ribbons of microtubules and a short kinetodesmal fiber arise from the proximal margin of each basal body. These may correspond to transverse microtubules and  postciliary microtubules of other ciliates. Tips of parasomal sacs are also present. Microtubular ribbons under the alveoli seem to arise from extensions of the postciliary microtubules. TEM taken on 5\/9\/69 by R. 
Allen with Philips 300 operating at 60kV. Neg. 14,800X. Bar = 0.5\u00b5m. The negative was printed to paper and the image was scanned to Photoshop. This digitized image is available for qualitative analysis. A raw, unprocessed, high resolution version of this image (CIL:4663) is in the library and available for quantitative analysis. \n\nStandard glutaraldehyde fixation followed by osmium tetroxide, dehydrated in alcohol and embedded in an epoxy resin. Microtome sections prepared at approximately 75nm thickness.\n\nAdditional information available at (http:\/\/www5.pbrc.hawaii.edu\/allen\/)."},"CELLTYPE":[{"onto_name":"eukaryotic cell","onto_id":"CL:0000255"},{"free_text":"Eukaryotic Protist"},{"free_text":"Ciliated Protist"}],"DATAQUALIFICATION":{"free_text":"PROCESSED;spatialmeasurements"},"ATTRIBUTION":{"URLs":[{"Href":"http:\/\/www5.pbrc.hawaii.edu\/allen\/"}],"Contributors":["Richard Allen"]},"TERMSANDCONDITIONS":{"free_text":"public_domain"},"ITEMTYPE":[{"onto_name":"transmission electron microscopy (TEM)","onto_id":"FBbi:00000258"},{"onto_name":"illumination by electrons","onto_id":"FBbi:00000273"}]}},"Data_type":{"Time_series":false,"Still_image":true,"Z_stack":false,"Video":false},"Citation":{"DOI":"doi:10.7295\/W9CIL10013","ARK":"ark:\/b7295\/w9cil10013","Title":"Richard Allen (2010) CIL:10013, Didinium nasutum, eukaryotic cell, Eukaryotic Protist, Ciliated Protist. CIL. Dataset"}}}}]}}
abbypenn93 commented 4 years ago

Are you in touch with Willy Wong (wawong@gmail.com)? I believe Willy is the person you'll need to work with.

lsitu commented 4 years ago

@abbypenn93 It looks like the path CIL_CCDB.CIL.CORE.TERMSANDCONDITIONS.free_text in row #53 hasn't been corrected yet; it should be CIL_CCDB.CORE.TERMSANDCONDITIONS.free_text. Also, per our discussions above, we would like to map it to `copyrightStatus`, right?

lsitu commented 4 years ago

@abbypenn93 Never mind. I see the path is correct, and I'll correct the mapping in my code to Copyright status instead. But the "CIL Processing and Mapping Instructions" may need to be updated as well. Thanks.
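
For anyone checking the mapping by hand, here is a minimal sketch (in Python, not the damsmanager code) of resolving a dotted path such as the one in row #53 against a harvested record. The helper name `get_path` is hypothetical, and the sample record is trimmed from the `_source` part of the CIL_1010 entry quoted earlier in this thread.

```python
def get_path(record, dotted_path, default=None):
    """Walk a nested dict along a dot-separated path.

    Returns `default` as soon as a key is missing or a non-dict is reached.
    """
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Trimmed from the "_source" object of the CIL_1010 record quoted above.
record = {
    "CIL_CCDB": {
        "CIL": {
            "CORE": {
                "TERMSANDCONDITIONS": {"free_text": "public_domain"},
            }
        }
    }
}

# Path as listed in the mapping instructions (row #53).
PATH = "CIL_CCDB.CIL.CORE.TERMSANDCONDITIONS.free_text"

# "Copyright status" is the target column named in the comment above.
row = {"Copyright status": get_path(record, PATH)}
print(row)
```

With this sample, the shorter path CIL_CCDB.CORE.TERMSANDCONDITIONS.free_text does not resolve, which is consistent with the longer path being the correct one.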

abbypenn93 commented 4 years ago

Thanks, Longshou. I'll update CIL Processing and Mapping Instructions over the next few days.

lsitu commented 4 years ago

Thanks @abbypenn93. Please let me know if there are any other changes so that I can update the one used by damsmanager in my code.

lsitu commented 4 years ago

@mcritchlow I've added PR https://github.com/ucsdlib/damsmanager/pull/359 to fix the inconsistent data and complex mapping issues that @abbypenn93 reported during the second round of QA. I also fixed a weird issue in the CIL public API download, which had been working (even on staging for the August auto harvest). However, this time I saw it break, with a JSON source response received at the very beginning. After Willy fixed the JSON source issue, the results still wouldn't update and always returned a result with a single CIL ID, even when using the standard HttpParam to add parameters. Luckily, it finally works with a URL-encoded query string. I'm not sure how that could happen now with the public CIL API, though.
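
As a rough illustration of the workaround described above (parameters placed in a URL-encoded query string), here is a minimal Python sketch. It is not the code from the PR; the base URL and the parameter names `query` and `size` are placeholders, not the real CIL API.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Append params to base_url as a percent-encoded query string."""
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint and parameter names, for illustration only.
url = build_search_url(
    "https://cil-api.example.org/search",
    {"query": "muscle cell", "size": 10},
)
print(url)
```

`urlencode` escapes spaces and reserved characters in each value, so the full request can be sent as a single URL.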

lsitu commented 4 years ago

@gamontoya Could we make a new release for damsmanager so that we can test it on staging for the CIL Harvesting QA again? Thanks.

abbypenn93 commented 4 years ago

It looks like the latest harvest/spreadsheet version is from August 1, but the metadata don't look complete. Is there another version that is ready for review? Thanks, Abby

lsitu commented 4 years ago

@abbypenn93 I think the August 1 version was harvested by the monthly automatic process, and it should be the same as the last version that you QA'd. The new version is in progress and I think it'll be done in a couple of days. I'll let you know once it's ready for QA.

But could you QA the monthly automatic process for August 1? We would expect new JSON source added in July to be harvested there. Thank you.

abbypenn93 commented 4 years ago

QA of cil_harvest_2019_08_01 harvest.

*3. Subject headings file

The headings file (cil_excel_subject_headings.csv) contains subject terms (onto_name and free_text), but not the closeMatch content (onto_id).

*8. TERMSANDCONDITIONS - no copyright note column in output file.

*11. How can I confirm that only new objects are being harvested?

*15. Person:researcher

In objects 50625, 50600, 50624, etc., only the institution is returned. Example:

"ATTRIBUTION": { "Contributors": [ "Adriana Handra-Luca,", "APHP University Paris Nord" ]

Please return both the researcher's name and "APHP University Paris Nord"; DOMM will clean up the field later.

Also, objects with >1 researcher name are not complete. For example, object 50518: "ATTRIBUTION": { "URLs": [ { "Label": "Article", "Href": "https://link.springer.com/chapter/10.1007/978-1-4613-1657-2_6" } ], "Contributors": [ "Mark Ellisman", "Rama Ranganathan", "Thomas Deerinck", "Stephen Young", "David Hessler", "Robert Terry"

Only Robert Terry is listed for this object. Same with object 50583.

*20. Map fields to a single Note:technical details column

There are 2 Note:technical details columns -- all items should appear as a list in one column.

The list of parameters is incomplete. For instance, object 50625 contains "Imaging mode: microscopy" and "Item type" (in a separate column) but should also include "Visualization methods: x100".

The Note:Technical details entry for object 50625 should look like:

Imaging mode: serial block face SEM (SBFSEM)
Item type: recorded image
Visualization methods: x100

-This particular harvest doesn't have a lot of objects with parameters that go in the Note:technical details column, but Visualization methods is missing from 38 objects (50600 - 50638).
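
To make the expected Note:technical details output concrete, here is a minimal Python sketch of collecting the parameters into a single column value. It is not the damsmanager implementation. The label mapping covers only the three fields named above, the comma separator for multiple values is an assumption, and the sample input is made up to mirror the expected entry for object 50625 rather than copied from the actual record.

```python
# Field-to-label mapping; only the three fields named in this comment are covered.
LABELS = {
    "IMAGINGMODE": "Imaging mode",
    "ITEMTYPE": "Item type",
    "VISUALIZATIONMETHODS": "Visualization methods",
}


def term_values(field):
    """Return the onto_name or free_text of each entry.

    In the harvested JSON a field can be a single object or a list of objects.
    """
    entries = field if isinstance(field, list) else [field]
    values = []
    for entry in entries:
        value = entry.get("onto_name") or entry.get("free_text")
        if value:
            values.append(value)
    return values


def technical_details_note(core):
    """Build one multi-line note from the CORE block of a record."""
    lines = []
    for key, label in LABELS.items():
        values = term_values(core.get(key, []))
        if values:
            # Assumption: multiple values for one field are joined with ", ".
            lines.append(f"{label}: {', '.join(values)}")
    return "\n".join(lines)


# Illustrative input shaped like the expected entry for object 50625.
sample_core = {
    "IMAGINGMODE": {"onto_name": "serial block face SEM (SBFSEM)"},
    "ITEMTYPE": {"onto_name": "recorded image"},
    "VISUALIZATIONMETHODS": {"free_text": "x100"},
}
note = technical_details_note(sample_core)
print(note)
```

Handling both the single-object and list forms in one place matters here because fields such as IMAGINGMODE appear in both shapes in the records quoted earlier in this thread.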

These are pretty simple records compared to the last harvest so there are some fixes that I can't confirm as complete at this time.

Thanks, Abby

lsitu commented 4 years ago

@abbypenn93 Thanks. I think all the issues above should be fixed in the new version that we deployed to staging yesterday.

abbypenn93 commented 4 years ago

BTW: there are only 2 files present in the content_files directory for the August 8 harvest (even though the files are correctly reported in the spreadsheet)

lsitu commented 4 years ago

@abbypenn93 Thanks. Yes, I found the mistake in the content download configuration for staging and QA, and I've corrected it: https://github.com/ucsdlib/private_config/pull/19.

@mcritchlow Could you review and merge PR https://github.com/ucsdlib/private_config/pull/19? Thank you.

Cell Type % intense % equivocal % none # assays