Closed gamontoya closed 4 years ago
@gamontoya Should we exclude curator objects and UCSD-only objects from the report? Do you need the total count of objects or the number of files?
@lsitu Include everything for this report. The count should be total number of files.
@gamontoya Also, do we need to count the alternate files or not?
@lsitu What are alternate files?
Derivatives, you mean?
@lsitu If you mean derivatives, no. I want stats for the master copies of each format.
@gamontoya It looks like those alternate files could be the original files as well, like the second .zip files in mscl. And I believe there could be other cases that have alternate files.
It looks like we may need some rules to map the following formats when more than one format is found in an object:
- cartographic: https://library.ucsd.edu/dc/object/bb67930504
- mixed material: https://library.ucsd.edu/dc/object/bb8571489c
- software: https://library.ucsd.edu/dc/object/bb3758941t
- three dimensional object: https://library.ucsd.edu/dc/object/bb0549894r

In the cases above, should we map the master files in each format? What are the mapping rules for the master files and their formats?
Longshou:
If more than one format is found for an object, count each separately, and we should only map the master files in each format.
@gamontoya If there is only one file but two formats, like the examples above in cartographic (`image` and `cartographic`) and software (`data` and `software`), do you mean we just count it for `cartographic` and `software`, and ignore the other formats like `image` and `data`?
@gamontoya I've generated the report for format sizes based on the file use property of master files with file names like `1.*` OR `1`. The alternate files haven't been counted in the report.
Total objects found: 122435
Format | Total | Size (GB) |
---|---|---|
image | 109,024 | 2,919 |
data | 10,816 | 19,341 |
video | 1,592 | 11,896 |
audio | 20,478 | 6,759 |
text | 20,096 | 556 |
I found that the following files have technical metadata problems and may need to be cleaned up:
No technical metadata or missing files:
- bb0684082f _46_1.CR2: sourceFileName KEN_r09l043_r09b0255_002.CR2, use image-source
- bb9458122m _1_1.pdf: sourceFileName pres_b26636529_H.pdf, use document-source
- bb9492512z _1_1.pdf: sourceFileName pres_b17612470_H.pdf, use document-source
- bb97655254 _2_1.pdf: sourceFileName pres_b56000340_H.pdf, use document-source
- bb64890459 _1_1.pdf: sourceFileName pres_b32239671_H.pdf, use document-source
- bb5598799x _15_1.CR2: sourceFileName KEN_r09l034_r09B0159_007.CR2, use image-source
- bb2837018t _1_1.pdf: sourceFileName pres_b42361771_H.pdf, use document-source
- bb2666456j _1_1.pdf: sourceFileName pres_b43433315_H.pdf, use document-source
- bb9048700j _1_1.pdf: sourceFileName pres_b46427090_H.pdf, use document-source
- bb0345627p _1_1.pdf: sourceFileName pres_b16643951_H.pdf, use document-source
- bb34486983 1.wav: sourceFileName DMCA18202.wav, use audio-source
- bb51238605 _2_1.pdf: sourceFileName pres_b43455864_H.pdf, use document-source
- bb0888859j _4_1.CR2: sourceFileName KEN_w09l125_w09b0886_001.CR2, use image-source
- bb1742710g _2_1.wav: sourceFileName spc882_t1b.wav, use audio-source
- bb90147526 1.pdf: sourceFileName SIOGDC_GECS0JMV_20070608200255001_20070608200255001_GECS0JMV_cruise_report.pdf, use document-service
- bb04452633 1.wav: sourceFileName DMCA11325.wav, use audio-source
- bb32808394 _1_1.pdf: sourceFileName pres_b25920455_H.pdf, use document-source
- bb0994789m 1.pdf: sourceFileName 356183916071234737.pdf, use document-service

Empty Objects/Files:
- bb2869941n 1.tif
- bb92846612 _52_1.project (duplicate file URL)
@lsitu Thank you. Can you look at the results we get when browsing by format in the DC:
https://library.ucsd.edu/dc/search/facet/object_type_sim?facet.sort=index
122,188 total vs your count of 162,006 -- do you know what's the difference?
Could you run the DC SPARQL query that currently exists for browse by format and add the size counts to an updated report/output or is that not possible?
@gamontoya I think the difference is that the counts in my report are files, while the counts in the damspas search result https://library.ucsd.edu/dc/search/facet/object_type_sim?facet.sort=index are objects.
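As a quick arithmetic check on the two numbers being compared, the sketch below sums the per-format file totals from the first report and compares them with the 122,188 objects shown on the facet page. The variable names are illustrative only; the figures are copied from this thread.

```python
# File totals per format, copied from the first report in this thread.
file_totals = {
    "image": 109_024,
    "data": 10_816,
    "video": 1_592,
    "audio": 20_478,
    "text": 20_096,
}

# Object total shown on the damspas format facet page.
facet_object_total = 122_188

total_files = sum(file_totals.values())
gap = total_files - facet_object_total

print(total_files)  # 162006, the file count quoted above
print(gap)          # 39818 more files than objects
```

The gap is expected if many objects hold more than one master file, since the first figure counts files and the second counts objects.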
I am running a SOLR query for the report at this time. As we discussed yesterday, there is the issue of how to count and map the files to each format when multiple formats exist in an object, especially when there are several formats and many files in a complex object. We may need to discuss and set up some rules to regenerate the report. Could you add those rules for mapping the files to formats to the spec? Thank you.
@gamontoya I ran another report on solr to include the object counts for each format determined by file use, which includes curator objects. If an object contains a file with a file use starting with that format, the object count for that format increases by 1. Here is the report:

Format | Object Count | File Count | Size (GB) |
---|---|---|---|
image | 72,231 | 109,024 | 2,919 |
data | 8,383 | 10,816 | 19,341 |
video | 1,060 | 1,592 | 11,896 |
audio | 18,075 | 20,478 | 6,759 |
text | 19,201 | 20,096 | 556 |
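The counting rule described above (an object counts once toward a format if it holds a file whose use starts with that format) can be sketched roughly as follows. This is a minimal illustration, not the actual SOLR report: the records are made up, the assumption that master files are the ones whose use ends in `-source` is mine, and GB is taken as 10^9 bytes.

```python
from collections import defaultdict

# Hypothetical file records: (object_id, file_use, size_in_bytes).
FILES = [
    ("obj-a", "image-source", 3_000_000_000),
    ("obj-a", "image-source", 1_000_000_000),
    ("obj-a", "image-thumbnail", 50_000),      # derivative, skipped
    ("obj-b", "audio-source", 2_000_000_000),
    ("obj-b", "image-source", 500_000_000),
]

def summarize(files):
    """Return {format: (object_count, file_count, size_gb)} for master files."""
    objects = defaultdict(set)
    file_counts = defaultdict(int)
    sizes = defaultdict(int)
    for object_id, use, size in files:
        fmt, _, role = use.partition("-")
        if role != "source":  # assumption: only "-source" uses are masters
            continue
        objects[fmt].add(object_id)   # each object counts once per format
        file_counts[fmt] += 1
        sizes[fmt] += size
    return {
        fmt: (len(objects[fmt]), file_counts[fmt], sizes[fmt] / 1e9)
        for fmt in file_counts
    }

report = summarize(FILES)
print(report["image"])  # (2, 3, 4.5)
print(report["audio"])  # (1, 1, 2.0)
```

Note that `obj-b` contributes to both `image` and `audio` here, which is why object counts summed across formats can exceed the number of distinct objects.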
Just let me know if you need the report generated with the same formats as the search result in damspas.
@gamontoya Here is the report with the formats from the damspas facet page:

Format | Object Counts | File Counts | Size (GB) |
---|---|---|---|
cartographic | 1,071 | 54 | 10 |
data | 7,746 | 8,151 | 14,615 |
image | 70,537 | 103,929 | 3,691 |
software | 5 | 26 | 27 |
text | 26,758 | 40,649 | 6,395 |
three dimensional object | 15 | 18 | 2 |
video | 1,060 | 3,161 | 11,967 |
mixed material | 5 | 17 | 1 |
The following rules are applied, in order, when counting the master files toward a format:
Note: The format sizes will be different if the above rules change, especially for `data`, `text`, and `software`.
@gamontoya I explored the files counted and reran the report. I am seeing different results, with some files counted more than once. It seems hard to determine which format a file should be counted under in complex objects that have more than one format and lots of component files with different extensions, which produces confusing results with files counted more than once.
For example, object https://library.ucsd.edu/dc/object/bb6520310z has three formats (`text`, `image` and `data`) and several `.pdf` files. Should we map all the PDF files to the image format?
Another more complex case is object https://library.ucsd.edu/dc/object/bb11995109 with four formats (`software`, `three dimensional object`, `data` and `image`) and more than 10 files with extensions like `.zip`, `.exr`, `.unitypackage`, `.png`, `.x3d` etc. I think we have lots of objects like bb11995109. How should we map these files to each format? Is there a general rule to map the master files to a format by file use or file extension? Or could we count all master files in an object toward the format that is hit with a search?
Another issue I see is that some master files in an object fall outside the object's formats, and we don't know how to count them. For example, object https://library.ucsd.edu/dc/object/bb67249163 has two formats, `data` and `video`, but there are several image files in the object that we can map to neither video nor data.
Also, objects with no format/typeOfResource metadata won't be hit by the format facet search, so they won't be counted in the report.
@lsitu For this report, just
@gamontoya Do you mean that if there is more than one format in an object, we should just map all the master files in that object to one format, like `data` if it exists? Do you want a list of those formats at the object level? We have some objects that do not have any format/typeOfResource metadata at the object level but only in components, and it differs from component to component. How should we deal with them?
@lsitu This report doesn't have to be perfect and it sounds like it's getting too complicated.
I think at this point I'll just go with what you reported here:
Format | Object Counts | File Counts | Size (GB) |
---|---|---|---|
cartographic | 1,071 | 54 | 10 |
data | 7,746 | 8,151 | 14,615 |
image | 70,537 | 103,929 | 3,691 |
software | 5 | 26 | 27 |
text | 26,758 | 40,649 | 6,395 |
three dimensional object | 15 | 18 | 2 |
video | 1,060 | 3,161 | 11,967 |
mixed material | 5 | 17 | 1 |
@gamontoya Got it. But some master files are counted more than once for complex objects with more than one format. Do you want to remove those duplicate counts?
@lsitu Yes, you can update the counts for those complex objects for which we are only counting one format type.
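One way to count each master file only once is to assign every complex object to a single format and count all of its master files there. The sketch below assumes a priority order with `data` first, echoing the suggestion earlier in the thread; the order itself, the function names, and the sample objects are all hypothetical, not the rules actually used for the report.

```python
# Hypothetical priority order; the thread does not settle on one.
PRIORITY = ["data", "video", "image", "text"]

def assign_single_format(object_formats, priority=PRIORITY):
    """Pick the first format in priority order that the object has, else None."""
    for fmt in priority:
        if fmt in object_formats:
            return fmt
    return None

def count_files_once(objects):
    """objects maps object_id -> (set of formats, list of master file names)."""
    counts = {}
    for formats, master_files in objects.values():
        fmt = assign_single_format(formats)
        if fmt is None:  # no object-level format: not counted, as noted above
            continue
        counts[fmt] = counts.get(fmt, 0) + len(master_files)
    return counts

# Made-up objects for illustration.
sample = {
    "obj-1": ({"text", "image", "data"}, ["1.pdf", "2.pdf"]),
    "obj-2": ({"image"}, ["1.tif"]),
    "obj-3": (set(), ["1.zip"]),
}

counts = count_files_once(sample)
print(counts)  # {'data': 2, 'image': 1}
```

With this approach the file counts across formats sum to the number of master files in objects that have a format, at the cost of attributing mixed objects entirely to one format.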
@gamontoya Here is the list of 1663 objects that don't have object-level format/typeOfResource metadata. Note that some are empty objects with no descriptive metadata, and some of them may have component-level format metadata: report-objects-no-formats-1663.txt
Descriptive summary
Please provide a report (csv, tab-delimited are okay) pulling out content from the DAMS by format type and the total size.
For example
Rationale
I need to provide these numbers to the UC Digital Preservation Strategy Working Group by February 21.