ucsdlib / damspas

UC San Diego DAMS Hydra Head

Create formats report #745

Closed gamontoya closed 4 years ago

gamontoya commented 4 years ago

Descriptive summary

Please provide a report (csv, tab-delimited are okay) pulling out content from the DAMS by format type and the total size.

For example

Rationale

I need to provide these numbers to the UC Digital Preservation Strategy Working Group by February 21.

lsitu commented 4 years ago

@gamontoya Should we exclude curator objects and UCSD-only objects from the report? Should the count be the total number of objects or the number of files?

gamontoya commented 4 years ago

@lsitu Include everything for this report. The count should be total number of files.

lsitu commented 4 years ago

@gamontoya Also, do we need to count the alternate files or not?

gamontoya commented 4 years ago

@lsitu What are alternate files?

gamontoya commented 4 years ago

Derivatives, you mean?

gamontoya commented 4 years ago

@lsitu If you mean derivatives, no. I want stats for the master copies of each format.

lsitu commented 4 years ago

@gamontoya It looks like those alternate files could be original files as well, like the second .zip files in mscl. And I believe there could be other cases that have alternate files.

It looks like we may need some rules for mapping the following formats when more than one format is found in an object:

cartographic: https://library.ucsd.edu/dc/object/bb67930504
mixed material: https://library.ucsd.edu/dc/object/bb8571489c
software: https://library.ucsd.edu/dc/object/bb3758941t
three dimensional object: https://library.ucsd.edu/dc/object/bb0549894r

In the cases above, should we map the master files to each format? What are the rules for mapping master files to their formats?

gamontoya commented 4 years ago

Longshou:

If more than one format is found for an object, then count each separately. We should only map the master files in each format.

lsitu commented 4 years ago

@gamontoya If there is only one file but two formats, as in the examples above for cartographic (image and cartographic) and software (data and software), do you mean we should count it only for cartographic and software, and ignore the other formats (image and data)?

lsitu commented 4 years ago

@gamontoya I've got the report for format sizes based on the file use property of master files with file names like 1.* OR 1. The alternate files have not been counted in the report. Total objects found: 122435

| Format | Total (files) | Size (GB) |
| --- | --- | --- |
| image | 109,024 | 2,919 |
| data | 10,816 | 19,341 |
| video | 1,592 | 11,896 |
| audio | 20,478 | 6,759 |
| text | 20,096 | 556 |

I found that the following files have technical metadata problems and may need to be cleaned up:

gamontoya commented 4 years ago

@lsitu Thank you. Can you look at the results we get when browsing by format in the DC:

https://library.ucsd.edu/dc/search/facet/object_type_sim?facet.sort=index

122,188 total vs. your count of 162,006. Do you know what accounts for the difference?

Could you run the DC SPARQL query that currently exists for browse by format and add the size counts to an updated report/output or is that not possible?

lsitu commented 4 years ago

@gamontoya I think the difference is that the counts in my report are files, while the counts in the damspas search result https://library.ucsd.edu/dc/search/facet/object_type_sim?facet.sort=index are objects.
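
As a quick arithmetic check on the two totals, the sketch below sums the per-format file counts from the file-use report above and compares the result with the object total quoted from the facet page. The variable names are illustrative only.

```python
# File counts per format, copied from the file-use report above.
file_counts = {
    "image": 109_024,
    "data": 10_816,
    "video": 1_592,
    "audio": 20_478,
    "text": 20_096,
}

total_files = sum(file_counts.values())
facet_object_total = 122_188  # object total quoted from the format facet page

print(total_files)                       # 162006
print(total_files - facet_object_total)  # 39818
```

The file total matches the 162,006 figure in the question, which is consistent with the report counting files while the facet page counts objects.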

lsitu commented 4 years ago

I am running a Solr query for the report at this time. As we discussed yesterday, there is an issue with how to count and map files to each format when multiple formats exist in an object, especially when there are several formats and many files in a complex object. We may need to discuss and set up some rules before regenerating the report. Could you add the rules for mapping files to formats to the spec? Thank you.
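
For illustration, a minimal sketch of how a facet-only Solr request for per-format counts could be assembled. The endpoint URL is hypothetical (the thread does not give the host or core name); `object_type_sim` is the facet field that appears in the facet page URL, and the remaining parameters are standard Solr faceting parameters. This is not the query actually used for the report.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; the real host and core name are not given in this thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/blacklight/select"


def build_format_facet_query(facet_field="object_type_sim"):
    """Return a URL asking Solr for per-format document counts only."""
    params = {
        "q": "*:*",             # match every indexed document
        "rows": 0,              # return no documents, only facet counts
        "facet": "true",
        "facet.field": facet_field,
        "facet.sort": "index",  # same ordering as the facet page URL
        "facet.limit": -1,      # do not truncate the list of formats
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_format_facet_query())
```

Facet counts of this kind count indexed documents, not files, and carry no size information, so file counts and sizes would still have to come from file-level fields.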

lsitu commented 4 years ago

@gamontoya I ran another report on Solr to include the object counts for each format as determined by file use, including curator objects. If an object contains a file whose file use starts with a format name, that format's object count increases by 1. Here is the report:

| Format | Object Count | File Count | Size (GB) |
| --- | --- | --- | --- |
| image | 72,231 | 109,024 | 2,919 |
| data | 8,383 | 10,816 | 19,341 |
| video | 1,060 | 1,592 | 11,896 |
| audio | 18,075 | 20,478 | 6,759 |
| text | 19,201 | 20,096 | 556 |

Just let me know if you need the report generated with the formats used in the damspas search results.
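
The counting rule described above (an object counts once toward a format if any of its files has a file use starting with that format) can be sketched roughly as follows. The sample records, the `image-source` style file-use values, and the use of decimal gigabytes (1 GB = 10^9 bytes) are assumptions for illustration, not details confirmed in this thread.

```python
import csv
import io
from collections import defaultdict

# Hypothetical master-file records: (object id, file use, size in bytes).
SAMPLE_FILES = [
    ("obj1", "image-source", 2_000_000_000),
    ("obj1", "image-source", 1_000_000_000),
    ("obj2", "audio-source", 500_000_000),
    ("obj3", "image-source", 3_000_000_000),
    ("obj3", "data-source", 4_000_000_000),
]


def summarize_by_format(files):
    """Return {format: (object count, file count, size in GB)} keyed by file-use prefix."""
    objects = defaultdict(set)
    file_counts = defaultdict(int)
    sizes = defaultdict(int)
    for object_id, file_use, size_bytes in files:
        fmt = file_use.split("-", 1)[0]  # "image-source" -> "image"
        objects[fmt].add(object_id)      # a set counts each object once per format
        file_counts[fmt] += 1
        sizes[fmt] += size_bytes
    return {
        fmt: (len(objects[fmt]), file_counts[fmt], sizes[fmt] / 1e9)
        for fmt in sorted(file_counts)
    }


def to_csv(summary):
    """Render the summary as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Format", "Object Count", "File Count", "Size (GB)"])
    for fmt, (n_objects, n_files, size_gb) in summary.items():
        writer.writerow([fmt, n_objects, n_files, f"{size_gb:.1f}"])
    return out.getvalue()


print(to_csv(summarize_by_format(SAMPLE_FILES)))
```

Note that an object with files in two formats (obj3 here) adds to the object count of both, so the object counts can sum to more than the number of distinct objects.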

lsitu commented 4 years ago

@gamontoya Here is the report using the formats from the damspas facet page:

| Format | Object Count | File Count | Size (GB) |
| --- | --- | --- | --- |
| cartographic | 1,071 | 54 | 10 |
| data | 7,746 | 8,151 | 14,615 |
| image | 70,537 | 103,929 | 3,691 |
| software | 5 | 26 | 27 |
| text | 26,758 | 40,649 | 6,395 |
| three dimensional object | 15 | 18 | 2 |
| video | 1,060 | 3,161 | 11,967 |
| mixed material | 5 | 17 | 1 |

The following rules are applied, in order, when counting master files toward a format:

Note: The format sizes will be different if the above rules change, especially for data, text, and software.

lsitu commented 4 years ago

@gamontoya I looked into which files were counted and reran the report. I am seeing different results, with some files counted more than once. It is hard to determine which format a file should be counted under in complex objects that have more than one format and many component files with different extensions, which produces confusing results with files counted more than once. For example, object https://library.ucsd.edu/dc/object/bb6520310z has three formats (text, image, and data) and several .pdf files; should we map all PDF files to the image format? A more complex case is object https://library.ucsd.edu/dc/object/bb11995109, with four formats (software, three dimensional object, data, and image) and more than 10 files with extensions like .zip, .exr, .unitypackage, .png, .x3d, etc. I think we have lots of objects like bb11995109. How should we map these files to each format? Is there a general rule for mapping master files to a format by file use or file extension? Or could we count all master files in an object toward the format that is hit by a search?
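
One way to see the extent of the double counting described above is to list every file that was tallied under more than one format. The sketch below assumes the per-format tallies are available as (file id, format) pairs; the ids and pairs shown are made-up placeholders, not real DAMS records.

```python
from collections import defaultdict

# Hypothetical (file id, format) pairs, as a per-format report might have tallied them.
COUNTED = [
    ("objA/1.pdf", "text"),
    ("objA/1.pdf", "image"),
    ("objA/2.pdf", "text"),
    ("objB/1.zip", "data"),
]


def files_counted_more_than_once(counted):
    """Return {file id: sorted formats} for files tallied under two or more formats."""
    formats_by_file = defaultdict(set)
    for file_id, fmt in counted:
        formats_by_file[file_id].add(fmt)
    return {
        file_id: sorted(formats)
        for file_id, formats in formats_by_file.items()
        if len(formats) > 1
    }


print(files_counted_more_than_once(COUNTED))
```

With the placeholder data, only `objA/1.pdf` is reported, because it was tallied under both text and image.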

lsitu commented 4 years ago

Another issue I see is that some master files in an object belong to formats other than the ones assigned to the object, and we don't know how to count them. For example, object https://library.ucsd.edu/dc/object/bb67249163 has two formats, data and video, but there are several image files in the object that we can map to neither video nor data. Also, objects with no format/typeOfResource metadata won't be hit by the format facet search, so they won't be counted in the report.

gamontoya commented 4 years ago

@lsitu For this report, just

lsitu commented 4 years ago

@gamontoya Do you mean that if there is more than one format in an object, we should just map all the master files in that object to one format, such as data if it exists? Do you want a list of those formats at the object level? We have some objects that do not have any format/typeOfResource metadata at the object level but only in components, and it differs from component to component. How should we deal with those?

gamontoya commented 4 years ago

@lsitu This report doesn't have to be perfect and it sounds like it's getting too complicated.

I think at this point I'll just go with what you reported here:

| Format | Object Count | File Count | Size (GB) |
| --- | --- | --- | --- |
| cartographic | 1,071 | 54 | 10 |
| data | 7,746 | 8,151 | 14,615 |
| image | 70,537 | 103,929 | 3,691 |
| software | 5 | 26 | 27 |
| text | 26,758 | 40,649 | 6,395 |
| three dimensional object | 15 | 18 | 2 |
| video | 1,060 | 3,161 | 11,967 |
| mixed material | 5 | 17 | 1 |

lsitu commented 4 years ago

@gamontoya Got it. But some master files are counted more than once for complex objects with more than one format. Do you want to remove those duplicate counts?

gamontoya commented 4 years ago

@lsitu Yes, you can update the counts for those complex objects for which we are only counting one format type.
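
A rough sketch of counting each master file only once by assigning every multi-format object to a single format. The priority order below is an assumption for illustration (the thread only floats "data if it exists" as a possibility and does not settle on an order), and the sample objects are made up.

```python
# Hypothetical priority order; the thread does not settle on one.
FORMAT_PRIORITY = ["data", "video", "image", "text"]


def pick_single_format(object_formats, priority=FORMAT_PRIORITY):
    """Return the highest-priority format present, or None if the object has none."""
    for fmt in priority:
        if fmt in object_formats:
            return fmt
    # Fall back to alphabetical order for formats missing from the priority list.
    return sorted(object_formats)[0] if object_formats else None


def count_files_once(objects):
    """objects maps object id -> (set of formats, number of master files)."""
    counts = {}
    for formats, n_files in objects.values():
        fmt = pick_single_format(formats)
        if fmt is None:
            continue  # objects without format metadata stay out of the tally
        counts[fmt] = counts.get(fmt, 0) + n_files
    return counts


# Made-up sample objects.
sample = {
    "objA": ({"text", "image", "data"}, 5),
    "objB": ({"image"}, 2),
    "objC": (set(), 1),
}
print(count_files_once(sample))
```

Under this assumed order, all five files of the three-format object are tallied under data, and the object without any format is skipped.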

lsitu commented 4 years ago

@gamontoya Here is the list of 1663 objects that don't have object-level format/typeOfResource metadata. Note that some are empty objects with no descriptive metadata, and some of them may have component-level format metadata: report-objects-no-formats-1663.txt