Open carakey opened 4 months ago
Hi @carakey, after looking through the ticket and Scholars Archive, I have some questions I'm hoping you can clarify for me.
1. Would you like the report to work like the fixity report: a job that runs once a year and sends out an email with all the information listed above?
2. In addition to the emailed report, would you like a button that gives you the same report whenever you want to access it?
3. Do you know where we store the archive/container files and the torrent files on SA?
Thank you!
Hi @lamtu1. The first option, an email report once a year, would be enough for my needs. I would love to have the second option available (the button) but expect that's a lot more work. Ryan O can answer the location question if @CGillen can't - I don't know the directory info. Container files like ZIPs are uploaded to SA like any other file and stored in the same location, but I understand the torrents are on a separate server.
Thank you for the information, @carakey. I will ask around, look into it, and set up a rake task to email the report on those files.
(Following up from Slack)
The 2023 inventory turned up the following numbers by MIME type for torrent and archive files. Current IDs should be retrievable from Solr with this info.
Torrent files by mime_type_ssi, May 2023 counts:
Archive/Container files by mime_type_ssi, May 2023 counts:
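The Solr lookup mentioned in this comment could be sketched roughly as follows. This is only an illustration: the endpoint URL, the `solr_ids_url` helper, and the `rows` default are assumptions, not details from this thread; only the `mime_type_ssi` field and the `x-bittorrent` type appear here.

```ruby
require 'uri'

# Hypothetical Solr select endpoint; the real host and core are not given in this thread.
SOLR_SELECT_URL = 'http://localhost:8983/solr/hydra/select'

# Build a Solr select URL that asks only for the ids of documents
# whose mime_type_ssi field matches the given MIME type.
def solr_ids_url(mime_type, rows: 1000)
  params = {
    'q' => %(mime_type_ssi:"#{mime_type}"),
    'fl' => 'id',
    'rows' => rows,
    'wt' => 'json'
  }
  "#{SOLR_SELECT_URL}?#{URI.encode_www_form(params)}"
end

puts solr_ids_url('application/x-bittorrent')
```

Fetching that URL and reading the `id` values from the JSON response would give the current list of filesets for one MIME type.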
Hi @carakey, I have one more question to follow up on. When you said "I'm looking for basic characterization info: file name, format, size," does this mean you want the name, format, and size of each file inside, for example, a zip file? Below is an example:
test.zip has 2 items
-> File Name: a.pdf, format: PDF, size: 1KB
-> File Name: b.png, format: PNG, size: 800KB
@lamtu1 Yes, exactly.
@carakey I have been working on the ticket and have run into a few constraints. One is the 7z-compressed files: there is a gem that can extract and read them, but with the way Rails is set up in SA@OSU, it cannot be installed in the Dockerfile in our repo in a way that works and passes the tests. I have been looking for another gem, but there aren't many Ruby gems that can extract 7z-compressed files.
Another constraint is x-bittorrent: I have not found a gem or another way in Rails to extract the data and read what is inside the file. I was able to extract the other formats, but these constraints make it difficult to continue at the moment. I wanted to let you know and ask how you would like to proceed.
@lamtu1 I would say the 7z's are probably not worth spending more time on -- there are only 2 files. If you've had success with the zip, gzip, and tar files, that's really helpful on its own.
We can icebox the torrent files piece. I'd like to have a larger conversation about torrents with the research data services folks later this summer or fall.
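For the gzip and tar case, here is a minimal sketch of how the per-file name, format, and size could be read using only Ruby's standard library. It is not the code from this ticket. The `list_tar_gz_entries` helper is a made-up name, and taking the format from the file extension is an assumption; a file with no extension, such as LICENSE, gets an empty format here.

```ruby
require 'rubygems/package'
require 'zlib'
require 'stringio'

# Return one hash per regular file found in a gzip-compressed tar stream.
# `io` is any readable IO positioned at the start of the .tar.gz data.
def list_tar_gz_entries(io)
  entries = []
  Zlib::GzipReader.wrap(io) do |gz|
    Gem::Package::TarReader.new(gz) do |tar|
      tar.each do |entry|
        next unless entry.file? # skip directories and other non-file entries

        name = entry.full_name
        entries << {
          filename: name,
          # Assumption: the "format" is just the lowercased extension.
          format: File.extname(name).delete_prefix('.').downcase,
          size_in_bytes: entry.header.size
        }
      end
    end
  end
  entries
end

# Usage (hypothetical path):
#   File.open('example.tar.gz', 'rb') { |f| p list_tar_gz_entries(f) }
```

Reading sizes from the tar headers avoids extracting the files to disk, which matters for large supplemental datasets.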
How do we QA this one? It looks like it's set up to send out an email report. Can we do a test?
I'm not sure whether we can run the rake task on staging. If we can, we can run it there to test it. Then, once the deposit period is over, we can push it to production and run it there against all of SA.
QA:
Email:
Data:
The File 'python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba.zip' contains total of 6 file(s)
File name: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/.gitmodules [Format: gitmodules] - Byte sizes: 99
File name: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/LICENSE [Format: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/license] - Byte sizes: 1076
File name: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/README.md [Format: md] - Byte sizes: 56
File name: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/parallel.py [Format: py] - Byte sizes: 5714
Format:
fileset_pid,container_filename,files_in_container,filename,format,size_in_bytes
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,FiguresS1-S6.pdf,pdf,3076400
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,MovieS1.mov,mov,1463954
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,MovieS2.mov,mov,980957
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,MovieS3.mov,mov,657601
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,TableS1.xlsx,xlsx,53497
QA:
Email -
Data:
Format:
If we can get a header row into the CSV, we can pass QA and move forward.
@carakey would the header just be like this?
fileset_pid,container_filename,files_in_container,filename,format,size_in_bytes
@lamtu1 Yes exactly
It looks like the header row is being added for every fileset. Can we just have it print out once at the top of the CSV file? Ultimately I'd like to view and work with the report data as a spreadsheet.
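One way to get a single header row is to write it before looping over the filesets, not inside the loop. The sketch below shows the idea with Ruby's `csv` library. The `build_inventory_csv` helper and the shape of its input hashes are assumptions for illustration; only the column names come from this thread.

```ruby
require 'csv'

HEADER = %w[fileset_pid container_filename files_in_container
            filename format size_in_bytes].freeze

# `containers` is an array of hashes (hypothetical shape):
#   { pid: '...', container_filename: '...', entries: [{ filename:, format:, size_in_bytes: }] }
def build_inventory_csv(containers)
  CSV.generate do |csv|
    csv << HEADER # written once, before any fileset rows
    containers.each do |container|
      entries = container[:entries]
      entries.each do |entry|
        csv << [
          container[:pid],
          container[:container_filename],
          entries.length,
          entry[:filename],
          entry[:format],
          entry[:size_in_bytes]
        ]
      end
    end
  end
end
```

Using the CSV library instead of joining strings by hand also keeps file names that contain commas or quotes from breaking the columns when the report is opened as a spreadsheet.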
QA pass
For the file inventory that happens with the annual Preservation Assessment process, it would be great to have information about repository contents that I'm not able to get with the other filesets from Solr. These include the contents of archive/container files and the torrent files that are stored separately. I'm looking for basic characterization info: file name, format, size.
This is something I want to pull once a year. This could be either a workflow + instructions + access to retrieve the information at my convenience, OR a list of the information as of X date.
The 2024 preservation assessment happens the last week of April. I can furnish a list of PIDs for container or torrent files if it helps.