Retrieve information about SA's archive and torrent files

carakey commented 4 months ago

For the file inventory that happens with the annual Preservation Assessment process, it would be great to have information about repository contents that I'm not able to get with the other filesets from Solr. These include the contents of archive/container files and the torrent files that are stored separately. I'm looking for basic characterization info: file name, format, size.

This is something I want to pull once a year. This could be either a workflow + instructions + access to retrieve the information at my convenience, OR a list of the information as of X date.

The 2024 preservation assessment happens the last week of April. I can furnish a list of PIDs for container or torrent files if it helps.

lamtu1 commented 4 months ago

Hi @carakey, after looking through the ticket and Scholars Archive, I have some questions I'm hoping you can clarify.

1. Would you want this to work like the fixity report: a task that runs once a year and sends an email with all the information listed above?

2. On top of the report, would you like a button that produces the same report as the email mentioned above, so you can get it any time you want?

3. Do you know where the archive/container files and the torrent files are stored on SA?

Thank you!

carakey commented 4 months ago

Hi @lamtu1. The first option, an email report once a year, would be enough for my needs. I would love to have the second option available (the button) but expect that's a lot more work. Ryan O can answer the location question if @CGillen can't - I don't know the directory info. Container files like ZIPs are uploaded to SA like any other file and stored in the same location, but I understand the torrents are on a separate server.

lamtu1 commented 4 months ago

Thank you for the information, @carakey. I will ask around and look into it, then set up a rake task to email the report on those files.

carakey commented 4 months ago

(Following up from Slack)

The 2023 inventory turned up the following numbers by MIME type for torrent and archive files. Current IDs should be retrievable from Solr with this info.

Torrent files by mime_type_ssi, May 2023 counts:

application/x-bittorrent - 11

Archive/Container files by mime_type_ssi, May 2023 counts:

application/zip - 549
application/x-gzip - 12
application/x-tar - 6
application/x-7z-compressed - 2
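
As a rough illustration of retrieving the current IDs, the sketch below builds Solr select parameters that filter on `mime_type_ssi`. The field name and MIME types come from the counts above; the endpoint URL, the `rows` limit, and the helper name `mime_type_query_params` are hypothetical and would need to match the real SA Solr setup.

```ruby
require 'uri'

# Build Solr select parameters matching any of the given MIME types.
# Only the field name mime_type_ssi comes from this thread; the rest is assumed.
def mime_type_query_params(mime_types, rows: 1000)
  quoted = mime_types.map { |type| %("#{type}") }.join(' OR ')
  { q: "mime_type_ssi:(#{quoted})", fl: 'id,mime_type_ssi', rows: rows, wt: 'json' }
end

# MIME types from the May 2023 archive/container counts.
archive_mime_types = %w[
  application/zip
  application/x-gzip
  application/x-tar
  application/x-7z-compressed
]

# Hypothetical endpoint: the real host and core name are not given here.
solr_select_url = 'http://localhost:8983/solr/hypothetical_core/select'

params = mime_type_query_params(archive_mime_types)
puts "#{solr_select_url}?#{URI.encode_www_form(params)}"
```

The same helper could be called with `['application/x-bittorrent']` to list the torrent filesets.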

lamtu1 commented 4 months ago

Hi @carakey, I have one more follow-up question. When you said "I'm looking for basic characterization info: file name, format, size," does this mean you want the name, format, and size of each file inside a container, for example inside a zip file? Below is an example:

test.zip has 2 items -> File Name: a.pdf, format: PDF, size: 1KB -> File Name: b.png, format: PNG, size: 800KB

carakey commented 4 months ago

@lamtu1 Yes, exactly.
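
The per-file listing agreed on above can be sketched for a tar container using only `Gem::Package::TarReader`, which ships with RubyGems. This is a minimal illustration under stated assumptions, not the rake task itself: the helper names `characterize_tar` and `sample_tar` are hypothetical, "format" is assumed to mean the file extension, and zip files would need a separate library that is not shown here.

```ruby
require 'rubygems/package'
require 'stringio'

# Return one hash per regular file in a tar stream, skipping directory entries.
# "Format" is assumed here to be the lowercased file extension.
def characterize_tar(io)
  rows = []
  Gem::Package::TarReader.new(io) do |tar|
    tar.each do |entry|
      next unless entry.file?

      name = entry.full_name
      rows << { name: name, format: File.extname(name).delete_prefix('.').downcase, size: entry.header.size }
    end
  end
  rows
end

# Build a tiny in-memory tar so the sketch runs without any real SA files.
def sample_tar
  io = StringIO.new(String.new)
  Gem::Package::TarWriter.new(io) do |tar|
    tar.mkdir('test', 0o755)
    tar.add_file_simple('test/a.pdf', 0o644, 5) { |file| file.write('12345') }
    tar.add_file_simple('test/b.png', 0o644, 4) { |file| file.write('1234') }
  end
  io.rewind
  io
end

characterize_tar(sample_tar).each do |row|
  puts "File Name: #{row[:name]}, format: #{row[:format]}, size: #{row[:size]} bytes"
end
```

For a gzipped tar, the same reader could be given a `Zlib::GzipReader` wrapped around the file instead of the raw stream.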

lamtu1 commented 1 month ago

@carakey I've been working on the ticket and am running into a few constraints. One is the 7z-compressed files: there is a Ruby gem that can extract and read them, but with the way Rails is set up in SA@OSU, it cannot be installed in the Dockerfile in our repo in a way that works and passes the tests. I have been looking for another gem, but there aren't many Ruby gems that can extract 7z-compressed files.

Another constraint is x-bittorrent: I haven't found a gem or another way in Rails to extract the data and read what is inside the file. I was able to extract the others, but with these constraints it is difficult to continue at the moment. I wanted to let you know and ask how you would like to proceed.

carakey commented 1 month ago

@lamtu1 I would say the 7z's are probably not worth spending more time on -- there are only 2 files. If you've had success with the zip, gzip, and tar files, that's really helpful on its own.

We can icebox the torrent files piece. I'd like to have a larger conversation about torrents with the research data services folks later this summer or fall.

carakey commented 1 month ago

How do we QA this one? It looks like it's set up to send out an email report. Can we do a test?

lamtu1 commented 1 month ago

I'm not sure whether we can run the rake task on staging. If we can, we can run it there to test it, and then once the deposit period is over, push it to production and run it there for all of SA.

carakey commented 3 weeks ago

QA:

Email:

The email report was successfully delivered to the SA inbox.
The email message body is confusing. Keep it simple: "The report on the Filesets Inventory of container/archive files is attached." (If there was meant to be any report data below, then it didn't come through; the only data is in the attachment.)

Data:

The information looks great overall!
The fileset PID is missing but very important.
There might be some data issues. Some of the headers say that the zip file contains more files than what is included in the listing. For example this item indicates 6 files but only 4 are listed:

The File 'python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba.zip' contains total of 6 file(s)
File name: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/.gitmodules     [Format: gitmodules] - Byte sizes: 99
File name: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/LICENSE     [Format: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/license] - Byte sizes: 1076
File name: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/README.md     [Format: md] - Byte sizes: 56
File name: python-parallel-b92727900fc8a8ed93ed174bed7b5af715487dba/parallel.py     [Format: py] - Byte sizes: 5714
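
One possible explanation for both anomalies, not confirmed in this thread, is that directory entries are included in the total and that the format is taken from the full path rather than the base name. The sketch below shows one way to guard against both. The entry names are shortened, hypothetical stand-ins, and `file_entries` and `extension_format` are illustrative helper names.

```ruby
# Keep only file entries; zip listings report directories with a trailing "/".
def file_entries(names)
  names.reject { |name| name.end_with?('/') }
end

# Derive the format from the base name only, so a file without an extension
# (LICENSE) yields an empty string instead of a path fragment. A dotfile such
# as .gitmodules falls back to its name without the leading dot.
def extension_format(name)
  base = File.basename(name)
  ext = File.extname(base).delete_prefix('.')
  ext = base.delete_prefix('.') if ext.empty? && base.start_with?('.')
  ext.downcase
end

# Hypothetical listing: 6 entries in total, 2 of which are directories.
entry_names = [
  'project/',
  'project/.gitmodules',
  'project/LICENSE',
  'project/README.md',
  'project/parallel.py',
  'project/vendor/'
]

files = file_entries(entry_names)
puts "contains total of #{files.length} file(s)"
files.each { |name| puts "#{name} [Format: #{extension_format(name)}]" }
```

Counting after the directory filter keeps the header total and the listed rows in agreement.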

Format:

The TXT format works, and yes this matches what was proposed and approved in earlier comments.
However, a CSV would be even better if not too difficult. For example:

fileset_pid,container_filename,files_in_container,filename,format,size_in_bytes
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,FiguresS1-S6.pdf,pdf,3076400
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,MovieS1.mov,mov,1463954
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,MovieS2.mov,mov,980957
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,MovieS3.mov,mov,657601
{pid},BredewegErinMolecularCellularBiologyNeurosporaCrassaExocyst_SupplementalMaterials.zip,5,TableS1.xlsx,xlsx,53497

carakey commented 2 weeks ago

QA:

Email -

Looks good, pass!

Data:

Looks good as far as I can tell, since they are not available in the staging front end
Counting issues appear to be resolved - counted the rows with spreadsheet function, got 100% matching to expected number of files
Pass (but verify on Prod with real files)

Format:

CSV is great! One small request - add a header row?

If we can get the header row in for the CSV, we can QA pass and move forward

lamtu1 commented 2 weeks ago

@carakey would the header just be like this? fileset_pid,container_filename,files_in_container,filename,format,size_in_bytes

carakey commented 2 weeks ago

@lamtu1 Yes exactly

carakey commented 1 week ago

It looks like the header row is being added for every fileset. Can we just have it print out once at the top of the CSV file? Ultimately I'd like to view and work with the report data as a spreadsheet.
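
A minimal sketch of the requested layout, using Ruby's `csv` library: the header row is written once, before the loop over filesets, and every contained file becomes one data row. The column names come from the earlier comments; the input structure, the helper names, and the sample PIDs are hypothetical placeholders.

```ruby
require 'csv'

# Column names as proposed earlier in this thread.
def inventory_headers
  %w[fileset_pid container_filename files_in_container filename format size_in_bytes]
end

# Build the whole report as one CSV string with a single header row.
# Each fileset is assumed to be a hash with :pid, :container, and :files.
def inventory_csv(filesets)
  CSV.generate do |csv|
    csv << inventory_headers # written once, before any fileset rows
    filesets.each do |fileset|
      files = fileset[:files]
      files.each do |file|
        csv << [fileset[:pid], fileset[:container], files.length, file[:name], file[:format], file[:size]]
      end
    end
  end
end

# Hypothetical sample data; the PIDs are placeholders, not real SA identifiers.
sample = [
  { pid: 'pid-1', container: 'first.zip',
    files: [{ name: 'a.pdf', format: 'pdf', size: 1024 }, { name: 'b.png', format: 'png', size: 2048 }] },
  { pid: 'pid-2', container: 'second.zip',
    files: [{ name: 'c.csv', format: 'csv', size: 10 }] }
]

puts inventory_csv(sample)
```

Because the header is emitted outside the per-fileset loop, the output opens cleanly as a spreadsheet with one header row.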

carakey commented 1 week ago

QA pass

osulp / Scholars-Archive

Retrieve information about SA's archive and torrent files #2569