kieranjol / IFIscripts

Detailed documentation is available here: http://ifiscripts.readthedocs.io/en/latest/index.html
MIT License

Manifest transformation and batchvalidations #383

Closed kieranjol closed 4 years ago

kieranjol commented 4 years ago

To be added - docs for new script.

Ok so we get manifests from vendors in a variety of formats and we have to transform them. This PR aims to ease that process by:

Regarding the first point - manifest_normalise.py will find all md5 manifests, look through them and remove any lines that start with # or that contain only whitespace - these seem to be frequent culprits when vendors use GUI tools to make manifests. It then saves the result as a sidecar to the existing manifest with a _modified_manifest.md5 suffix. The original vendor md5 is left intact. The _manifest.md5 ending is important as it causes batchvalidate.py to pick up on the manifest.

So here's what MV8776.md5 looks like:

# MD5 checksums generated by MD5summer (http://www.md5summer.org)
# Generated 11/06/2020 13:29:54

7b694dd10a21441def3d01efe399ab84 *MV8776.dv

and here's what manifest_normalise.py creates - a file called MV8775.md5_modified_manifest.md5 which contains this info:

7b694dd10a21441def3d01efe399ab84 *MV8776.dv
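The normalisation step described above can be sketched roughly as follows. This is not the code from manifest_normalise.py, only a minimal illustration that assumes a UTF-8 text manifest; the function name `normalise_manifest` and its `suffix` parameter are made up for the example.

```python
import os


def normalise_manifest(manifest_path, suffix="_modified_manifest.md5"):
    """Write a cleaned sidecar copy of manifest_path and return its path.

    Hypothetical helper: the original vendor manifest is only read,
    never modified or renamed.
    """
    with open(manifest_path, "r", encoding="utf-8") as source:
        lines = source.readlines()
    # Drop lines that start with '#' and lines that are empty or whitespace-only.
    kept = [line for line in lines if line.strip() and not line.startswith("#")]
    sidecar_path = manifest_path + suffix
    with open(sidecar_path, "w", encoding="utf-8") as sidecar:
        sidecar.writelines(kept)
    return sidecar_path
```

Calling `normalise_manifest` on a file named `MV8776.md5` would write `MV8776.md5_modified_manifest.md5` next to it, containing only the checksum lines.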

Here's what the output looks like:

C:\Users\kiera\ifigit\ifiscripts>manifest_normalise.py  E:\DV_batches\IFI_Batch_10
2020-06-18 09:37:12,349 - changing MV8776.md5
2020-06-18 09:37:12,350 - changing MV8805.md5
2020-06-18 09:37:12,352 - changing MV8806.md5
2020-06-18 09:37:12,353 - changing MV8807.md5
2020-06-18 09:37:12,355 - changing MV8808.md5
2020-06-18 09:37:12,356 - changing MV8809.md5
2020-06-18 09:37:12,358 - changing MV8810.md5
2020-06-18 09:37:12,359 - changing MV8811.md5
2020-06-18 09:37:12,360 - changing MV8905.md5
2020-06-18 09:37:12,361 - changing MV8906.md5
2020-06-18 09:37:12,362 - changing MV8907.md5
2020-06-18 09:37:12,363 - changing MV8908.md5
2020-06-18 09:37:12,365 - changing MV8909.md5
2020-06-18 09:37:12,366 - changing MV8910.md5
2020-06-18 09:37:12,367 - changing MV8985.md5
2020-06-18 09:37:12,368 - changing MV8986.md5

Now validate and batchvalidate would throw up errors each time due to the naming of our modified manifest - validate mistakenly thinks that there are files missing and gives you that Y/N prompt to continue. So I added the -y option to override this so you can do a proper batch job. Here's what batchvalidate (previously only used for LTO fixity checks) looks like when using this workflow - it's kinda messy due to the errors and also my generally terrible reporting - but you can see the successes and failures:

batchvalidate.py -y E:\DV_batches\IFI_Batch_10
E:\DV_batches\IFI_Batch_10\mv8776\MV8776.md5_modified_manifest.md5
Changing directory to C:\Users\kieran.oleary\ifigit\ifiscripts to extract script version`
[]
 - There is mismatch between your file count and the manifest file count
 - checking which files are different
All files present
Validating MV8776.dv
MV8776.dv has validated
All checksums have validated
['E:\\DV_batches\\IFI_Batch_10\\mv8776', 'success']
E:\DV_batches\IFI_Batch_10\MV8805\MV8805.md5_modified_manifest.md5
Changing directory to C:\Users\kieran.oleary\ifigit\ifiscripts to extract script version`
[]
 - There is mismatch between your file count and the manifest file count
 - checking which files are different
All files present
Validating MV8805.m2t
MV8805.m2t has validated
All checksums have validated
['E:\\DV_batches\\IFI_Batch_10\\mv8776', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8805', 'success']
E:\DV_batches\IFI_Batch_10\MV8806\MV8806.md5_modified_manifest.md5
Changing directory to C:\Users\kieran.oleary\ifigit\ifiscripts to extract script version`
[]
 - There is mismatch between your file count and the manifest file count
 - checking which files are different
All files present
Validating MV8806.dv
MV8806.dv has validated
All checksums have validated
['E:\\DV_batches\\IFI_Batch_10\\mv8776', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8805', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8806', 'success']
E:\DV_batches\IFI_Batch_10\MV8807\MV8807.md5_modified_manifest.md5
Changing directory to C:\Users\kieran.oleary\ifigit\ifiscripts to extract script version`
[]
 - There is mismatch between your file count and the manifest file count
 - checking which files are different
All files present
Validating MV8807.m2t
MV8807.m2t has validated
All checksums have validated
['E:\\DV_batches\\IFI_Batch_10\\mv8776', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8805', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8806', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8807', 'success']
E:\DV_batches\IFI_Batch_10\MV8808\MV8808.md5_modified_manifest.md5
Changing directory to C:\Users\kieran.oleary\ifigit\ifiscripts to extract script version`
[]
 - There is mismatch between your file count and the manifest file count
 - checking which files are different
All files present
Validating MV8808.dv
MV8808.dv has validated
All checksums have validated
['E:\\DV_batches\\IFI_Batch_10\\mv8776', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8805', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8806', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8807', 'success']
['E:\\DV_batches\\IFI_Batch_10\\MV8808', 'success']
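The per-file check reported in the log above (hash each file listed in the manifest and compare it with the recorded checksum) can be sketched as below. This is an illustration only, not the logic of validate.py or batchvalidate.py; `md5_of_file` and `validate_manifest` are hypothetical names, and the sketch assumes a 32-character md5 followed by two separator characters and a path relative to the manifest's folder.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large video files are not loaded into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(manifest_path):
    """Return {relative_path: bool}; False covers both missing and mismatched files."""
    base_dir = os.path.dirname(os.path.abspath(manifest_path))
    results = {}
    with open(manifest_path, "r", encoding="utf-8") as manifest:
        for line in manifest:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue
            # Assumed layout: 32 hex characters, two separator characters, then the path.
            checksum, relative_path = line[:32], line[34:]
            target = os.path.join(base_dir, relative_path)
            results[relative_path] = (
                os.path.isfile(target) and md5_of_file(target) == checksum.lower()
            )
    return results
```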

raecasey commented 4 years ago

so I have a number of questions. here are the first two: is the example above 'and here's what manifest_normalise.py creates - a file called MV8775.md5_modified_manifest.md5 which contains this info' correct? from what i think i understand the asterisk and the space should be removed?

are you coming across this particular externally created manifest pattern a lot? by 'pattern' i mean the asterisk and the space; by 'a lot' i mean from others apart from specialist av? I remember we had a doozy of a time trying to extract the first screen scene manifests we received. those ones were too complicated to even consider here - but are there other complicated patterns that you have come across?

raecasey commented 4 years ago

so here's my last question for the moment: i don't understand what you mean here: 'Now validate and batchvalidate would throw up errors each time due to the naming of our modified manifest'

could you be more specific about why it's throwing up errors from the naming of the modified manifest? is the problem with the file path and file naming of the individual file/s listed in the manifest or the name of the manifest itself?

and either way, - i'm not sure i understand why it would throw up errors if the newly created manifest now matches the pattern of our internally created manifests. i'm sure these are all easily explainable, - i'm just blanking, sorry

kieranjol commented 4 years ago

> so I have a number of questions. here are the first two: is the example above 'and here's what manifest_normalise.py creates - a file called MV8775.md5_modified_manifest.md5 which contains this info' correct? from what i think i understand the asterisk and the space should be removed?

Hi - yes it makes that file. And the asterisk doesn't really need to be removed. Validate ignores these, and the asterisk is significant in manifests as far as i know - pretty sure other manifest tools use them, maybe it signifies that they're binary files or something? Anyhow our tools assume (like bagit and md5sum) that it's CHECKSUM(SPACE SPACE) filepath. So the file path is in the correct location here even if there's a star. Not sure what you mean by the spacing, but all looks good to me anyhow - maybe it's the newline? That's to be expected.
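The line layout described here can be illustrated with a small parser. In GNU md5sum output the character before the path is a space for text mode and an asterisk for binary mode, so both forms put the path at the same offset. The sketch below is an assumption about how such a line could be parsed, not the code our tools use; `MANIFEST_LINE` and `parse_manifest_line` are hypothetical names.

```python
import re

# 32 hex characters, one space, then either a second space (text mode)
# or an asterisk (binary mode), then the file path.
MANIFEST_LINE = re.compile(r"^([0-9a-fA-F]{32}) ([ *])(.+)$")


def parse_manifest_line(line):
    """Return (checksum, path) for an md5 manifest line, or None if it does not match."""
    match = MANIFEST_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    checksum, _marker, path = match.groups()
    return checksum.lower(), path
```

With this pattern, `7b694dd10a21441def3d01efe399ab84 *MV8776.dv` and the same line with two spaces both yield the path `MV8776.dv`, while a `#` header line yields `None`.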

> are you coming across this particular externally created manifest pattern a lot? by 'pattern' i mean the asterisk and the space; by 'a lot' i mean from others apart from specialist av?

The asterisk is explained above hopefully. Normally we see space space, but space asterisk is fine. And ffmpeg uses such a style where there's preceding # hashes, and exactfile does something similar too, along with teracopy. They actually use semicolons instead, so i should add that support and then we can support those as well.

> I remember we had a doozy of a time trying to extract the first screen scene manifests we received. those ones were too complicated to even consider here - but are there other complicated patterns that you have come across?

There generally seems to be a pattern - extra data at the top that has hashes or semicolons, and blank lines - so this script will focus on that.
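The semicolon support mentioned above is not in this PR yet. One way it might look, purely as a sketch with hypothetical names (`COMMENT_PREFIXES`, `is_manifest_noise`, `strip_noise`), is to treat the comment characters as a configurable tuple:

```python
# Assumed set of comment markers: '#' (seen in MD5summer output above)
# and ';' (the semicolon style mentioned for other tools).
COMMENT_PREFIXES = ("#", ";")


def is_manifest_noise(line, comment_prefixes=COMMENT_PREFIXES):
    """True for blank lines and for lines whose first non-space character is a comment marker."""
    stripped = line.strip()
    return not stripped or stripped.startswith(comment_prefixes)


def strip_noise(lines, comment_prefixes=COMMENT_PREFIXES):
    """Return only the lines that should carry checksums."""
    return [line for line in lines if not is_manifest_noise(line, comment_prefixes)]
```

Keeping the prefixes in one tuple means another vendor style could be added later without touching the filtering logic.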

kieranjol commented 4 years ago

> so here's my last question for the moment: i don't understand what you mean here: 'Now validate and batchvalidate would throw up errors each time due to the naming of our modified manifest'

> could you be more specific about why it's throwing up errors from the naming of the modified manifest? is the problem with the file path and file naming of the individual file/s listed in the manifest or the name of the manifest itself?

The issue here is that validate is expecting a file or folder called blablabla_modified to exist, which is not the case. In the above case, it is expecting something like MV8777.dv_modified to exist, and it's throwing up that initial error. But then if you ignore that, it just looks into the manifest, and does a fixity check on the files in there, and it should give a success.
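To make the naming issue concrete: if a sidecar manifest is assumed to be named after the thing it describes plus `_manifest.md5`, then the expected file or folder name is whatever is left once that ending is stripped. The sketch below only illustrates that assumption; it is not taken from validate.py, and `expected_sidecar_target` is a hypothetical name.

```python
import os

MANIFEST_SUFFIX = "_manifest.md5"


def expected_sidecar_target(manifest_path):
    """Return the file or folder path implied by a sidecar manifest name, or None."""
    directory, name = os.path.split(manifest_path)
    if not name.endswith(MANIFEST_SUFFIX):
        return None
    # Strip the sidecar ending; what remains is the name expected to exist.
    return os.path.join(directory, name[: -len(MANIFEST_SUFFIX)])
```

Under this assumption, `MV8776.md5_modified_manifest.md5` implies a target ending in `_modified`, which does not exist on disk, hence the initial error before the actual fixity check runs.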

> and either way, - i'm not sure i understand why it would throw up errors if the newly created manifest now matches the pattern of our internally created manifests.

Yeah, the actual contents of the manifests are perfect, all is well there, it's more that the new modified manifest isn't a sidecar anymore cos it has 'modified' stuck in there.

> i'm sure these are all easily explainable, - i'm just blanking, sorry

Seeing as we just validate vendor checksums rather than incorporate them into sips, i think we can handle these issues for the overall benefits of streamlining the workflow, without touching the original files provided by the vendor.

kieranjol commented 4 years ago

There's a bunch of different ways around this, but i don't want to touch the original manifest if possible, or rename it or anything like that, just in case something goes wrong.

raecasey commented 4 years ago

thanks for explaining that to me. all makes sense and sounds good. merge away.

would you be in favour now of including whole file manifests supplied by vendor to be included in the sips? i don't mind, i just don't think it is being done in the current ballymun spav workflow

kieranjol commented 4 years ago

> thanks for explaining that to me. all makes sense and sounds good. merge away.

> would you be in favour now of including whole file manifests supplied by vendor to be included in the sips? i don't mind, i just don't think it is being done in the current ballymun spav workflow

Cheers for the review - as for the checksums - I think that as long as we can verify that the checksums in the sip match up to the vendor one then we're grand. I think i may have a script that does this already. Maybe going forward we should add them in as -supplements? Could be handy to have the info - also i think if they were in supplements - it would probably be easier to verify that the sip checksum for the video file is the same as in the vendor manifest? It's probably a good idea to shove it in there, main concern is consistency for this project - we'd be changing it up, but I think -supplement is generally under the radar and doesn't affect stuff much.

Sorry for the long-winded response but i think in summation - maybe we should put them in?
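The check discussed above, that the checksum recorded in the sip for a file matches the one in the vendor manifest, could be sketched roughly like this. It is not the existing script mentioned in the comment; `load_checksums` and `vendor_mismatches` are hypothetical names, and the sketch assumes both manifests use the md5sum-style layout and that files can be matched on basename.

```python
import os


def load_checksums(manifest_path):
    """Map file basename -> md5 checksum for one manifest."""
    checksums = {}
    with open(manifest_path, "r", encoding="utf-8") as manifest:
        for line in manifest:
            line = line.rstrip("\r\n")
            if not line.strip() or line.startswith(("#", ";")):
                continue
            # Assumed layout: 32 hex characters, two separator characters, then the path.
            checksum, path = line[:32], line[34:]
            # The two manifests may record different relative paths,
            # so compare on basename only.
            name = os.path.basename(path.replace("\\", "/"))
            checksums[name] = checksum.lower()
    return checksums


def vendor_mismatches(vendor_manifest, sip_manifest):
    """Return vendor-listed basenames whose checksum is missing or different in the sip manifest."""
    vendor = load_checksums(vendor_manifest)
    sip = load_checksums(sip_manifest)
    return sorted(name for name, checksum in vendor.items() if sip.get(name) != checksum)
```

An empty list from `vendor_mismatches` would mean every file the vendor listed has the same checksum in the sip manifest. Matching on basename is a simplification that would break if two files in a package shared a name.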