Closed kieranjol closed 4 years ago
so I have a number of questions. here are the first two: is the example above 'and here's what manifest_normalise.py creates - a file called MV8775.md5_modified_manifest.md5 which contains this info' correct? from what i think i understand the asterisk and the space should be removed?
are you coming across this particular externally created manifest pattern a lot? by 'pattern' i mean the asterisk and the space; by 'a lot' i mean from others apart from specialist av? I remember we had a doozy of a time trying to extract the first screen scene manifests we received. those ones were too complicated to even consider here - but are there other complicated patterns that you have come across?
so here's my last question for the moment: i don't understand what you mean here: 'Now validate and batchvalidate would throw up errors each time due to the naming of our modified manifest'
could you be more specific about why it's throwing up errors from the naming of the modified manifest? is the problem with the file path and file naming of the individual file/s listed in the manifest or the name of the manifest itself?
and either way, - i'm not sure i understand why it would throw up errors if the newly created manifest now matches the pattern of our internally created manifests. i'm sure these are all easily explainable, - i'm just blanking, sorry
so I have a number of questions. here are the first two: is the example above 'and here's what manifest_normalise.py creates - a file called MV8775.md5_modified_manifest.md5 which contains this info' correct? from what i think i understand the asterisk and the space should be removed?
Hi - yes it makes that file. And the asterisk doesn't really need to be removed. Validate ignores these, and the asterisk is significant in manifests as far as i know - pretty sure other manifest tools use them, maybe it signifies that they're binary files or something? Anyhow our tools assume (like bagit and md5sum) that it's CHECKSUM(SPACE SPACE) filepath. So the file path is in the correct location here even if there's a star. Not sure what you mean by the spacing, but all looks good to me anyhow - maybe it's the newline? That's to be expected.
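For reference, here is a minimal sketch of reading that line layout. It is not the code in validate.py; the function name `parse_manifest_line` and the regular expression are made up for illustration. It assumes md5sum-style lines: 32 hex characters, a space, then either a second space or an asterisk (GNU md5sum writes the asterisk when it reads a file in binary mode), then the file path.

```python
import re

# 32 hex digits, one space, a mode marker (' ' or '*'), then the file path.
# This pattern is an illustrative assumption, not taken from validate.py.
LINE_RE = re.compile(r"^([0-9a-fA-F]{32}) ([ *])(.+)$")


def parse_manifest_line(line):
    """Return (checksum, path) for an md5sum-style line, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    checksum, _mode, path = match.groups()
    # The mode marker is discarded, so 'space space' and 'space asterisk' parse alike.
    return checksum.lower(), path
```

Either separator puts the path in the same place, which is why the star can stay in the normalised manifest.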
are you coming across this particular externally created manifest pattern a lot? by 'pattern' i mean the asterisk and the space; by 'a lot' i mean from others apart from specialist av?
The asterisk is explained above hopefully. Normally we see space space, but asterisk space is fine. And ffmpeg uses such a style where there's preceding #hashes, and exactfile does something similar too along with teracopy. They actually use semicolons instead, so i should add that support and then we can support those as well.
I remember we had a doozy of a time trying to extract the first screen scene manifests we received. those ones were too complicated to even consider here - but are there other complicated patterns that you have come across?
There generally seems to be a pattern - extra data at the top that has hashes or semicolons, and blank lines - so this script will focus on that.
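A rough sketch of that clean-up, for anyone following along. This is not the actual manifest_normalise.py; the function names and the `comment_prefixes` parameter are hypothetical. It drops lines that start with a comment prefix or contain only whitespace, and writes the result next to the original using the _modified_manifest.md5 suffix from this PR, leaving the vendor file untouched.

```python
from pathlib import Path


def normalise_manifest_text(text, comment_prefixes=("#",)):
    """Drop lines starting with a comment prefix and lines that are only whitespace."""
    kept = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith(comment_prefixes)
    ]
    return "\n".join(kept) + "\n" if kept else ""


def write_normalised_sidecar(manifest_path):
    """Write the cleaned manifest beside the original; the original is never modified."""
    manifest_path = Path(manifest_path)
    # e.g. MV8775.md5 -> MV8775.md5_modified_manifest.md5
    sidecar = manifest_path.with_name(manifest_path.name + "_modified_manifest.md5")
    sidecar.write_text(normalise_manifest_text(manifest_path.read_text()))
    return sidecar
```

In this sketch, passing `comment_prefixes=("#", ";")` would also cover the semicolon style mentioned above.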
so here's my last question for the moment: i don't understand what you mean here: 'Now validate and batchvalidate would throw up errors each time due to the naming of our modified manifest'
could you be more specific about why it's throwing up errors from the naming of the modified manifest? is the problem with the file path and file naming of the individual file/s listed in the manifest or the name of the manifest itself?
The issue here is that validate is expecting a file or folder called blablabla_modified to exist, which is not the case. In the above case, it is expecting something like MV8777.dv_modified to exist, and it's throwing up that initial error. But then if you ignore that, it just looks into the manifest, and does a fixity check on the files in there, and it should give a success.
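The fixity check itself is roughly the following. This is a sketch and not validate.py's code; `md5_of_file` and `check_manifest` are illustrative names, and it assumes the paths in the manifest are relative to the folder holding the manifest.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large video files are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_path):
    """Return a list of (listed_path, status); status is 'ok', 'mismatch' or 'missing'."""
    manifest_path = Path(manifest_path)
    results = []
    for line in manifest_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, _, rest = line.partition(" ")
        listed = rest[1:]  # skip the second space or the asterisk
        # Assumption: listed paths are relative to the manifest's own folder.
        target = manifest_path.parent / listed
        if not target.is_file():
            status = "missing"
        elif md5_of_file(target) == expected.lower():
            status = "ok"
        else:
            status = "mismatch"
        results.append((listed, status))
    return results
```

The point is that the check only depends on the lines inside the manifest, so it can still succeed when the manifest's own name does not match a file or folder.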
and either way, - i'm not sure i understand why it would throw up errors if the newly created manifest now matches the pattern of our internally created manifests.
Yeah, the actual contents of the manifests are perfect, all is well there, it's more that the new modified manifest isn't a sidecar anymore cos it has 'modified' stuck in there.
i'm sure these are all easily explainable, - i'm just blanking, sorry
Seeing as we just validate vendor checksums rather than incorporate them into sips, i think we can handle these issues for the overall benefits of streamlining the workflow, without touching the original files provided by the vendor.
There's a bunch of different ways around this but i don't want to touch the original manifest if possible or rename it or anything like that, just in case something goes wrong..
thanks for explaining that to me. all makes sense and sounds good. merge away.
would you be in favour now of including whole file manifests supplied by the vendor in the sips? i don't mind, i just don't think it is being done in the current ballymun spav workflow
Cheers for the review - as for the checksums - I think that as long as we can verify that the checksums in the sip match up to the vendor one then we're grand. I think i may have a script that does this already. Maybe going forward we should add them in as -supplements? Could be handy to have the info - also i think if they were in supplements, it would probably be easier to verify that the sip checksum for the video file is the same as in the vendor manifest? It's probably a good idea to shove it in there. Main concern is consistency for this project - we'd be changing it up, but I think -supplement is generally under the radar and doesn't affect stuff much.
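The comparison could look something like this. It's a sketch only and not the existing script mentioned above; `load_checksums` and `compare_to_vendor` are made-up names. It assumes both manifests are md5sum-style and that matching on the bare filename is good enough, which only holds if filenames are unique within the sip.

```python
import os


def load_checksums(manifest_text):
    """Map each listed file's basename to its lowercased checksum."""
    checksums = {}
    for line in manifest_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        checksum, _, rest = line.partition(" ")
        # Skip the second space or asterisk, and tolerate Windows-style separators.
        name = os.path.basename(rest[1:].replace("\\", "/"))
        checksums[name] = checksum.lower()
    return checksums


def compare_to_vendor(sip_text, vendor_text):
    """Return {filename: bool} for every vendor-listed file that the sip manifest also lists."""
    sip = load_checksums(sip_text)
    vendor = load_checksums(vendor_text)
    return {name: sip[name] == digest for name, digest in vendor.items() if name in sip}
```

Files in the vendor manifest that are absent from the sip manifest are left out of the result here; a real check would probably want to report those as well.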
Sorry for the long winded response but i think in summation - maybe we should put them in?
To be added - docs for new script.
Ok so we get manifests from vendors in a variety of formats and we have to transform them. This PR aims to ease that process with:
- manifest_normalise.py to create a normalised version of the manifest
- changes to validate.py and batchaccession.py to override potential errors
- batchvalidate for files received from our vendor
Regarding the first point - manifest_normalise.py will find all md5 manifests, look through them and remove any lines starting with # or that just contain whitespace - these seem to be frequent culprits when vendors use GUI tools to make manifests. It then saves them as a sidecar to the existing manifest with a _modified_manifest.md5 suffix. The original vendor md5 is left intact. The _manifest.md5 ending is important as it causes batchvalidate.py to pick up on the manifest.
So here's what MV8865.md5 looks like:
and here's what manifest_normalise.py creates - a file called MV8775.md5_modified_manifest.md5 which contains this info:
Here's what the output looks like:
Now validate and batchvalidate would throw up errors each time due to the naming of our modified manifest - it mistakenly thinks that there are files missing and it gives you that Y/N prompt to continue. So I added the -y option to override this so you can do a proper batch job. Here's what batchvalidate (previously only used for LTO fixity checks) looks like when using this workflow - it's kinda messy due to the errors and also my generally terrible reporting - but you see the successes and failures: