Closed tarakc02 closed 1 year ago
How should we begin this step? I expect to be set up and to have figured out how to work remotely by tomorrow, now that I'm on Eleanor and able to pull data.
we can focus for now on the basic requirement, and ignore the note about "it would be nice ...". So basically we want to output a dataframe with three columns:
index-files/input/wrongful-convictions-docs/Abc_Def_XXXX/.../filename.ext
sha1sum path/to/filename.ext
Since version 3.11, Python's hashlib includes a convenient file_digest
function. On Eleanor, you should be able to activate the ipno-exonerations
conda environment, which has Python 3.11 set up. Alternatively, this stack overflow page looks like it has examples.
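For reference, here is a rough sketch of what the hashing step could look like. The function names (`sha1_of_file`, `index_files`) and the row keys (`filepath`, `filesha1`) are placeholders I made up, not something already in the repo, and it only covers the path and hash. It uses `hashlib.file_digest` when available and falls back to reading chunks by hand on older Pythons.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path):
    """Return the hex SHA-1 digest of the file at `path`."""
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            # Python 3.11+: hashlib reads the file in chunks for us.
            return hashlib.file_digest(f, "sha1").hexdigest()
        # Older Pythons: feed fixed-size chunks to the hash manually.
        h = hashlib.sha1()
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
        return h.hexdigest()


def index_files(root):
    """Return one row (dict) per regular file found under `root`.

    The keys `filepath` and `filesha1` are hypothetical column names.
    """
    rows = []
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            rows.append({"filepath": str(p), "filesha1": sha1_of_file(p)})
    return rows
```

The list of dicts can be passed straight to `pandas.DataFrame(rows)` if we want a dataframe at the end.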
using sha1 hash. since there's so much data, this will take a few minutes to run. it would be nice to not have to rehash unchanged files if/when there are new files and we have to update the task.
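One possible way to handle the "nice to have" later: keep a small JSON cache next to the task and only rehash files whose size or modification time changed. This is just a sketch under that assumption. `update_index`, the cache layout, and the size/mtime check are all my own invention, and a file that changes without touching its size or mtime would be missed.

```python
import hashlib
import json
from pathlib import Path


def sha1_of_file(path):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def update_index(root, cache_path):
    """Rehash only new or changed files under `root`.

    Returns (index, n_rehashed). `cache_path` is a hypothetical JSON file
    that should live outside `root` so it does not get indexed itself.
    """
    cache_file = Path(cache_path)
    old = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    new, n_rehashed = {}, 0
    for p in sorted(Path(root).rglob("*")):
        if not p.is_file():
            continue
        st = p.stat()
        prev = old.get(str(p))
        unchanged = (
            prev is not None
            and prev["size"] == st.st_size
            and prev["mtime_ns"] == st.st_mtime_ns
        )
        if unchanged:
            # Assumption: same size and mtime means same content.
            new[str(p)] = prev
        else:
            new[str(p)] = {
                "size": st.st_size,
                "mtime_ns": st.st_mtime_ns,
                "sha1": sha1_of_file(p),
            }
            n_rehashed += 1
    cache_file.write_text(json.dumps(new, indent=2))
    return new, n_rehashed
```

Files that disappear from `root` simply drop out of the cache on the next run, since the new cache is rebuilt from what is currently on disk.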