Closed (newswim closed this 1 year ago)
A relevant Twitter thread:
https://twitter.com/rtfeldman/status/1589061017031823360?s=46&t=cMS9ovlbxGml4vbiOjrawg
Some interesting hashing benchmarks (from that thread)
One thing to note: in theory, two counties could have the same case id, so I believe we also discussed [case-id]:[county]:[date]:[hash].html as an option.
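For anyone reading names back out of the container, here is a rough sketch of a parser that accepts both the three-part form and the four-part form with a county. The `BlobName` type and `parse_blob_name` function are hypothetical names, not part of the scraper, and the sketch assumes that no individual field contains a colon.

```python
from typing import NamedTuple, Optional


class BlobName(NamedTuple):
    case_id: str
    county: Optional[str]  # None when the name uses the three-part form
    date: str
    content_hash: str


def parse_blob_name(name: str) -> BlobName:
    """Split a blob name of either form into its fields.

    Assumption: fields never contain ':' themselves.
    """
    stem, dot, ext = name.rpartition(".")
    if not dot or ext != "html":
        raise ValueError(f"expected a .html blob name, got {name!r}")

    parts = stem.split(":")
    if len(parts) == 3:
        case_id, date, content_hash = parts
        return BlobName(case_id, None, date, content_hash)
    if len(parts) == 4:
        case_id, county, date, content_hash = parts
        return BlobName(case_id, county, date, content_hash)
    raise ValueError(f"unexpected number of fields in {name!r}")
```

With the county segment present, two records that share a case id can be told apart by `county` without opening the file.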
Implemented this in #11. Just to be sure, by case-id here we're referring to the Odyssey case id, and not the case id assigned by the courts? To use the latter, we'd have to parse more as part of the scraper (though we're already hashing the data).
Changed to use court case id in 8d47f4c
Here are some notes from the conversations around blob naming:
While we eventually want to persist all of the files that we've ever scraped, for now it seems reasonable to just upsert the case record into a container.
To prepare for parsing on an as-needed basis, we had the idea of including a hash of the file's contents in its name (addressability FTW!).
Here's an example naming convention that would contain some helpful metadata that we could use to compare:
[case-id]:[date]:[hash].html
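A minimal sketch of how a name in this shape could be built. The function names are hypothetical, and two details are assumptions rather than anything decided in this thread: SHA-256 as the hash and an ISO 8601 date for the `[date]` segment.

```python
import hashlib
from datetime import date


def content_hash(html: str) -> str:
    """Hex digest of the page contents (SHA-256 is an assumed choice)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def blob_name(case_id: str, scraped_on: date, html: str) -> str:
    """Build a name of the form [case-id]:[date]:[hash].html."""
    return f"{case_id}:{scraped_on.isoformat()}:{content_hash(html)}.html"
```

Because the hash depends only on the contents, scraping an unchanged page on the same day yields the same name, so the upload overwrites rather than duplicates. Changed contents produce a different name.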
🚨 Important note
Blob names can be at most 1,024 characters long.
Source: https://learn.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata#blob-names
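A small guard along these lines could run before upload. It is only a sketch: `check_blob_name` is a hypothetical helper, and it counts Python string characters, which is assumed to match how the service counts them.

```python
MAX_BLOB_NAME_CHARS = 1024  # limit from the naming docs linked above


def check_blob_name(name: str) -> str:
    """Return the name unchanged, or raise if its length is out of range."""
    if not 1 <= len(name) <= MAX_BLOB_NAME_CHARS:
        raise ValueError(
            f"blob name must be 1 to {MAX_BLOB_NAME_CHARS} characters, "
            f"got {len(name)}"
        )
    return name
```

A case id, a date, and a hex digest come nowhere near the limit, so this mostly protects against an unexpectedly long field being interpolated into the name.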