Establish a naming convention for blobs

newswim commented 1 year ago

Here are some notes from the conversations around blob naming:

While we eventually want to persist all of the files that we've ever scraped, for now it seems reasonable to just upsert the case record into a container.

To prepare for parsing on an as-needed basis, we had the idea to use a hash of the file's contents in its name (content addressability FTW!).

Here's an example naming convention that would contain some helpful metadata that we could use to compare:

[case-id]:[date]:[hash].html
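A minimal sketch of building a name in that shape. The function name `build_blob_name`, the SHA-256 hash, the ISO date format, and the sample case id are all assumptions for illustration; the thread does not settle on a hash algorithm or date format.

```python
import hashlib
from datetime import date


def build_blob_name(case_id: str, scraped_on: date, html: bytes) -> str:
    """Return '[case-id]:[date]:[hash].html' for one scraped page.

    Hypothetical helper: SHA-256 and ISO dates are assumptions,
    not decisions made in this thread.
    """
    # Hash the raw file contents so identical pages get identical hashes.
    content_hash = hashlib.sha256(html).hexdigest()
    return f"{case_id}:{scraped_on.isoformat()}:{content_hash}.html"


# Example with a made-up case id and page body.
name = build_blob_name("12-3456", date(2022, 11, 6), b"<html>case</html>")
```

Because the hash depends only on the file contents, scraping an unchanged page on a later date yields the same hash segment, which is what makes the name useful for comparison.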

🚨 Important note

Blob names can be at most 1,024 characters long.

Source: https://learn.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata#blob-names
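A small guard against that limit could look like the sketch below. `check_blob_name` is a hypothetical helper, not part of any Azure SDK. With a hex digest and a short date the names should stay far below the limit, so this is mostly a safety net.

```python
MAX_BLOB_NAME_LENGTH = 1024  # limit from the Azure naming docs linked above


def check_blob_name(name: str) -> str:
    """Return the name unchanged, or raise if its length is out of range."""
    # Blob names must be between 1 and 1,024 characters long.
    if not 1 <= len(name) <= MAX_BLOB_NAME_LENGTH:
        raise ValueError(
            f"blob name is {len(name)} characters; "
            f"must be between 1 and {MAX_BLOB_NAME_LENGTH}"
        )
    return name
```

Calling it right before upload keeps an oversized name from reaching the storage API.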

newswim commented 1 year ago

A relevant Twitter thread:

https://twitter.com/rtfeldman/status/1589061017031823360?s=46&t=cMS9ovlbxGml4vbiOjrawg

newswim commented 1 year ago

Some interesting hashing benchmarks (from that thread)

https://cyan4973.github.io/xxHash

normaljosh commented 1 year ago

One thing to note is that, theoretically, two counties could have the same case id, so I believe we discussed [case-id]:[county]:[date]:[hash].html as an option too.
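To illustrate how the county variant could be used for comparison, here is a rough sketch that splits a name back into its parts and checks whether freshly scraped contents differ from what is already stored. `BlobNameParts`, `parse_blob_name`, and `content_changed` are hypothetical names, and SHA-256 is an assumed hash; it also assumes the county, date, and hash segments never contain a colon.

```python
import hashlib
from typing import NamedTuple


class BlobNameParts(NamedTuple):
    case_id: str
    county: str
    date: str
    content_hash: str


def parse_blob_name(name: str) -> BlobNameParts:
    """Split '[case-id]:[county]:[date]:[hash].html' into its parts."""
    stem = name.removesuffix(".html")  # requires Python 3.9+
    # Split from the right so a case id that contains ':' stays intact.
    case_id, county, date_str, content_hash = stem.rsplit(":", 3)
    return BlobNameParts(case_id, county, date_str, content_hash)


def content_changed(existing_name: str, new_html: bytes) -> bool:
    """True if the new contents hash differently from the stored blob."""
    new_hash = hashlib.sha256(new_html).hexdigest()
    return parse_blob_name(existing_name).content_hash != new_hash
```

A scraper could call `content_changed` before uploading and skip the upsert (and any re-parse) when it returns False.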

normaljosh commented 1 year ago

Implemented this in #11. Just to be sure, by case-id here we're referring to the Odyssey case id, and not the case id assigned by the courts? To use the latter, we'd have to parse more as part of the scraper (though we're already hashing the data).

normaljosh commented 1 year ago

Changed to use court case id in 8d47f4c

open-austin / azure-indigent-defense

Establish a naming convention for blobs #5
