To do more analysis (e.g. grabbing descriptive parentheticals) on our case law, we need to introduce opinion text into our database.
One consideration: CourtListener's data includes opinion text in HTML form. Should we scrub it then store it, or leave the formatting and scrub as necessary when pulling from the database?
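Storing a scrubbed version would help with performance and searching, since nothing has to be stripped at query time.
Keeping the HTML version, on the other hand, leaves more options open: the formatting is still there if a later analysis needs it.
Storing both is also an option, but the storage cost is pretty significant.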
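
For reference, here is a minimal sketch of what "scrubbing" could look like, using only Python's standard-library `html.parser`. The names `scrub_html`, `_TextExtractor`, and `_BLOCK_TAGS` are made up for illustration, and this assumes we only want the visible text with whitespace collapsed. It has not been tried against real CourtListener HTML, so treat it as a starting point rather than the actual approach.

```python
from html.parser import HTMLParser

# Hypothetical set of tags that should act as word/paragraph boundaries,
# so that "<p>a</p><p>b</p>" becomes "a b" rather than "ab".
_BLOCK_TAGS = {"p", "div", "br", "li", "blockquote", "pre",
               "h1", "h2", "h3", "h4", "h5", "h6", "tr", "td"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops the markup around them."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def scrub_html(html):
    """Return the plain text of an HTML fragment (illustrative helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    sample = "<p>See <i>Smith</i> (holding that notice &amp; hearing are required).</p>"
    print(scrub_html(sample))
```

If we scrub before storing, something like this would run once at import time; if we keep the HTML, the same function would run on each read, which is where the performance trade-off above comes from.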