mitmedialab / open-us-legal-corpus-creator


Deciding on an approach to ETL #4

Open · umarbutler opened this issue 7 months ago

umarbutler commented 7 months ago

Hey team, As discussed over Discord, before we can get started scraping and processing data, we're going to need to decide on an approach to ETL. In particular, there are two key options I see:

  1. Create a bunch of one-off scripts to scrape and process data for each data source.
  2. Create a single modular, interoperable ETL framework that can be used to scrape and process data from all data sources.

At least in the short-term, the first option, creating one-off scripts, would undoubtedly be easier. @kazimuth and I could write our scripts independently without having to worry about making them pretty. We could dump our data into different storage containers and only aggregate it all at the end. The only real downside to this approach is that it is not easily sustainable and, in fact, if we wish to regularly update our corpus, it would end up taking more work to maintain than the second option.

With the second approach, once an ETL framework had been established, it would be easy for anyone, including members of the public, to extend the corpus by adding new data sources, and it would also be very easy for us to update the corpus with new data as it arrives. This is the approach I took with the Open Australian Legal Corpus (OALC). Luckily, we would be able to reuse most of my existing codebase, so a lot of the upfront costs associated with building an ETL framework would already be taken care of. The main change that would need to occur, however, is that I would need to refactor my codebase to distribute the corpus across multiple files instead of dumping it all into a single JSON Lines file (although that worked for OALC, the US has many more jurisdictions than Australia).
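
To give a rough sense of what that restructure would look like, something like the below is what I have in mind (the file layout and field names are purely illustrative, nothing is settled):

import json
from pathlib import Path

def append_document(document: dict, corpus_dir: str = 'corpus') -> None:
    # Shard the corpus into one JSON Lines file per jurisdiction, e.g.
    # corpus/us_federal.jsonl, corpus/california.jsonl, rather than one
    # giant file for everything.
    path = Path(corpus_dir) / f"{document['jurisdiction']}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)

    with path.open('a', encoding='utf-8') as file:
        file.write(json.dumps(document) + '\n')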

I'm not entirely sure what I prefer. On the one hand, creating a bunch of one-off scripts would allow us to jump straight into the data very quickly and would save us the time of having to make everything look pretty. On the other hand, if we decide to change direction later on and want to have the corpus updated, say, every month (or perhaps we just don't want to have to rescrape the entire corpus every time we want to update amended laws and regulations), then it could be a much more painful process to integrate all those one-off scripts into a unified framework. Also, having a framework would actually speed up data collection, since we wouldn't be hammering the same server over and over and could instead distribute requests across multiple sources.
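
To illustrate that last point, the rough shape I have in mind is something like this (just a sketch; the scraper objects and their document_urls/fetch methods are hypothetical):

import asyncio

async def scrape_source(scraper) -> None:
    # Requests to any one source stay sequential, with a politeness delay,
    # so we never hammer a single server.
    for url in scraper.document_urls():
        await scraper.fetch(url)
        await asyncio.sleep(1)

async def scrape_all(scrapers) -> None:
    # Sources are scraped in parallel with one another, so the total wall
    # time is bounded by the slowest source rather than the sum of them all.
    await asyncio.gather(*(scrape_source(scraper) for scraper in scrapers))

# e.g. asyncio.run(scrape_all([SourceAScraper(), SourceBScraper()]))  # hypothetical scrapers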

I think I'm leaning towards one-off scripts at the moment but I'm keen to get your thoughts :)

kazimuth commented 7 months ago

I think at least having a shared data output method makes sense. Something like

from enum import Enum
from datetime import datetime

class DocumentType(Enum):
    PRIMARY_LEGISLATION = 'primary_legislation'
    SECONDARY_LEGISLATION = 'secondary_legislation'
    BILL = 'bill'
    DECISION = 'decision'

def insert_document(id: str, type: DocumentType, jurisdiction: str,
                    source: str, time: datetime, citation: str, url: str,
                    downloaded: datetime, text: str) -> None:
    ...

It might be worth writing the data to a SQLite file; then we can update our tables easily without having to repeatedly read and write giant JSON files. SQLite will be able to handle however much data we want to stick in it, and we can convert to JSON or other tabular formats for export.
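
Roughly something like this, building on the signature and DocumentType enum above (the table name and upsert-on-id behaviour are just placeholders):

import sqlite3
from datetime import datetime

conn = sqlite3.connect('corpus.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id TEXT PRIMARY KEY,
        type TEXT,
        jurisdiction TEXT,
        source TEXT,
        time TEXT,
        citation TEXT,
        url TEXT,
        downloaded TEXT,
        text TEXT
    )
""")

def insert_document(id: str, type: DocumentType, jurisdiction: str,
                    source: str, time: datetime, citation: str, url: str,
                    downloaded: datetime, text: str) -> None:
    # Upsert on id so re-scraping a document just updates its row in place.
    conn.execute(
        'INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
        (id, type.value, jurisdiction, source, time.isoformat(),
         citation, url, downloaded.isoformat(), text),
    )
    conn.commit()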

I'm not sure what else would need to go in the ETL framework; aside from this, I think it's fine to have one-off scripts. That way our scripts have a shared backend and we don't have to worry about a bunch of different scraping output formats.

umarbutler commented 7 months ago

> I think at least having a shared data output method makes sense.

Agreed. I didn’t explicitly mention it in my post but our scripts would have to output to a single format following the schema decided in #2.

> It might be worth writing the data to a SQLite file; then we can update our tables easily without having to repeatedly read and write giant JSON files. SQLite will be able to handle however much data we want to stick in it, and we can convert to JSON or other tabular formats for export.

Another good point! I didn’t think about the possibility of using a database.

My main concern with adapting OALC’s codebase was that I would need to figure out how to implement things like process resumption, automated data repair and stale data identification across several multi-gigabyte JSON Lines files instead of just one.

If we were to use a database instead, it would solve a lot of those concerns for me. I reckon we should go ahead with using a single ETL framework built atop a database.
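
For instance (just a sketch, with column names following the signature you suggested), resumption and staleness checks would reduce to simple queries:

import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect('corpus.db')

def already_scraped(url: str) -> bool:
    # Process resumption: skip documents we've already stored.
    return conn.execute(
        'SELECT 1 FROM documents WHERE url = ?', (url,)
    ).fetchone() is not None

def stale_urls(max_age: timedelta) -> list[str]:
    # Stale data identification: documents not re-downloaded within max_age
    # (assumes the downloaded column holds ISO 8601 timestamps, which sort
    # lexicographically).
    cutoff = (datetime.now() - max_age).isoformat()
    return [row[0] for row in conn.execute(
        'SELECT url FROM documents WHERE downloaded < ?', (cutoff,)
    )]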

Some key benefits would include:

  1. We could update individual documents in place without having to repeatedly read and rewrite giant JSON Lines files.
  2. Process resumption and stale data identification become straightforward, since we can simply query for what has and hasn’t been downloaded.
  3. We could still export to JSON Lines or other tabular formats at the end.

Let me know if you’re happy with this approach and then I can identify areas of the codebase that we can work on refactoring.

kazimuth commented 7 months ago

Sounds perfect. Let me know if there's anything you'd like me to poke at.

umarbutler commented 7 months ago

Upon further reflection, I actually think we ought to go for the former option of relying on one-off scripts instead of adapting OALC's ETL framework 😅 The reason is that it would allow us to jump straight into things without having to spend time building complex data infrastructure that we may never fully utilise or that may turn out not to fit our use cases. When I built my unified scraper package, I had already spent months working with the raw data and getting an understanding of which design patterns could be generalised across all data sources. We probably need to go through that process again, and I think using one-off scripts will be the quickest way to do that.

Later on down the track, we can clean up and refactor our codebase if needs be.

I'll get back to you with the next steps in a couple days :)