r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Add correct text ordering and column merging to CL #83

Open conceptofmind opened 4 months ago

conceptofmind commented 4 months ago

The columns in CL need to be coalesced in the order:

"html_with_citations"
"html_columbia"
"html_lawbox"
"xml_harvard"
"html_anon_2020"
"html"
"plain_text"

The plain_text column only contains a smaller, lower-quality subset of the total data, so using only the plain_text column misses much of the information available in the other columns. Not every row has a value in every column. To get the largest quantity of proper CL data possible without additionally scraping the .txt files, the columns need to be merged in the order above.
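
A minimal sketch of that coalescing order, assuming the dump has already been read into a pandas DataFrame df with the columns listed above:

import functools

import pandas as pd

# Earlier columns win whenever they are non-null.
COLUMN_PRIORITY = [
    "html_with_citations",
    "html_columbia",
    "html_lawbox",
    "xml_harvard",
    "html_anon_2020",
    "html",
    "plain_text",
]

def coalesce_text(df: pd.DataFrame) -> pd.Series:
    # combine_first keeps the left value unless it is null, so folding over the
    # columns in priority order picks the first non-null value per row.
    return functools.reduce(
        lambda acc, col: acc.combine_first(df[col]),
        COLUMN_PRIORITY[1:],
        df[COLUMN_PRIORITY[0]],
    )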

Harvard, in cold cases and CAP, chose to use regex for the initial text extraction. This can be replaced with something like Trafilatura in one line.
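
For illustration, the two approaches on a single document string (the regex is the one used in the script later in this PR; the html_doc value is just a placeholder):

import re

import trafilatura

html_doc = "<p>Some opinion text ...</p>"  # placeholder for one of the CL markup columns

# Regex-based extraction: strip anything that looks like a tag.
text_regex = re.sub(r"<.+?>", "", html_doc)

# Trafilatura extraction in one line; returns None when nothing is extracted.
text_trafilatura = trafilatura.extract(html_doc, favor_precision=True)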

I tried to keep this similar to what zhenlinx had. I did not remove their script.

The script:

I iterated through the jsonl.gz file to make sure it looked ~normal.

Additional filtering and boilerplate removal will need to be added in another PR.

This leaves you with the "correct" initial CL data following guidelines by Harvard CAP and CL/Mike Lissner.

@blester125

conceptofmind commented 4 months ago

A side note. If I am to add Spark processing I will need to make additional adjustments to other parts of the scripts.

The combine_first calls are separated into the columns that need text extraction and the plain_text column. This was in case a different text extractor was chosen; Trafilatura seemed to over-filter the plain text if it was not handled independently.

I have run this semi-processed version and uploaded parquets.

craffel commented 4 months ago

Thanks for this. I don't see any reason to have both versions of the processing pipelines. Can you merge this into the existing one?

conceptofmind commented 4 months ago

Thanks for this. I don't see any reason to have both versions of the processing pipelines. Can you merge this into the existing one?

Sounds good. I will merge it into the existing one.

conceptofmind commented 4 months ago

I listed out several changes that should be made; currently the output format is not correct.

I can't see needing something like Spark for this dataset. If there are OOM or slowness issues due to pandas reading the whole CSV at once, it should be easy to update to 1) read the CSV line-by-line via a generator that uses the csv module from the stdlib, 2) parallelize the conversion from csv record to dolma record with multiprocessing.Pool.imap, and 3) pass the resulting iterator to the to_dolma function.

That should avoid reading the whole file at once and leverage the embarrassingly parallel nature of the task.
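
A minimal sketch of that recipe (read_csv_records, the convert_record_to_dolma stub, and the input path are placeholders; to_dolma and the output paths follow the script defaults later in this thread):

import csv
import multiprocessing as mp

from licensed_pile.write import to_dolma

def read_csv_records(csv_file):
    # Stream rows one at a time so the whole file never sits in memory.
    with open(csv_file, "r") as f:
        yield from csv.DictReader(f)

def convert_record_to_dolma(row):
    # Stand-in for the real per-record conversion (coalescing, metadata, etc.).
    return {"id": row.get("id"), "text": row.get("plain_text")}

if __name__ == "__main__":
    records = read_csv_records("opinions.csv")  # hypothetical input path
    with mp.Pool() as pool:
        dolma = pool.imap(convert_record_to_dolma, records)
        to_dolma(dolma, "data/courtlistener/v0/documents", "courtlistener.jsonl.gz", 1000)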

I tried to keep it as close to the original as possible. There are numerous different changes I would have personally made. I can integrate those instead as well.

blester125 commented 4 months ago

Also, if you are going to keep working on the same branch after a merge (which you should avoid), you should prefer rebasing onto main instead of merging it into your branch (which happened here https://github.com/r-three/common-pile/pull/83/commits/9df6741fb085c6d93a3acc604c20eb1397f8679b)

conceptofmind commented 4 months ago

I will resolve each of the comments. However, the text extractor should be decided upon before I do so. This is a bare-minimum reproduction of what was done previously in Spark, but following what was already there in the original PR for this project/dolma format.

conceptofmind commented 4 months ago

Also, if you are going to keep working on the same branch after a merge (which you should avoid), you should prefer rebasing onto main instead of merging it into your branch (which happened here 9df6741)

That was a "fat finger" haha. Noticed after it and hadn't squashed.

conceptofmind commented 4 months ago

@blester125 Do you have an opinion on trafilatura?

conceptofmind commented 4 months ago

@blester125

Given the comments on parallelization.

The csv can be iterated through in chunks of 20k rows with something like the below:

import pandas as pd

opinions_string = "opinions-2023-12-04"
csv_file = f'{opinions_string}.csv'

chunk_size = 20000

# Read the CSV file in chunks of `chunk_size` rows
for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
    ...  # processing code

I am not sure if this is what you had in mind.

Harvard CAP handled loading the entire file initially and then repartitioning.

There are a few ways to handle the csv and I am not opinionated.

We can also preprocess the large csv into smaller files as a separate initial step.

There also should not be OOM or slowness issues, as the decompressed csv is only around 240 GB.

blester125 commented 4 months ago

The most straightforward way of parallelism would probably be something like this:

data = read_csv_generator(csv_file)
with mp.Pool(args.processors) as pool:
  dolma = pool.imap(convert_record_to_dolma, data)
  to_dolma(dolma, ...)

You could either have the read_csv_generator output individual rows (and then pandas probably isn't the best fit) or you could have it output chunks like you mentioned (as long as passing the pandas dataframe to the new process isn't extremely expensive) and would just need to add something like dolma = itertools.chain(*dolma) before the to_dolma call.

Assuming there isn't too much contention with respect to the disk, I would think this could saturate a multicore machine. TBH I don't think this dataset is big enough that we need to be 100 percent efficient.

conceptofmind commented 4 months ago

The most straightforward way of parallelism would probably be something like this:

data = read_csv_generator(csv_file)
with mp.Pool(args.processors) as pool:
  dolma = pool.imap(convert_record_to_dolma, data)
  to_dolma(dolma, ...)

You could either have the read_csv_generator output individual rows (and then pandas probably isn't the best fit) or you could have it output chunks like you mentioned (as long as passing the pandas dataframe to the new process isn't extremely expensive) and would just need to add something like dolma = itertools.chain(*dolma) before the to_dolma call.

Assuming there isn't too much contention with respect to the disk, I would think this could saturate a multicore machine. TBH I don't think this dataset is big enough that we need to be 100 percent efficient.

One dump is around 240 GB of csv decompressed. Many mid-sized machines could likely load and process it in one go.

I just spoke to luca and currently, dolma doesn't support writing to parquet. So I would need to make some additional changes to handle the dolma format when doing the chunked pandas writes.

There is the option to pre-chunk the entire large dataset into smaller ones as well.

Very basic implementation:

import os
import subprocess

import pandas as pd

opinions_string = "opinions-2023-12-04"

# Decompress the bulk dump
command = f"bunzip2 {opinions_string}.csv.bz2"
subprocess.run(command, capture_output=True, text=True, shell=True)

csv_file = f'{opinions_string}.csv'

chunk_size = 20000

output_dir = f'{opinions_string}'
os.makedirs(output_dir, exist_ok=True)

# Write each chunk of the large csv out as its own smaller csv file
for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
    chunk_file = os.path.join(output_dir, f'chunk_{i+1}.csv')
    chunk.to_csv(chunk_file)
    print(f"Processed chunk {i+1}")

os.remove(f"{opinions_string}.csv")

This would speed up the multiprocessing at the cost of a preprocessing step.

Before I add something to the PR, I am going to comment here.

As an aside, I am trying to convince them not to upload in one giant csv.bz2, which can't be read in parallel by the major frameworks like Spark.

wildphoton commented 4 months ago

Using pandas, even reading in chunks, won't avoid having all rows in memory IMO. If we expect people to run our code, shall we make it more memory friendly instead of speed efficient? Previously, I read and processed the file into dolma line by line on the fly, and you can do whatever you want in process_row(). Although this is not parallelized, it is not too slow to iterate over the whole file. What do you think @conceptofmind @blester125

with open(file_path, "r") as csvfile:
    reader = csv.DictReader(csvfile)

    # Yield a dictionary for each row
    for row in reader:
        dolma_dict = process_row(row)
        yield dolma_dict
conceptofmind commented 4 months ago

Using pandas, even reading in chunks, won't avoid having all rows in memory IMO. If we expect people to run our code, shall we make it more memory friendly instead of speed efficient? Previously, I read and processed the file into dolma line by line on the fly, and you can do whatever you want in process_row(). Although this is not parallelized, it is not too slow to iterate over the whole file. What do you think @conceptofmind @blester125

with open(file_path, "r") as csvfile:
    reader = csv.DictReader(csvfile)

    # Yield a dictionary for each row
    for row in reader:
        dolma_dict = process_row(row)
        yield dolma_dict

The above pandas chunked read runs on my local desktop, which does not have close to enough RAM to load the entire 250 GB into memory. I would not think that pandas read_csv handles memory mapping well enough to fit such a large dataset into memory, but I could be wrong.

Reading in chunks should function similarly to an iterator.

I agree that iterating across the entire file is fast enough. I also have no strong opinion on this.

I am perfectly fine with what works best. If we really want people to be able to process this locally very easily, maybe adding that additional preprocessing step to split the single large csv file into multiple files is worth it.

We could probably even "ensure" that each smaller file is around 128MB - 256MB.
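
A rough sketch of that kind of size-targeted split (the 256 MB target, sampling size, and file naming are just for illustration):

import os

import pandas as pd

def split_csv(csv_path, output_dir, target_bytes=256 * 1024 * 1024, sample_rows=10_000):
    """Split a large csv into parts of very roughly target_bytes each."""
    os.makedirs(output_dir, exist_ok=True)
    # Estimate the average in-memory row size from a small sample to pick a chunksize.
    sample = pd.read_csv(csv_path, nrows=sample_rows)
    avg_row_bytes = sample.memory_usage(deep=True).sum() / max(len(sample), 1)
    rows_per_part = max(int(target_bytes / avg_row_bytes), 1)
    for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=rows_per_part)):
        chunk.to_csv(os.path.join(output_dir, f"part_{i:05d}.csv"), index=False)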

blester125 commented 4 months ago

I just spoke to luca and currently, dolma doesn't support writing to parquet. So I would need to make some additional changes to handle the dolma format when doing the chunked pandas writes.

We don't really need to worry about this, we just need to output dolma files with the to_dolma function. Huggingface supports the format well enough that it can automatically make and cache parquet files. We don't need to actually publish parquet files.
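
For example, once the shards are written with to_dolma they can be loaded directly with datasets (the path below is the script's default output_dir):

from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files="data/courtlistener/v0/documents/*.jsonl.gz",
    split="train",
)
print(ds[0]["text"][:200])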

conceptofmind commented 4 months ago

I just spoke to luca and currently, dolma doesn't support writing to parquet. So I would need to make some additional changes to handle the dolma format when doing the chunked pandas writes.

We don't really need to worry about this, we just need to output dolma files with the to_dolma function. Huggingface supports the format well enough that it can automatically make and cache parquet files. We don't need to actually publish parquet files.

Ok. I am going to comment some changes above before making changes to the PR, so that a few things can be ironed out first.

Docs about pandas csv chunksize:

iterator : bool, default False
Return TextFileReader object for iteration or getting chunks with get_chunk().

chunksize : int, optional
Number of lines to read from the file per chunk. Passing a value will cause the function to return a TextFileReader object for iteration. See the [IO Tools docs](https://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking) for more information on iterator and chunksize.
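
As a quick illustration of those two options (the file path is a placeholder):

import pandas as pd

# chunksize: iterate over fixed-size DataFrame chunks.
for chunk in pd.read_csv("opinions.csv", chunksize=20_000):
    print(len(chunk))  # per-chunk processing would go here

# iterator: pull chunks on demand with get_chunk().
reader = pd.read_csv("opinions.csv", iterator=True)
first_chunk = reader.get_chunk(20_000)
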
conceptofmind commented 4 months ago

@blester125 Responded to each of the comments.

An updated example script for loading the entire file would look like:

import argparse
import logging
import re
from datetime import datetime

import pandas as pd

from licensed_pile.licenses import PermissiveLicenses
from licensed_pile.logs import configure_logging
from licensed_pile.write import to_dolma

def process_court_listener(file_path):
    """
    This function performs several operations on a given file:

    1. Reads the decompressed Court Listener (CL) data dump from the provided file path.
    2. Adds metadata and source information to the data.
    3. Drops unnecessary columns based on Harvard CAP input.
    4. Selects the first non-null value when iterating over columns based on CL documentation.
    5. Drops repeated XML/HTML columns.
    6. Extracts HTML/XML following Harvard CAP guidelines.
    7. Selects the first non-null value when iterating over extracted text and plain text columns.
    8. Handles null values based on text row.
    9. Writes out the processed data to dolma format.

    Parameters:
    file_path (str): The path to the file to be processed.
    """

    logger = logging.getLogger("court-listener-opinion")

    df = pd.read_csv(file_path)

    df["source"] = args.source_name
    df["added"] = datetime.utcnow().isoformat()

    # rename the column date_created -> created
    df = df.rename(columns={"date_created": "created"})

    # add the metadata column including the author_str, license, and url
    df["metadata"] = df.apply(
        lambda x: {
            "license": str(PermissiveLicenses.PD),
            "url": x["download_url"],
            "author": x["author_str"],
        },
        axis=1,
    )

    df = df.drop(
        columns=[
            "date_modified",
            "author_str",
            "per_curiam",
            "joined_by_str",
            "type",
            "sha1",
            "page_count",
            "local_path",
            "extracted_by_ocr",
            "author_id",
            "cluster_id",
            "download_url",
        ]
    )

    # For each row, select the first non-null value when iterating over columns in this order:
    # html_with_citations
    # html_columbia
    # html_lawbox
    # xml_harvard
    # html_anon_2020
    # html
    # plain_text
    # This follows the procedure in the Court Listener documentation.

    df["text"] = (
        df["html_with_citations"]
        .combine_first(df["html_columbia"])
        .combine_first(df["html_lawbox"])
        .combine_first(df["xml_harvard"])
        .combine_first(df["html_anon_2020"])
        .combine_first(df["html"])
    )

    # drop the columns from combine_first to avoid redundancy
    df = df.drop(
        columns=[
            "html",
            "html_anon_2020",
            "html_lawbox",
            "html_columbia",
            "xml_harvard",
            "html_with_citations",
        ]
    )

    # drop null values in the text column
    df = df.dropna(subset=["text"])

    # extract text from html and xml following Harvard CAP
    # They used re.sub(r"<.+?>", "", text) to strip markup tags
    df["text"] = df["text"].apply(lambda x: re.sub(r"<.+?>", "", x))

    # Combine DataFrame objects by filling null values
    # with non-null values from the other selected DataFrames.
    df["text"] = df["text"].combine_first(df["plain_text"])

    # drop the plain text column
    df = df.drop(columns=["plain_text"])

    # drop null values in the text column
    df = df.dropna(subset=["text"])

    # return a dictionary for each row - dolma format
    return df.to_dict(orient="records")

def main(args):
    example = process_court_listener(args.input_file)
    to_dolma(example, args.output_dir, args.output_file_base_name, args.shard_size)
    logger.info(f"Saved {args.input_file} as dolma shared files at {args.output_dir}")

if __name__ == "__main__":
    logger = configure_logging("court-listener-opinion")

    parser = argparse.ArgumentParser(description="Convert csv data to dolma.")
    parser.add_argument(
        "--output_dir",
        default="data/courtlistener/v0/documents",
        help="Where the dolma formatted data goes.",
    )
    parser.add_argument(
        "--shard_size",
        type=int,
        default=1000,
        help="The number of documents to store in each shard.",
    )
    parser.add_argument(
        "--input_file",
        required=True,
        help="The path to the csv file to convert.",
    )
    parser.add_argument(
        "--output_file_base_name",
        default="courtlistener.jsonl.gz",
        help="The base name of the output file.",
    )
    parser.add_argument(
        "--source_name",
        default="SOURCE_NAME",
        help="The name of the source.",
    )
    args = parser.parse_args()
    main(args)

The text extraction question related to Trafilatura still needs to be decided. If we are not using Trafilatura, the other combine_first can be removed. This is what Trafilatura text extraction looks like: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project
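
If Trafilatura is chosen, the single regex line in the script above would become something like this (favor_precision per the benchmark discussion further down):

import trafilatura

# Replace the regex-based tag stripping with Trafilatura extraction.
# extract() returns None when nothing can be extracted, which the later
# combine_first / dropna calls treat as a missing value.
df["text"] = df["text"].apply(lambda x: trafilatura.extract(x, favor_precision=True))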

Example output:

{"id": 6881, "created": "2010-04-25 05:22:21+00", "source": "SOURCE_NAME", "added": "2024-06-04T22:16:15.833924", "metadata": {"license": "Public Domain", "url": "http://www.ca5.uscourts.gov/opinions\\\\pub\\\\94/94-60337.CV0.wpd.pdf", "author": NaN}, "text": "United States Court of Appeals,\\n\\n ......."}
conceptofmind commented 4 months ago

If you want to iterate then something like this should work I believe:

    for chunk in pd.read_csv(file_path, chunksize=args.shard_size):
        df = chunk
        ...
        yield df.to_dict(orient="records")
blester125 commented 4 months ago

Yeah, I think reading in chunks and passing the chunks to a multiprocessing pool would be the way to go

conceptofmind commented 4 months ago

Yeah, I think reading in chunks and passing the chunks to a multiprocessing pool would be the way to go

Ok I will add that to the main function.

conceptofmind commented 4 months ago

I tested something like this given your comment:

import itertools
import multiprocessing as mp

import pandas as pd

def iterate_over_chunks(file_path):
    for chunk in pd.read_csv(file_path, chunksize=args.shard_size):
        yield chunk

def main(args):
    data = iterate_over_chunks(args.input_file)
    with mp.Pool(args.processors) as pool:
        dolma_chunks = pool.imap(process_court_listener, data)
        dolma = itertools.chain(*dolma_chunks)
        to_dolma(dolma, args.output_dir, args.output_file_base_name, args.shard_size)

The speedup is not massively significant; the limitations of pandas may be showing here.

As noted by Zhenlin Xu, single-process is still reasonable and can be done quickly as well, even with just iterating alone.

Tried chaining from the iterable as well but this was even slower:

dolma_chunks = pool.imap(process_court_listener, ((chunk) for chunk in data))
dolma = itertools.chain.from_iterable(dolma_chunks)

We could always load into something like DuckDB and coalesce there instead of using Spark, though that also requires a larger machine.
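
A sketch of what the DuckDB route could look like (column names follow the coalescing order above; the csv path and output file are placeholders):

import duckdb

con = duckdb.connect()
# COALESCE picks the first non-null column per row, mirroring the combine_first chain.
con.sql(
    """
    SELECT id,
           COALESCE(html_with_citations, html_columbia, html_lawbox,
                    xml_harvard, html_anon_2020, html, plain_text) AS text
    FROM read_csv_auto('opinions-2023-12-04.csv')
    """
).write_parquet("opinions_coalesced.parquet")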

blester125 commented 4 months ago

I see, I think this version makes sense, even if the speedup isn't the best. Let's get all the changes into the PR and get it merged!

conceptofmind commented 4 months ago

I see, I think this version makes sense, even if the speedup isn't the best. Let's get all the changes into the PR and get it merged!

Will do

wildphoton commented 4 months ago

I have tried to use BeautifulSoup for html/xml parsing and it seems to work well. I will share the processed data soon. @conceptofmind @blester125

conceptofmind commented 4 months ago

I have tried to use BeautifulSoup for html/xml parsing and it seems to work well. I will share the processed data soon. @conceptofmind @blester125

Hi,

Pre-processing steps need to be applied to the HTML/XML before text extraction. CourtListener contains malformed HTML/XML that requires upstreamed content from CAP.

BS4 uses its own lxml backend extraction, which is typically of worse quality than Trafilatura extraction when precision is set to True. I have already processed all of the data and uploaded the newest dump. I am waiting on HF to remove my rate limits so that I can finish it.

wildphoton commented 4 months ago

Hi,

Pre-processing steps need to be applied to the HTML/XML before text extraction. CourtListener contains malformed HTML/XML that requires upstreamed content from CAP. BS4 uses its own lxml backend extraction, which is typically of worse quality than Trafilatura extraction when precision is set to True. I have already processed all of the data and uploaded the newest dump. I am waiting on HF to remove my rate limits so that I can finish it.

What kind of preprocessing steps are needed? What upstream content from CAP does it require? The current PR does not show these. I just applied BS4 text extraction on the raw data and it works great. If Trafilatura works better, could you include it and all the preprocessing you mentioned in the CR please? A stand-alone dataset won't be very helpful. Thanks.

conceptofmind commented 4 months ago

Hi, Pre-processing steps need to be applied to the HTML/XML before text extraction. CourtListener contains malformed HTML/XML that requires upstreamed content from CAP. BS4 uses its own lxml backend extraction, which is typically of worse quality than Trafilatura extraction when precision is set to True. I have already processed all of the data and uploaded the newest dump. I am waiting on HF to remove my rate limits so that I can finish it.

What kind of preprocessing steps are needed? What upstream content from CAP does it require? The current PR does not show these.

In the issue, I stated a separate PR is required to handle additional processing. This requires the new CAP data dumps to have the opinionated fixes by CL applied and streamlined into the bulk CL data dump. Jack Cushman, Kevin Ramirez, Mike Lissner, Bill Palin, and I are working on the most efficient way to provide this set of fixes to the community.

I just applied BS4 text extraction on the raw data and it works great. If Trafilatura works better, could you include it and all the preprocessing you mentioned in the CR please?

You can review the precision and F1 of different text extractors here:

| Model         | Mean Precision | Mean F1 | Median Precision | Median F1 |
|---------------|----------------|---------|------------------|-----------|
| Trafilatura   | 0.913          | 0.883   | 0.989            | 0.957     |
| DOM Distiller | 0.894          | 0.858   | 0.983            | 0.959     |
| Web2Text      | 0.797          | 0.841   | 0.885            | 0.917     |
| Boilerpipe    | 0.908          | 0.834   | 0.973            | 0.946     |
| Dragnet       | 0.901          | 0.823   | 0.980            | 0.943     |
| BTE           | 0.796          | 0.817   | 0.927            | 0.936     |
| Newspaper3k   | 0.896          | 0.816   | 0.994            | 0.958     |
| news-please   | 0.895          | 0.815   | 0.994            | 0.958     |
| Goose3        | 0.899          | 0.810   | 0.999            | 0.940     |
| BoilerNet     | 0.840          | 0.798   | 0.944            | 0.895     |
| ExtractNet    | 0.858          | 0.791   | 0.963            | 0.911     |
| jusText       | 0.794          | 0.759   | 0.949            | 0.904     |
| lxml Cleaner  | 0.615          | 0.717   | 0.670            | 0.798     |
| html_text     | 0.567          | 0.683   | 0.506            | 0.667     |
| BS4           | 0.563          | 0.680   | 0.506            | 0.669     |
| inscriptis    | 0.557          | 0.673   | 0.483            | 0.649     |
| XPath Text    | 0.550          | 0.664   | 0.510            | 0.674     |

Trafilatura with favor_precision set to True will have even better results than the above. As you can see, BS4 is ranked quite low. Regardless of the results above, if we want to be consistent with Dolma, which is used throughout this project, we should use Trafilatura. It is a single-line adjustment to the above code: trafilatura.extract(filecontent, favor_precision=True).

A stand-alone dataset won't be very helpful. Thanks.

For clarification:

CAP has its own group chat and fixes
CL has its own group chat and fixes
There are parts in which both are not clearly communicated due to "differences", I would call it
There are pieces of processing in which I am providing input directly to both of them for improving model training
Mike and I are improving the documentation for CL so there are parts that are more clear to the community
There are upstreams that need to be independently made from CAP to CL
There are numerous instances of boilerplate and malformed text which we have already post-processed out

There is a bit of jumping around when trying to communicate all this

blester125 commented 4 months ago

How variable is the markup in each example for this dataset? Part of the point of splitting work by datasets was to have someone work with the data and figure out patterns in the way a particular source formats their data so we can create a better text extractor/clean up for that data source instead of just dumping it into an off the shelf tool that tries to handle everything

conceptofmind commented 4 months ago

How variable is the markup in each example for this dataset? Part of the point of splitting work by datasets was to have someone work with the data and figure out patterns in the way a particular source formats their data so we can create a better text extractor/clean up for that data source instead of just dumping it into an off the shelf tool that tries to handle everything

The most variability between CAP and CL is currently in "xml_harvard" which contains a lot of useful information. These documents require opinionated fixes from places like here: https://github.com/freelawproject/courtlistener/blob/a3bc7b472a1736bc0cb0e92cfa096dec12960e17/cl/corpus_importer/management/commands/harvard_opinions.py#L286

Which can be seen in a merged set of json here: https://github.com/freelawproject/opinionated/tree/main/data/harvard

This needs to be upstreamed in the new current CAP dumps to CL. We are working on this.

Malformed HTML/XML can be seen here: <span class="star-pagination" number="848" pagescheme="<span class="citation" data-id="4498419">.

There are numerous different instances of these cases. Once the upstreamed dumps are merged, CAP said this issue should not occur. I have written individual regexes to cover as many of these edge cases as I could find. There are additionally child nodes being removed in relation to the edge case above as well.

There are also places in which I am writing a regex to standardize tags such as '<jurissystem' -> '<div' for extraction. These need to be handled independently, as removing the incorrect tags first would cause issues when applying the upstream fixes between CAP and CL.
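
For illustration, that kind of tag standardization could look roughly like this (the actual patterns used are not shown in this thread):

import re

def standardize_tags(html: str) -> str:
    # Rewrite non-standard '<jurissystem ...>' opening and closing tags to '<div>'
    # so downstream extractors treat them as ordinary block elements.
    html = re.sub(r"<jurissystem\b", "<div", html)
    html = re.sub(r"</jurissystem>", "</div>", html)
    return html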

Other than that it is cleaning out standard things like boilerplate and OCR errors which we have been annotating.

There are other things that have been fixed and additional input I am waiting on from CAP/CL before adding to the next PR.

I messaged with Colin and will work on documenting this as clearly as possible.

conceptofmind commented 3 months ago

Additional changes will need to be made to drop the Harvard XML column and replace it with the new fixed HTML column from the updated dumps here: https://huggingface.co/datasets/conceptofmind/CAP

This will require the two datasets to be merged after appropriate column ordering from CAP.

This fixes the star pagination issue with the malformed HTML/XML seen here: <span class="star-pagination" number="848" pagescheme="<span class="citation" data-id="4498419">.

It would be preferred to get this merged into the actual CL dumps, but this is pending.