conceptofmind opened 4 months ago
A side note. If I am to add Spark processing I will need to make additional adjustments to other parts of the scripts.
The combine steps are separated into what needs to be extracted and the plain text. This was in case a different text extractor is chosen; Trafilatura seemed to over-filter the plain text if not done independently.
I have run this semi-processed version and uploaded parquets.
Thanks for this. I don't see any reason to have both versions of the processing pipelines, can you merge this into the existing one?
Sounds good. I will merge it into the existing one.
I listed out several changes that should be made, currently the output format is not correct.
I can't see needing something like Spark for this dataset. If there are OoM or slowness issues due to pandas reading the whole CSV at once, it should be easy to update to:

1) read the CSV line-by-line via a generator that uses the `csv` module from the stdlib,
2) parallelize the conversion from csv record to dolma record with `multiprocessing.Pool.imap`,
3) pass the resulting iterator to the `to_dolma` function.

That should avoid reading the whole file at once and leverage the embarrassingly parallel nature of the task.
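The three steps above might be sketched as follows; `read_csv_records`, `convert_record_to_dolma`, and `convert_file` are hypothetical names, and in the real script the iterator would be handed to `to_dolma` rather than collected into a list:

```python
import csv
import multiprocessing as mp
import os
import tempfile

def read_csv_records(path):
    # 1) stream the CSV row by row with the stdlib csv module
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def convert_record_to_dolma(record):
    # hypothetical stand-in for the real csv-record -> dolma-record conversion
    return {"id": record["id"], "text": record.get("plain_text", "")}

def convert_file(path, processors=2):
    # 2) parallelize the per-record conversion with imap;
    # 3) in the real script the resulting iterator would go to to_dolma(...)
    with mp.Pool(processors) as pool:
        return list(pool.imap(convert_record_to_dolma, read_csv_records(path)))

# tiny demo file standing in for the large dump
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
    f.write("id,plain_text\n1,foo\n2,bar\n")
    demo_path = f.name

records = convert_file(demo_path)
os.remove(demo_path)
# records == [{"id": "1", "text": "foo"}, {"id": "2", "text": "bar"}]
```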
I tried to keep it as close to the original as possible. There are numerous different changes I would have personally made. I can integrate those instead as well.
Also, if you are going to keep working on the same branch after a merge (which you should avoid), you should prefer rebasing onto main instead of merging it into your branch (which happened here https://github.com/r-three/common-pile/pull/83/commits/9df6741fb085c6d93a3acc604c20eb1397f8679b)
I will resolve each of the comments. However, the text extractor should be decided upon before I do so. This is a bare minimal reproduction of what was done previously in Spark but following what was already there in the original PR for this project/dolma format.
That was a "fat finger" haha. Noticed after it and hadn't squashed.
@blester125 Do you have an opinion on trafilatura?
@blester125, given the comments on parallelization:
The csv can be iterated through every 20k rows following something like the below:

```python
csv_file = f"{opinions_string}.csv"
chunk_size = 20000

# Read the CSV file in chunks
for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
    ...  # processing code
```
I am not sure if this is what you had in mind.
Harvard CAP handled loading the entire file initially and then repartitioning.
There are a few ways to handle the csv and I am not opinionated.
We can also preprocess the large csv into smaller files as a separate initial step.
There also should not be OOM or slowness issues, as the file is only around a 240 GB csv.
The most straightforward way of parallelism would probably be something like this:
```python
data = read_csv_generator(csv_file)
with mp.Pool(args.processors) as pool:
    dolma = pool.imap(convert_record_to_dolma, data)
    to_dolma(dolma, ...)
```

You could either have the `read_csv_generator` output individual rows (and then pandas probably isn't the best fit) or have it output chunks like you mentioned (as long as passing the pandas DataFrame to the new process isn't extremely expensive); in the chunked case you would just need to add something like `dolma = itertools.chain(*dolma)` before the `to_dolma` call.
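The flattening step can be illustrated with plain lists standing in for the per-chunk worker output:

```python
import itertools

# each pool worker returns a list of dolma records for its chunk
dolma_chunks = [[{"id": 1}, {"id": 2}], [{"id": 3}]]

# flatten lazily so to_dolma(...) sees a single stream of records
dolma = itertools.chain.from_iterable(dolma_chunks)
records = list(dolma)
# records == [{"id": 1}, {"id": 2}, {"id": 3}]
```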
Assuming there isn't too much contention with respect to the disk, I would think this could saturate a multicore machine. TBH I don't think this dataset is big enough that we need to be 100 percent efficient.
One dump is around 240 GB csv decompressed. Many mid-sized machines could likely load it and process in one go.
I just spoke to Luca and currently, dolma doesn't support writing to parquet. So I would need to make some additional changes to handle the dolma format when doing the chunked pandas writes.
There is the option to pre-chunk the entire large dataset into smaller ones as well.
Very basic implementation:

```python
opinions_string = "opinions-2023-12-04"

command = f"bunzip2 {opinions_string}.csv.bz2"
subprocess.run(command, capture_output=True, text=True, shell=True)

csv_file = f"{opinions_string}.csv"
chunk_size = 20000

output_dir = f"{opinions_string}"
os.makedirs(output_dir, exist_ok=True)

for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=chunk_size)):
    # use a separate variable so we don't clobber the source path mid-iteration
    chunk_file = os.path.join(output_dir, f"chunk_{i+1}.csv")
    chunk.to_csv(chunk_file)
    print(f"Processed chunk {i+1}")

os.remove(f"{opinions_string}.csv")
```
This would speed up the mp at the cost of a preprocessing step.
Before I add something to the PR, I am going to comment here.
As an aside, I am trying to convince them not to upload in one giant csv.bz2, which can't be read in parallel by the major frameworks like Spark.
Using pandas, even reading in chunks, won't avoid having all rows in memory IMO. If we expect people to run our code, shall we make it more memory-friendly instead of speed-efficient? Previously, I read and processed the file into dolma line by line on the fly, and you can do whatever you want in `process_row()`. Although this is not parallelized, it is not too slow to iterate over the whole file. What do you think @conceptofmind @blester125

```python
with open(file_path, "r") as csvfile:
    reader = csv.DictReader(csvfile)
    # Yield a dictionary for each row
    for row in reader:
        dolma_dict = process_row(row)
        yield dolma_dict
```
The above pandas read-in-chunks approach runs on my local desktop, which does not have anywhere close to enough RAM to load the entire 250 GB into memory. I would not think that pandas' read_csv would handle memory mapping well enough to fit such a large dataset into memory, but I could be wrong. Reading in chunks should function similarly to an iterator.
I agree that iterating across the entire file is fast enough. I also have no strong opinion on this.
I am perfectly fine with what works best. If we really want people to process this locally very easily maybe adding that additional preprocessing step to separate the single large csv file into multiple is worth it.
We could probably even "ensure" that each smaller file is around 128MB - 256MB.
> I just spoke to Luca and currently, dolma doesn't support writing to parquet. So I would need to make some additional changes to handle the dolma format when doing the chunked pandas writes.
We don't really need to worry about this; we just need to output dolma files with the `to_dolma` function. Hugging Face supports the format well enough that it can automatically make and cache parquet files. We don't need to actually publish parquet files.
Ok. I am going to comment some changes above before making changes to the PR. So that a few things can be ironed out first.
Docs about pandas csv chunksize:

`iterator` : bool, default False
    Return TextFileReader object for iteration or getting chunks with get_chunk().

`chunksize` : int, optional
    Number of lines to read from the file per chunk. Passing a value will cause the function to return a TextFileReader object for iteration. See the [IO Tools docs](https://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking) for more information on iterator and chunksize.
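To illustrate the lazy behavior the docs describe, a toy example with an in-memory CSV standing in for the dump:

```python
import io
import pandas as pd

# 10-row CSV standing in for the Court Listener dump
csv_data = io.StringIO("id,plain_text\n" + "\n".join(f"{i},text{i}" for i in range(10)))

# chunksize makes read_csv return a lazy TextFileReader instead of one DataFrame
sizes = [len(chunk) for chunk in pd.read_csv(csv_data, chunksize=4)]
# sizes == [4, 4, 2]
```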
@blester125 Responded to each of the comments.
An updated example script for loading the entire file would look like:
```python
import argparse
import logging
import re
from datetime import datetime

import pandas as pd

from licensed_pile.licenses import PermissiveLicenses
from licensed_pile.logs import configure_logging
from licensed_pile.write import to_dolma


def process_court_listener(file_path):
    """
    This function performs several operations on a given file:

    1. Reads the decompressed Court Listener (CL) data dump from the provided file path.
    2. Adds metadata and source information to the data.
    3. Drops unnecessary columns based on Harvard CAP input.
    4. Selects the first non-null value when iterating over columns based on CL documentation.
    5. Drops repeated XML/HTML columns.
    6. Extracts HTML/XML following Harvard CAP guidelines.
    7. Selects the first non-null value when iterating over extracted text and plain text columns.
    8. Handles null values based on text row.
    9. Writes out the processed data to dolma format.

    Parameters:
        file_path (str): The path to the file to be processed.
    """
    logger = logging.getLogger("court-listener-opinion")

    df = pd.read_csv(file_path)
    df["source"] = args.source_name
    df["added"] = datetime.utcnow().isoformat()

    # rename the column date_created -> created
    df = df.rename(columns={"date_created": "created"})

    # add the metadata column including the author_str, license, and url
    df["metadata"] = df.apply(
        lambda x: {
            "license": str(PermissiveLicenses.PD),
            "url": x["download_url"],
            "author": x["author_str"],
        },
        axis=1,
    )

    df = df.drop(
        columns=[
            "date_modified",
            "author_str",
            "per_curiam",
            "joined_by_str",
            "type",
            "sha1",
            "page_count",
            "local_path",
            "extracted_by_ocr",
            "author_id",
            "cluster_id",
            "download_url",
        ]
    )

    # For each row, select the first non-null value when iterating over columns in this order:
    #   html_with_citations
    #   html_columbia
    #   html_lawbox
    #   xml_harvard
    #   html_anon_2020
    #   html
    #   plain_text
    # This follows the procedure in the Court Listener documentation.
    df["text"] = (
        df["html_with_citations"]
        .combine_first(df["html_columbia"])
        .combine_first(df["html_lawbox"])
        .combine_first(df["xml_harvard"])
        .combine_first(df["html_anon_2020"])
        .combine_first(df["html"])
    )

    # drop the columns from combine_first to avoid redundancy
    df = df.drop(
        columns=[
            "html",
            "html_anon_2020",
            "html_lawbox",
            "html_columbia",
            "xml_harvard",
            "html_with_citations",
        ]
    )

    # drop null values in the text column
    df = df.dropna(subset=["text"])

    # extract text from html and xml following Harvard CAP
    # They used r"<.+?>", ""
    df["text"] = df["text"].apply(lambda x: re.sub(r"<.+?>", "", x))

    # Combine DataFrame objects by filling null values
    # with non-null values from the other selected DataFrames.
    df["text"] = df["text"].combine_first(df["plain_text"])

    # drop the plain text column
    df = df.drop(columns=["plain_text"])

    # drop null values in the text column
    df = df.dropna(subset=["text"])

    # return a dictionary for each row - dolma format
    return df.to_dict(orient="records")


def main(args):
    example = process_court_listener(args.input_file)
    to_dolma(example, args.output_dir, args.output_file_base_name, args.shard_size)
    logger.info(f"Saved {args.input_file} as dolma shared files at {args.output_dir}")


if __name__ == "__main__":
    logger = configure_logging("court-listener-opinion")
    parser = argparse.ArgumentParser(description="Convert csv data to dolma.")
    parser.add_argument(
        "--output_dir",
        default="data/courtlistener/v0/documents",
        help="Where the dolma formatted data goes.",
    )
    parser.add_argument(
        "--shard_size",
        default=1000,
        help="The number of documents to store in each shard.",
    )
    parser.add_argument(
        "--input_file",
        required=True,
        help="The path to the csv file to convert.",
    )
    parser.add_argument(
        "--output_file_base_name",
        default="courtlistener.jsonl.gz",
        help="The base name of the output file.",
    )
    parser.add_argument(
        "--source_name",
        default="SOURCE_NAME",
        help="The name of the source.",
    )
    args = parser.parse_args()
    main(args)
```
The text extraction question related to trafilatura needs to be decided. If not using trafilatura, the other `combine_first` can be removed. This is what trafilatura text extraction looks like: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project
Example output:
```
{"id": 6881, "created": "2010-04-25 05:22:21+00", "source": "SOURCE_NAME", "added": "2024-06-04T22:16:15.833924", "metadata": {"license": "Public Domain", "url": "http://www.ca5.uscourts.gov/opinions\\\\pub\\\\94/94-60337.CV0.wpd.pdf", "author": NaN}, "text": "United States Court of Appeals,\\n\\n ......."}
```
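For reference, the bare regex extraction the script currently applies (the Harvard CAP `r"<.+?>"` pattern) behaves like this on a made-up snippet (not real CL markup):

```python
import re

html = '<p>United States Court of Appeals,</p> <span class="citation">Fifth Circuit.</span>'
# Harvard CAP's approach: strip every tag, keep the text between them
text = re.sub(r"<.+?>", "", html)
# text == 'United States Court of Appeals, Fifth Circuit.'
```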
If you want to iterate then something like this should work I believe:
```python
for chunk in pd.read_csv(file_path, chunksize=args.shard_size):
    df = chunk
    ...
    yield df.to_dict(orient="records")
```
Yeah, I think reading in chunks and passing the chunks to a multiprocessing pool would be the way to go
Ok I will add that to the main function.
I tested something like this given your comment:
```python
def iterate_over_chunks(file_path):
    for chunk in pd.read_csv(file_path, chunksize=args.shard_size):
        yield chunk

def main(args):
    data = iterate_over_chunks(args.input_file)
    with mp.Pool(args.processors) as pool:
        dolma_chunks = pool.imap(process_court_listener, data)
        dolma = itertools.chain(*dolma_chunks)
        to_dolma(dolma, args.output_dir, args.output_file_base_name, args.shard_size)
```
The speedup is not massively significant; the limitations of pandas may be showing here. As noted by Zhenlin Xu, single-process is still reasonable and can be done quickly as well, even with just iterating alone.
Tried chaining from the iterable as well but this was even slower:
```python
dolma_chunks = pool.imap(process_court_listener, (chunk for chunk in data))
dolma = itertools.chain.from_iterable(dolma_chunks)
```
We can always load into something like DuckDB and coalesce instead of using Spark, though that requires a larger machine too.
I see, I think this version makes sense, even if the speedup isn't the best. Let's get all the changes into the PR and get it merged!
Will do
I have tried to use BeautifulSoup for HTML/XML parsing and it seems to be working well. I will share the processed data soon. @conceptofmind @blester125
Hi,
Pre-processing steps need to be applied to the HTML/XML before text extraction. CourtListener contains malformed HTML/XML that requires upstreamed content from CAP. BS4 uses its own lxml backend extraction, which is typically of worse quality than Trafilatura extraction with precision set to True. I have already processed all of the data and uploaded the newest dump. I am waiting on HF to remove my rate limits so that I can finish it.
What kind of preprocessing steps are needed? What upstream content from CAP does it require? The current PR does not show these. I just applied BS4 text extraction on the raw data and it works great. If Trafilatura works better, could you include it and all the preprocessing you mentioned in the CR please? A stand-alone dataset won't be very helpful. Thanks.
> What kind of preprocessing steps are needed? What upstream content from CAP does it require? The current PR does not show these.
In the issue, I stated a separate PR is required to handle additional processing. This requires the new CAP data dumps to have the opinionated fixes by CL applied and streamlined into the bulk CL data dump. Jack Cushman, Kevin Ramirez, Mike Lissner, Bill Palin, and I are working on the most efficient way to provide this set of fixes to the community.
> I just applied BS4 text extraction on the raw data and it works great. If Trafilatura works better, could you include it and all the preprocessing you mentioned in the CR please?
You can review the precision and F1 of different text extractors here:
| Model | Mean Precision | Mean F1 | Median Precision | Median F1 |
|---------------|----------------|---------|------------------|-----------|
| Trafilatura | 0.913 | 0.883 | 0.989 | 0.957 |
| DOM Distiller | 0.894 | 0.858 | 0.983 | 0.959 |
| Web2Text | 0.797 | 0.841 | 0.885 | 0.917 |
| Boilerpipe | 0.908 | 0.834 | 0.973 | 0.946 |
| Dragnet | 0.901 | 0.823 | 0.980 | 0.943 |
| BTE | 0.796 | 0.817 | 0.927 | 0.936 |
| Newspaper3k | 0.896 | 0.816 | 0.994 | 0.958 |
| news-please | 0.895 | 0.815 | 0.994 | 0.958 |
| Goose3 | 0.899 | 0.810 | 0.999 | 0.940 |
| BoilerNet | 0.840 | 0.798 | 0.944 | 0.895 |
| ExtractNet | 0.858 | 0.791 | 0.963 | 0.911 |
| jusText | 0.794 | 0.759 | 0.949 | 0.904 |
| lxml Cleaner | 0.615 | 0.717 | 0.670 | 0.798 |
| html_text | 0.567 | 0.683 | 0.506 | 0.667 |
| BS4 | 0.563 | 0.680 | 0.506 | 0.669 |
| inscriptis | 0.557 | 0.673 | 0.483 | 0.649 |
| XPath Text | 0.550 | 0.664 | 0.510 | 0.674 |
Trafilatura with precision set to True will have even better results than the above. As you can see, BS4 is ranked quite low. Regardless of the results above, if we want to be consistent with Dolma, which is used throughout this project, we should use Trafilatura. It is a single-line adjustment to the above code: `trafilatura.extract(filecontent, favor_precision=True)`.
> A stand-alone dataset won't be very helpful. Thanks.
For clarification:
- CAP has its own group chat and fixes
- CL has its own group chat and fixes
- There are parts in which both are not clearly communicated due to "differences", I would call it
- There are pieces of processing in which I am providing input directly to both of them for improving model training
- Mike and I are improving the documentation for CL so there are parts that are more clear to the community
- There are upstreams that need to be independently made from CAP to CL
- There are numerous instances of boilerplate and malformed text which we have already post-processed out

There is a bit of jumping around when trying to communicate all this.
How variable is the markup in each example for this dataset? Part of the point of splitting work by datasets was to have someone work with the data and figure out patterns in the way a particular source formats their data so we can create a better text extractor/clean up for that data source instead of just dumping it into an off the shelf tool that tries to handle everything
The most variability between CAP and CL is currently in "xml_harvard" which contains a lot of useful information. These documents require opinionated fixes from places like here: https://github.com/freelawproject/courtlistener/blob/a3bc7b472a1736bc0cb0e92cfa096dec12960e17/cl/corpus_importer/management/commands/harvard_opinions.py#L286
Which can be seen in a merged set of json here: https://github.com/freelawproject/opinionated/tree/main/data/harvard
This needs to be upstreamed in the new current CAP dumps to CL. We are working on this.
Malformed HTML/XML can be seen here: `<span class="star-pagination" number="848" pagescheme="<span class="citation" data-id="4498419">`.
There are numerous different instances of these cases. Once the upstreamed dumps are merged then CAP said this issue should not occur. I have written individual regex to cover as many of these edge cases as I could find. There are additionally child nodes that are being removed in relation to the edge case above as well.
There are also places in which I am writing regex to standardize tags for extraction, e.g. `<jurissystem` -> `<div`. These need to be handled independently, as removing the incorrect tags first would cause issues when applying the upstream fixes between CAP and CL.
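A minimal sketch of that kind of tag standardization; the regex here is illustrative, not the actual one used:

```python
import re

xml = '<jurissystem class="opinion">Text</jurissystem>'
# hypothetical normalization: rewrite the non-standard tag (open and close) to a div
fixed = re.sub(r"<(/?)jurissystem", r"<\1div", xml)
# fixed == '<div class="opinion">Text</div>'
```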
Other than that it is cleaning out standard things like boilerplate and OCR errors which we have been annotating.
There are other things that have been fixed and additional input I am waiting on from CAP/CL before adding to the next PR.
I messaged with Colin and will work on documenting this as clearly as possible.
Additional changes will need to be made to drop the Harvard XML column and replace it with the new fixed HTML column from the updated dumps here: https://huggingface.co/datasets/conceptofmind/CAP
This will require the two datasets to be merged after appropriate column ordering from CAP.
This fixes the star pagination issue mentioned earlier (`<span class="star-pagination" number="848" pagescheme="<span class="citation" data-id="4498419">`).
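One illustrative way such a malformed attribute could be stripped with regex (a sketch, not the actual upstream fix):

```python
import re

bad = ('<span class="star-pagination" number="848" '
       'pagescheme="<span class="citation" data-id="4498419">')
# drop the malformed pagescheme attribute whose value contains an unescaped nested tag
fixed = re.sub(r'\s+pagescheme="<[^>]*', "", bad)
# fixed == '<span class="star-pagination" number="848">'
```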
It would be preferred to get this merged into the actual CL dumps, but this is pending.
The columns in CL need to be coalesced in the order:

1. html_with_citations
2. html_columbia
3. html_lawbox
4. xml_harvard
5. html_anon_2020
6. html
7. plain_text

The `plain_text` column only contains a smaller, low-quality subset of the total data. Using only the `plain_text` column misses much of the information that is available in the other columns. Not all rows contain information. To get the largest quantity of proper CL data possible without additionally scraping the `.txt` files, it is required to merge the columns in the order above. Harvard, in COLD Cases, and CAP chose to use regex to do the initial text extraction. This can be replaced with something like Trafilatura in one line.
I tried to keep this similar to what zhenlinx had. I did not remove their script.
The script:
I iterated through the `jsonl.gz` file to make sure it looked ~normal. Additional filtering and boilerplate removal will need to be added in another PR.
This leaves you with the "correct" initial CL data following guidelines by Harvard CAP and CL/Mike Lissner.
@blester125