r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Legal Documents #21

Open · blester125 opened 1 year ago

blester125 commented 1 year ago

Domain: Legal

StellaAthena commented 1 year ago

I have all pre-2020 Congressional proceedings, but when we read over them we quickly decided that the median document was far too racist to be included in a training dataset.

shayne-longpre commented 12 months ago

> I have all pre-2020 Congressional proceedings, but when we read over them we quickly decided that the median document was far too racist to be included in a training dataset.

@StellaAthena do we want to intentionally exclude toxic content, or to include it but with toxicity/quality scores attached? Based on prior work I wonder if it's better to let users decide, especially since there are beneficial cases to pretrain on real world toxic content (e.g. training realistic toxicity detectors).

conceptofmind commented 11 months ago

> I have all pre-2020 Congressional proceedings, but when we read over them we quickly decided that the median document was far too racist to be included in a training dataset.

> @StellaAthena do we want to intentionally exclude toxic content, or to include it but with toxicity/quality scores attached? Based on prior work I wonder if it's better to let users decide, especially since there are beneficial cases to pretrain on real world toxic content (e.g. training realistic toxicity detectors).

We could use the Perspective API to attach toxicity scores to each sentence.
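A minimal sketch of the idea, with `score_fn` as a stand-in for a real classifier call (e.g. a Perspective API request); the sentence splitter and output schema here are illustrative, not part of any existing pipeline:

```python
import re

def score_sentences(text, score_fn):
    """Split text into rough sentences and attach a toxicity score to each.

    score_fn is a placeholder for a real toxicity scorer (e.g. a Perspective
    API call); here it is just any callable mapping a string to a float.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        {"sentence": s, "toxicity": score_fn(s)}
        for s in sentences if s
    ]
```

Keeping the scores as metadata rather than filtering outright would let downstream users set their own thresholds.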

conceptofmind commented 11 months ago

Also, should we keep Cushman in the loop for this? We should talk to Peter as well.

craffel commented 11 months ago

Regarding toxicity annotations, we are not going to do that as part of the per-source preprocessing; it will be applied globally across all sources.

conceptofmind commented 11 months ago

> Regarding toxicity annotations, we are not going to do that as part of the per-source preprocessing; it will be applied globally across all sources.

Got you. Sounds good.

conceptofmind commented 11 months ago

I have access to all of case law and some scripts to get the data. I can CC John Nay.

wildphoton commented 10 months ago

@conceptofmind Hi Enrico, if you have not started working on Pile of Law, I will take it.

StellaAthena commented 10 months ago

@wildphoton Note that the Pile of Law is an amalgamation of different data sources. We should process those from their source rather than just import the Pile of Law.

Looking through the paper, it seems that about two thirds of the data comes from:

  1. CourtListener Opinions, CourtListener Docket Entries and Court Filings
  2. U.S. Board of Veterans’ Appeals Decisions
  3. Atticus Contracts
  4. EDGAR Contracts
  5. U.S. State Codes

wildphoton commented 10 months ago

> @wildphoton Note that the Pile of Law is an amalgamation of different data sources. We should process those from their source rather than just import the Pile of Law.
>
> Looking through the paper, it seems that about two thirds of the data comes from:
>
>   1. CourtListener Opinions, CourtListener Docket Entries and Court Filings
>   2. U.S. Board of Veterans’ Appeals Decisions
>   3. Atticus Contracts
>   4. EDGAR Contracts
>   5. U.S. State Codes

Thanks for the note! I am starting from the first source.

conceptofmind commented 10 months ago

I am supposed to be getting in contact with Jack Cushman soon about case law.

conceptofmind commented 9 months ago

In contact with Jack Cushman about the Caselaw Access Project. The data will be opened for release in March. I will have all the scripts and data prepared for it then.

conceptofmind commented 9 months ago

Wrote an updated parser for USPTO.

baberabb commented 8 months ago

> Wrote an updated parser for USPTO.

Hey! Was about to make a PR for #9 with the BigQuery dataset. Or do we want to parse it ourselves?

conceptofmind commented 8 months ago

> > Wrote an updated parser for USPTO.
>
> Hey! Was about to make a PR for #9 with the BigQuery dataset. Or do we want to parse it ourselves?

I have parsed docs from the official site.

It is likely worth doing both.

conceptofmind commented 8 months ago

CAP is done, and I uploaded a post-processed sample to HF.

conceptofmind commented 8 months ago

I am working on finishing Court Listener soon with Ramid.

conceptofmind commented 8 months ago

We transcribed oral arguments, cleaned the opinion datasets, and are looking into the other sections of POL now.

wildphoton commented 8 months ago

Hi @conceptofmind, I also wrote a script for downloading the opinion datasets from the POL raw bulk data (just uploaded them to the legal/court_listener branch), and I am now looking into other parts of Court Listener. I wonder how you did the cleaning (did you directly clean the data in Pile of Law?) and what your plan is next. Shall we have a sync on this to avoid duplicate work? Thanks!

conceptofmind commented 7 months ago

> Hi @conceptofmind, I also wrote a script for downloading the opinion datasets from the POL raw bulk data (just uploaded them to the legal/court_listener branch), and I am now looking into other parts of Court Listener. I wonder how you did the cleaning (did you directly clean the data in Pile of Law?) and what your plan is next. Shall we have a sync on this to avoid duplicate work? Thanks!

The court audio transcripts have been processed and diarized in collaboration with Tensorlake. The additional court listener opinions were further processed as an extension of CAP and will be released soon.

I have not evaluated the other sets in Court Listener. @hamediramin was looking into those, but they did not seem to contain much useful information. He was additionally investigating which sets in Pile of Law could be updated.

storytracer commented 7 months ago

Gov.info has a bulk data service, which provides machine-readable versions of bills, statutes, codes, etc. as XML and JSON. Here's the documentation in a GH repo: https://github.com/usgpo/bulk-data.

conceptofmind commented 5 months ago

@craffel Going to bring this back up here.

conceptofmind commented 5 months ago

I uploaded the full previous CL dumps to HF here last month:

- https://huggingface.co/datasets/conceptofmind/opinions-2023-12-04
- https://huggingface.co/datasets/conceptofmind/opinions-2023-08-31
- https://huggingface.co/datasets/conceptofmind/opinions-2022-10-31
- https://huggingface.co/datasets/conceptofmind/opinions-2023-07-31
- https://huggingface.co/datasets/conceptofmind/opinions-2023-05-31
- https://huggingface.co/datasets/conceptofmind/opinions-2023-04-30
- https://huggingface.co/datasets/conceptofmind/opinions-2023-03-31
- https://huggingface.co/datasets/conceptofmind/opinions-2023-02-28
- https://huggingface.co/datasets/conceptofmind/opinions-2023-01-31
- https://huggingface.co/datasets/conceptofmind/opinions-2022-12-31
- https://huggingface.co/datasets/conceptofmind/opinions-2022-11-30
- https://huggingface.co/datasets/conceptofmind/opinions-2022-09-30
- https://huggingface.co/datasets/conceptofmind/opinions-2022-08-31
- https://huggingface.co/datasets/conceptofmind/opinions-2022-08-02

These should NOT be used. They contain mostly duplicates of the same data from the previous dump, along with other older data that should be excluded. These dumps contain all of the columns, not just the plain text.

Only the most recent dump should be used (each dump adds a bit more data on top of the previous one): https://huggingface.co/datasets/conceptofmind/opinions_raw_recent
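Since each dump supersedes the previous ones, the selection logic might be sketched as follows (the `id` field and the `(date, rows)` shape are assumptions for illustration, not the actual dump schema):

```python
def latest_records(dumps):
    """Collapse a sequence of (dump_date, rows) pairs into one record per id.

    Later dumps overwrite earlier ones, so only the newest version of each
    opinion survives -- mirroring "use only the most recent dump".
    """
    latest = {}
    for dump_date, rows in sorted(dumps, key=lambda d: d[0]):
        for row in rows:
            latest[row["id"]] = row
    return list(latest.values())
```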

craffel commented 5 months ago

Hey @conceptofmind , we won't use the data you're posting about unless the code is in this repo. Can you add the code to this repo and/or update the code in this repo?

conceptofmind commented 5 months ago

> Hey @conceptofmind , we won't use the data you're posting about unless the code is in this repo. Can you add the code to this repo and/or update the code in this repo?

Yes, the code will be posted and this is not the finalized data. This data is just to ensure that duplicated work is not done. Everything being done with CL and Harvard CAP is already open-sourced anyway and you can see it.

conceptofmind commented 5 months ago

CC @wildphoton so that you do not need to run all the extraction; otherwise you will be duplicating the data 15 times. The single correct dump is already extracted, and I am post-processing it.

conceptofmind commented 5 months ago

@craffel I am going to make multiple PRs, but I want to address the issue related to duplicated data first, so I am opening that one now.

wildphoton commented 5 months ago

@conceptofmind Thanks for sharing the info.

craffel commented 5 months ago

We need to process the data so that it looks like clean natural text. For example, we need to try to remove markup, boilerplate/menu text from webpages, OCR gibberish, etc. The best way to determine whether and what processing we need to do is to dump out some preliminary data and take a look at it.

conceptofmind commented 5 months ago

> We need to process the data so that it looks like clean natural text. For example, we need to try to remove markup, boilerplate/menu text from webpages, OCR gibberish, etc. The best way to determine whether and what processing we need to do is to dump out some preliminary data and take a look at it.

I have already done this. I just need to upload it and add the code. The previous example of overlapping data is: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project

conceptofmind commented 5 months ago

> @conceptofmind Thanks for sharing the info.
>
>   • If each new bulk file includes all the old data, we should only use the newest one. I see that your modification PR has been merged.
>   • I am not sure we should post-process the data. According to previous discussions, we want to present the original data as long as it is reasonable and leave the flexibility of data processing to whoever uses the dataset. @craffel Could you confirm this?
>   • I also wonder what you did to "fix the rest of the text columns". I use the "plain text" column since it is the only source of the main text. Which columns do you think we should add or use instead?

The data should absolutely be post-processed. It contains OCR errors, boilerplate, and numerous other issues. This has already been done.

The plain text column is the lowest quality form of the data. There are numerous other columns that take priority over the plain text column. The Court Listener documentation states this. I have already correctly merged the columns in order of priority and done text extraction across them. This was a semi-processed example of it on a subset of CL: https://huggingface.co/datasets/conceptofmind/test_merge

This is pending upload given input from Mike Lissner and Jack Cushman.

There are fixes from Harvard CAP that still need to be upstreamed and will take a little bit to integrate.

craffel commented 5 months ago

@conceptofmind if it has already been done, please open a PR to add the code to this repo - for the purpose of this project, if the code isn't in this repo, it isn't done yet. Thanks.

conceptofmind commented 5 months ago

> @conceptofmind if it has already been done, please open a PR to add the code to this repo - for the purpose of this project, if the code isn't in this repo, it isn't done yet. Thanks.

I will add initial basic processing code for the above and make the changes to the PR.

I am waiting on final opinions from CAP/CL and will open another PR after that.

wildphoton commented 5 months ago

> The data should absolutely be post-processed. It contains OCR errors, boilerplate, and numerous other issues. This has already been done.
>
> The plain text column is the lowest quality form of the data. There are numerous other columns that take priority over the plain text column. The Court Listener documentation states this. I have already correctly merged the columns in order of priority and done text extraction across them. This was a semi-processed example of it on a subset of CL: https://huggingface.co/datasets/conceptofmind/test_merge
>
> This is pending upload given input from Mike Lissner and Jack Cushman.
>
> There are fixes from Harvard CAP that still need to be upstreamed and will take a little bit to integrate.

@conceptofmind Is this the documentation you mentioned? I think I overlooked the HTML-based columns, which should be cleaned. Why do you think plain_text is of the lowest quality? It actually sounds like the cleanest one if it is the opinion from a court's website as a PDF or Microsoft Word document, according to the doc, no? You can see them here; they look reasonably good. Also, I did not see your PR do any processing on the plain_text as you said; the code only combines it with the cleaned HTML columns.

conceptofmind commented 5 months ago

> @conceptofmind Is this the documentation you mentioned? I think I overlooked the HTML-based columns, which should be cleaned. Why do you think plain_text is of the lowest quality? It actually sounds like the cleanest one if it is the opinion from a court's website as a PDF or Microsoft Word document, according to the doc, no? You can see them here; they look reasonably good. Also, I did not see your PR do any processing on the plain_text as you said; the code only combines it with the cleaned HTML columns.

That is the correct documentation.

It says "from best to worst", the best being html_with_citations and the worst being plain_text. The ordering of columns to use is listed there, and it has additionally been confirmed to me by CAP and CL:

> The best approach is to choose the first of these fields that is populated, according to the following order (from best to worst):
>
> - html_with_citations is a special field that is populated by parsing one of the above fields for citations, generating an HTML file with hyperlinked citations. All items should eventually have this field, though it can be empty initially or if the cross-linker crashes. In general, this is the field that is used to generate pages on CourtListener and the one we recommend.
> - html_columbia will be populated if we got the content from the Columbia collaboration.
> - html_lawbox will be populated if we got the content from the Lawbox donation.
> - xml_harvard will be populated if the source was Harvard's Caselaw Access Project. This field has a lot of data but is inferior to others due to being created by OCR instead of by humans.
> - html_anon_2020 will be populated if we got the content from our anonymous source in 2020.
> - html will be populated if we got the opinion from a court's website as a Word Perfect or HTML document, or if we got the opinion from Resource.org, which provides HTML documents.
> - plain_text will be populated if we got the opinion from a court's website as a PDF or Microsoft Word document.

The plain text does not contain the numerous fixes and opinionated pre-processing that Harvard CAP and CL have spent time adding during collection. If only plain_text is used it misses much of the data contained in the other columns. For these reasons and more not stated, it is typically ranked lowest. I imagine any structured government data is going to look pretty good!
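The field-priority rule can be sketched as a small helper; the function name and row shape below are illustrative, but the field names and their order follow the CourtListener documentation quoted above:

```python
# Priority order from the CourtListener docs, best to worst.
FIELD_PRIORITY = [
    "html_with_citations",
    "html_columbia",
    "html_lawbox",
    "xml_harvard",
    "html_anon_2020",
    "html",
    "plain_text",
]

def best_text(row):
    """Return (field_name, content) for the first populated text field.

    `row` is assumed to be a dict-like record with the CL opinion columns;
    markup in the HTML/XML fields would still need extraction afterwards.
    """
    for field in FIELD_PRIORITY:
        value = row.get(field)
        if value and value.strip():
            return field, value
    return None, ""
```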

> Also, I did not see your PR do any processing on the plain_text as you said; the code only combines it with the cleaned HTML columns.

I said that a different PR would need to be opened after for the additional post-processing fixes.

Quoted here:

> I am waiting on final opinions from CAP/CL and will open another PR after that.

The current PR is to get the correct ordering of the columns and the text extracted. It is best to ensure all of Brian's comments are resolved. I am finishing those first.

The next PR will contain fixes to the structured HTML/XML as well as things such as boilerplate removal, handling OCR errors, cleaning, etc. I have been working with a team to label any instances of boilerplate that need to be removed.

There are still additional fixes that need to be upstreamed from CAP to CL and I am helping them with it now. For example, there are new HTML documents from CAP that are not yet added to CL and need to be processed with specific CL code.

I will try to add all of these fixes to the next upcoming PR based on input given to me by CAP and CL.

Thanks.

wildphoton commented 5 months ago

@conceptofmind Hi, I wonder if you have prepared data/code for other legal documents, since we don't know any details yet? I can help process the HTML in the opinions data, since I found a good HTML extractor that works well. Thanks! cc @blester125 @craffel

> The court audio transcripts have been processed and diarized in collaboration with Tensorlake. The additional court listener opinions were further processed as an extension of CAP and will be released soon.
>
> I have not evaluated the other sets in Court Listener. @hamediramin was looking into those but they did not seem to contain much useful information. He additionally was investigating what sets in Pile of Law could be updated.

conceptofmind commented 5 months ago

> @conceptofmind Hi, I wonder if you have prepared other legal document data/code since we don't know any details yet? I can help to process the HTML in opinions data since I found a good HTML extractor that works well. Thanks! cc @blester125 @craffel

Forwarding this from the PR:

You can review the precision and F1 of different text extractors here:

| Model         | Mean Precision | Mean F1 | Median Precision | Median F1 |
|---------------|----------------|---------|------------------|-----------|
| Trafilatura   | 0.913          | 0.883   | 0.989            | 0.957     |
| DOM Distiller | 0.894          | 0.858   | 0.983            | 0.959     |
| Web2Text      | 0.797          | 0.841   | 0.885            | 0.917     |
| Boilerpipe    | 0.908          | 0.834   | 0.973            | 0.946     |
| Dragnet       | 0.901          | 0.823   | 0.980            | 0.943     |
| BTE           | 0.796          | 0.817   | 0.927            | 0.936     |
| Newspaper3k   | 0.896          | 0.816   | 0.994            | 0.958     |
| news-please   | 0.895          | 0.815   | 0.994            | 0.958     |
| Goose3        | 0.899          | 0.810   | 0.999            | 0.940     |
| BoilerNet     | 0.840          | 0.798   | 0.944            | 0.895     |
| ExtractNet    | 0.858          | 0.791   | 0.963            | 0.911     |
| jusText       | 0.794          | 0.759   | 0.949            | 0.904     |
| lxml Cleaner  | 0.615          | 0.717   | 0.670            | 0.798     |
| html_text     | 0.567          | 0.683   | 0.506            | 0.667     |
| BS4           | 0.563          | 0.680   | 0.506            | 0.669     |
| inscriptis    | 0.557          | 0.673   | 0.483            | 0.649     |
| XPath Text    | 0.550          | 0.664   | 0.510            | 0.674     |

Trafilatura with favor_precision set to True will have even better results than the above. As you can see, BS4 is ranked quite low. Regardless of the results above, if we want to be consistent with Dolma, which is used throughout this project, we should use Trafilatura. It is a single-line adjustment to the code in the PR: `trafilatura.extract(filecontent, favor_precision=True)`.

The current PR will use Trafilatura for handling the HTML/XML extraction.
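As a minimal sketch of that adjustment, a wrapper might look like the following (the function name and the naive fallback are illustrative; `trafilatura.extract` and its `favor_precision` flag are real, per the Trafilatura docs):

```python
import re

try:
    import trafilatura  # third-party; pip install trafilatura
except ImportError:
    trafilatura = None

def extract_main_text(html: str) -> str:
    """Extract main content from an HTML document.

    Uses Trafilatura with favor_precision=True when available; otherwise
    falls back to crude tag stripping (illustration only, not for production).
    """
    if trafilatura is not None:
        result = trafilatura.extract(html, favor_precision=True)
        if result:
            return result
    # Fallback: naive tag removal so the sketch stays self-contained.
    return re.sub(r"<[^>]+>", " ", html).strip()
```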

I am not sure if anyone has worked on collecting any updates to Pile of Law. It is likely worth contacting Peter Henderson in that regard.