Open blester125 opened 1 year ago
I have all pre-2020 Congressional proceedings, but when we read over them we quickly decided that the median document was far too racist to be included in a training dataset.
@StellaAthena do we want to intentionally exclude toxic content, or to include it but with toxicity/quality scores attached? Based on prior work I wonder if it's better to let users decide, especially since there are beneficial cases to pretrain on real world toxic content (e.g. training realistic toxicity detectors).
We could use the Perspective API to list toxicity scores for each sentence.
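A minimal sketch of what per-sentence scoring with the Perspective API could look like. The endpoint and request/response fields follow the public commentanalyzer documentation; the API key handling and the choice of only the TOXICITY attribute are assumptions for illustration:

```python
import json
import urllib.request

# Public Perspective API analyze endpoint (commentanalyzer).
API_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def build_request(text: str) -> dict:
    """Build the JSON body for a single Perspective analyze call."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
        "languages": ["en"],
    }

def score_toxicity(text: str, api_key: str) -> float:
    """POST one sentence to Perspective and return its summary TOXICITY score (0-1)."""
    req = urllib.request.Request(
        f"{API_URL}?key={api_key}",
        data=json.dumps(build_request(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```

Scores could then be attached as per-sentence metadata rather than used to drop documents outright.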
Also, should we keep Cushman in the loop for this? We should talk to Peter as well.
Regarding toxicity annotations, we are not going to do that as part of the per-source preprocessing, it will be done globally to all sources.
Got you. Sounds good.
I have access to all of case law and some scripts to get the data. Can CC John Nay
@conceptofmind Hi Enrico, if you have not started to work on Pile of Law, I will take it.
@wildphoton Note that the Pile of Law is an amalgamation of different data sources. We should process those from their source rather than just import the Pile of Law.
Looking through the paper, it seems that about two thirds of the data comes from:
- CourtListener Opinions, CourtListener Docket Entries and Court Filings
- U.S. Board of Veterans’ Appeals Decisions
- Atticus Contracts
- EDGAR Contracts
- U.S. State Codes
Thanks for the note! I am starting from the first source.
Supposed to be getting into contact with Jack Cushman soon over case law
In contact with Jack Cushman about Case Law Access Project. The data will be open in March for release. I will have all the scripts and data prepared for it then.
Wrote an updated parser for USPTO.
Hey! Was about to make a PR for #9 with the BigQuery dataset. Or do we want to parse it ourselves?
I have parsed docs from the official site.
It is likely worth doing both.
CAP is done and I uploaded a post-processed sample to HF
I am working on finishing Court Listener soon with Ramid.
We transcribed oral arguments, cleaned the opinion datasets, and are looking into the other sections of POL now.
Hi @conceptofmind I also wrote a script for downloading the opinion datasets from the POL raw bulk data (just uploaded them to the branch legal/court_listener), and am now looking into other parts of Court Listener. I wonder how you did the cleaning (did you directly clean the data in Pile of Law?) and what your plan is next. Shall we have a sync on this to avoid duplicate work? Thanks!
The court audio transcripts have been processed and diarized in collaboration with Tensorlake. The additional court listener opinions were further processed as an extension of CAP and will be released soon.
I have not evaluated the other sets in Court Listener. @hamediramin was looking into those but they did not seem to contain much useful information. He additionally was investigating what sets in Pile of Law could be updated.
Gov.info has a bulk data service, which provides machine-readable versions of bills, statutes, codes, etc. as XML and JSON. Here's the documentation in a GH repo: https://github.com/usgpo/bulk-data.
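As a rough illustration of working with that bulk data, the sketch below flattens a simplified, made-up bill XML fragment into plain text using only the standard library; the real schemas are the ones documented in the usgpo/bulk-data repo:

```python
import xml.etree.ElementTree as ET

def xml_to_text(xml_string: str) -> str:
    """Flatten an XML document into whitespace-normalized plain text."""
    root = ET.fromstring(xml_string)
    pieces = [t.strip() for t in root.itertext() if t.strip()]
    return " ".join(pieces)

# Illustrative stand-in; real bill XML follows the schemas in usgpo/bulk-data.
sample = """<bill><form><legis-num>H. R. 1</legis-num></form>
<legis-body><section><text>Be it enacted...</text></section></legis-body></bill>"""
print(xml_to_text(sample))  # → H. R. 1 Be it enacted...
```

For production use, the real element names (e.g. section headings vs. body text) would need to be handled per collection rather than flattened indiscriminately.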
@craffel Going to bring this back up here.
I uploaded the full previous CL dumps to HF here last month:
- https://huggingface.co/datasets/conceptofmind/opinions-2023-12-04
- https://huggingface.co/datasets/conceptofmind/opinions-2023-08-31
- https://huggingface.co/datasets/conceptofmind/opinions-2022-10-31
- https://huggingface.co/datasets/conceptofmind/opinions-2023-07-31
- https://huggingface.co/datasets/conceptofmind/opinions-2023-05-31
- https://huggingface.co/datasets/conceptofmind/opinions-2023-04-30
- https://huggingface.co/datasets/conceptofmind/opinions-2023-03-31
- https://huggingface.co/datasets/conceptofmind/opinions-2023-02-28
- https://huggingface.co/datasets/conceptofmind/opinions-2023-01-31
- https://huggingface.co/datasets/conceptofmind/opinions-2022-12-31
- https://huggingface.co/datasets/conceptofmind/opinions-2022-11-30
- https://huggingface.co/datasets/conceptofmind/opinions-2022-09-30
- https://huggingface.co/datasets/conceptofmind/opinions-2022-08-31
- https://huggingface.co/datasets/conceptofmind/opinions-2022-08-02
These should NOT be used. They are available if you need them, but they mostly contain duplicates of the same data from the previous dump, plus other older data that should be excluded. These dumps also contain all of the columns, not just the plain text.
Only this most recent dump should be used, since each new dump includes the previous data plus a bit more: https://huggingface.co/datasets/conceptofmind/opinions_raw_recent
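If the older dumps do end up being combined for some reason, one way to avoid the duplication described above is to keep a single row per opinion, preferring the most recently modified; the `id` and `date_modified` field names below are assumptions for illustration:

```python
def latest_by_id(rows):
    """Keep one row per opinion id, preferring the most recently modified.

    Assumes each row has 'id' and 'date_modified' fields (hypothetical
    names; ISO dates compare correctly as strings)."""
    best = {}
    for row in rows:
        cur = best.get(row["id"])
        if cur is None or row["date_modified"] > cur["date_modified"]:
            best[row["id"]] = row
    return list(best.values())
```

Using only the newest dump avoids this work entirely, which is why it is the recommended path.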
Hey @conceptofmind , we won't use the data you're posting about unless the code is in this repo. Can you add the code to this repo and/or update the code in this repo?
Yes, the code will be posted and this is not the finalized data. This data is just to ensure that duplicated work is not done. Everything being done with CL and Harvard CAP is already open-sourced anyway and you can see it.
CC @wildphoton so that you do not need to run all the extraction. You will be duplicating the data 15 times. The single correct dump is already extracted and I am post-processing it.
@craffel I am going to make multiple PRs, but I want to address the issue related to duplicated data first, so I am opening that one now.
@conceptofmind Thanks for sharing the info.
We need to process the data so that it looks like clean natural text. For example, we need to try to remove markup, boilerplate/menu text from webpages, OCR gibberish, etc. The best way to determine whether and what processing we need to do is to dump out some preliminary data and take a look at it.
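As a sketch of the kind of heuristics this implies (the thresholds and rules here are illustrative, not a final pipeline):

```python
import re

def looks_like_gibberish(line: str, max_symbol_ratio: float = 0.4) -> bool:
    """Flag lines dominated by non-alphanumeric characters (common in OCR noise)."""
    stripped = line.strip()
    if not stripped:
        return False
    symbols = sum(1 for c in stripped if not (c.isalnum() or c.isspace()))
    return symbols / len(stripped) > max_symbol_ratio

def clean_text(text: str) -> str:
    """Drop markup remnants and gibberish-heavy lines, collapse repeated whitespace."""
    out = []
    for line in text.splitlines():
        line = re.sub(r"<[^>]+>", " ", line)  # strip leftover HTML tags
        if looks_like_gibberish(line):
            continue
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            out.append(line)
    return "\n".join(out)
```

Dumping a sample of documents before and after a pass like this is exactly the kind of eyeball check being suggested.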
I have already done this. I just need to upload it and add the code. The previous example of overlapping data is: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project
@conceptofmind Thanks for sharing the info.
- If each new bulk file includes all the old data, we should only use the newest one. I see that your modification PR has been merged.
- I am not sure we should post-process the data. According to previous discussions we want to present the original data as long as it is reasonable, and leave the flexibility of data processing to whoever uses the dataset. @craffel Could you confirm this?
- I also wonder what you did to "fix the rest of the text columns". I use the "plain text" column since it is the only source of the main text. Which columns do you think we should add or use instead?
The data should absolutely be post-processed. It contains OCR errors, boilerplate, and numerous other issues. This has already been done.
The plain text column is the lowest quality form of the data. There are numerous other columns that take priority over the plain text column. The Court Listener documentation states this. I have already correctly merged the columns in order of priority and done text extraction across them. This was a semi-processed example of it on a subset of CL: https://huggingface.co/datasets/conceptofmind/test_merge
This is pending upload given input from Mike Lissner and Jack Cushman.
There are fixes from Harvard CAP that still need to be upstreamed and will take a little bit to integrate.
@conceptofmind if it has already been done, please open a PR to add the code to this repo - for the purpose of this project, if the code isn't in this repo, it isn't done yet. Thanks.
I will add initial basic processing code for the stated above and make the changes to the PR.
I am waiting on final opinions from CAP/CL and will open another PR after that.
@conceptofmind Is this the documentation you mentioned? I think I overlooked the HTML-based columns, which should be cleaned. Why do you think plain_text is of the lowest quality? It actually sounds like the cleanest one if it is the opinion from a court's website as a PDF or Microsoft Word document, according to the doc, no? You can see them here; they look reasonably good. Also, I did not see that your CR did any processing on plain_text as you said; the code only combines it with cleaned HTML columns.
That is the correct documentation. It says "from best to worst", the best being html_with_citations and the worst being plain_text. The ordering of columns to use is listed there and has additionally been confirmed to me by CAP and CL:
The best approach is to choose the first of these fields that is populated,
according to the following order (from best to worst):
- html_with_citations is a special field that is populated by parsing one of the above fields for citations, generating an HTML file with hyperlinked citations. All items should eventually have this field, though it can be empty initially or if the cross-linker crashes. In general, this is the field that is used to generate pages on CourtListener and the one we recommend.
- html_columbia will be populated if we got the content from the Columbia collaboration.
- html_lawbox will be populated if we got the content from the Lawbox donation.
- xml_harvard will be populated if the source was Harvard's Caselaw Access Project. This field has a lot of data but is inferior to others due to being created by OCR instead of by humans.
- html_anon_2020 will be populated if we got the content from our anonymous source in 2020.
- html will be populated if we got the opinion from a court's website as a Word Perfect or HTML document, or if we got the opinion from Resource.org, which provides HTML documents.
- plain_text will be populated if we got the opinion from a court's website as a PDF or Microsoft Word document.
The plain text does not contain the numerous fixes and opinionated pre-processing that Harvard CAP and CL have spent time adding during collection. If only plain_text is used, it misses much of the data contained in the other columns. For these reasons and more not stated, it is typically ranked lowest. I imagine any structured government data is going to look pretty good!
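The field-priority rule quoted above can be sketched as:

```python
# CourtListener opinion text fields, best to worst, per the docs quoted above.
FIELD_PRIORITY = [
    "html_with_citations",
    "html_columbia",
    "html_lawbox",
    "xml_harvard",
    "html_anon_2020",
    "html",
    "plain_text",
]

def select_best_text(opinion: dict) -> str:
    """Return the highest-priority populated text field for one opinion row."""
    for field in FIELD_PRIORITY:
        value = opinion.get(field)
        if value:  # skip missing/empty fields
            return value
    return ""
```

The HTML/XML values returned this way still need text extraction before use, which is what the PR handles next.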
> Also, I did not see your CR did any processing on the plain_text as you said and the code only combine it with cleaned HTML columns.
I said that a different PR would need to be opened after for the additional post-processing fixes.
Quoted here:
> I am waiting on final opinions from CAP/CL and will open another PR after that.
The current PR is to get the correct ordering of the columns and the text extracted. It is best to ensure all of Brian's comments are resolved. I am finishing those first.
The next PR that will be opened will contain fixes to the structured HTML/XML as well as things such as boilerplate removal, handling erroneous OCR errors, cleaning, etc. I have been working with a team to label any instances of boilerplate that need to be removed.
There are still additional fixes that need to be upstreamed from CAP to CL and I am helping them with it now. For example, there are new HTML documents from CAP that are not yet added to CL and need to be processed with specific CL code.
I will try to add all of these fixes to the next upcoming PR based on input given to me by CAP and CL.
Thanks.
@conceptofmind Hi, I wonder if you have prepared other legal document data/code since we don't know any details yet? I can help to process the HTML in opinions data since I found a good HTML extractor that works well. Thanks! cc @blester125 @craffel
Forwarding this from the PR:
You can review the precision and F1 of different text extractors here:
| Model | Mean Precision | Mean F1 | Median Precision | Median F1 |
|---------------|----------------|---------|------------------|-----------|
| Trafilatura | 0.913 | 0.883 | 0.989 | 0.957 |
| DOM Distiller | 0.894 | 0.858 | 0.983 | 0.959 |
| Web2Text | 0.797 | 0.841 | 0.885 | 0.917 |
| Boilerpipe | 0.908 | 0.834 | 0.973 | 0.946 |
| Dragnet | 0.901 | 0.823 | 0.980 | 0.943 |
| BTE | 0.796 | 0.817 | 0.927 | 0.936 |
| Newspaper3k | 0.896 | 0.816 | 0.994 | 0.958 |
| news-please | 0.895 | 0.815 | 0.994 | 0.958 |
| Goose3 | 0.899 | 0.810 | 0.999 | 0.940 |
| BoilerNet | 0.840 | 0.798 | 0.944 | 0.895 |
| ExtractNet | 0.858 | 0.791 | 0.963 | 0.911 |
| jusText | 0.794 | 0.759 | 0.949 | 0.904 |
| lxml Cleaner | 0.615 | 0.717 | 0.670 | 0.798 |
| html_text | 0.567 | 0.683 | 0.506 | 0.667 |
| BS4 | 0.563 | 0.680 | 0.506 | 0.669 |
| inscriptis | 0.557 | 0.673 | 0.483 | 0.649 |
| XPath Text | 0.550 | 0.664 | 0.510 | 0.674 |
Trafilatura with favor_precision set to True will have even better results than the above. As you can see, BS4 is ranked quite low. Regardless of the results above, if we want to be consistent with Dolma, which is used throughout this project, we should use Trafilatura. It is a single-line adjustment to the code in the PR: trafilatura.extract(filecontent, favor_precision=True).
The current PR will use Trafilatura for handling the HTML/XML extraction.
I am not sure if anyone has worked on collecting any updates to Pile of Law. It is likely worth contacting Peter Henderson in that regard.
Domain: Legal
- [ ] Pile of Law
- [ ] Case Law Access Project
- [ ] US Congressional Documents: digitized records of Congress proceedings. See here: https://www.govinfo.gov/app/collection/cdoc/118/sdoc/all Some of the data is text and some is just PDF; from a quick look it seems like there are a decent number of tables in the PDFs (which generally don't have text versions available).