undertheseanlp / underthesea

Underthesea - Vietnamese NLP Toolkit
http://undertheseanlp.com
GNU General Public License v3.0

Corpus CP_Vietnamese-VLC 2022 #621

Closed rain1024 closed 1 year ago

rain1024 commented 1 year ago

Description

The "Corpus CP_Vietnamese-VLC 2022" project aims to create a corpus of Vietnamese legal texts, with a focus on commercial law. The corpus will draw on a wide range of sources, including official government websites, legal databases, and academic journals. Text will be extracted from these sources, preprocessed, and annotated with metadata such as the title, date, and source of each document. The corpus will be stored in a database and made available to researchers and other interested parties through a website or API. The project will also analyze the corpus using natural language processing and text mining techniques, and publish findings and recommendations based on that analysis. The corpus will be reviewed and updated regularly so that it stays accurate and current. The goal is to provide a comprehensive, useful resource for researchers and practitioners in the field of Vietnamese commercial law.

Plan

  1. Research Vietnamese law sources
    • [x] Identify official government websites
    • [x] Search for legal databases and academic journals
    • [x] Review relevant literature and previous studies

  1. Identify the scope of the corpus

    • [x] Identify the language or languages of the corpus: Vietnamese (possibly English as well)
    • [x] Determine the genre or types of texts to include: Law documents
    • [x] Specify the time period of the texts: 2022
    • [x] Determine the size of the corpus: ?
  2. Extract and preprocess text

    • [x] Use web scrapers or text extractors to obtain text from sources
    • [x] Remove formatting and structural elements that are not relevant to the corpus
  3. Annotate text

    • [x] Add metadata such as the title, date, and source of each document

Note: in this version of the corpus, I have not added any metadata.
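
The extraction and cleanup steps above can be sketched roughly as follows. This is only an illustration using the Python standard library, not the code used to build the corpus: the names `TextExtractor`, `html_to_text`, and `make_record` are hypothetical, and the `title`, `date`, and `source` fields simply mirror the metadata named in the plan. It assumes the source pages are HTML, strips markup plus `<script>`/`<style>` content, normalizes the text to Unicode NFC (useful for Vietnamese, where the same accented letter can be encoded in more than one way), and collapses whitespace.

```python
import re
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Strip markup, normalize to NFC, and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFC", " ".join(parser.chunks))
    return re.sub(r"\s+", " ", text).strip()


def make_record(html, title=None, date=None, source=None):
    """Bundle cleaned text with optional metadata (hypothetical field names)."""
    return {"title": title, "date": date, "source": source,
            "text": html_to_text(html)}
```

Leaving the metadata arguments optional keeps the sketch compatible with a release that ships text only.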

  1. Create database

    • [x] Use a database management system to store and organize the text and metadata for the corpus
  2. Make corpus available

    • [x] Consider publishing the corpus online or distributing it to researchers and other interested parties
    • [x] Create a website or API, or make the corpus available for download in a standard format
  3. Maintain and update corpus

    • [ ] Periodically review and update the corpus to ensure that it remains accurate and up-to-date
    • [ ] Add new documents, update existing ones, and remove any that are no longer relevant
  4. Analyze and explore the corpus

    • [ ] Use natural language processing and text mining techniques to analyze and understand the content of the corpus
    • [ ] Explore trends and patterns in the data, such as the most common topics or the most frequently cited laws
  5. Publish findings and make recommendations

    • [ ] Write a report or paper outlining the findings of your analysis and any recommendations for further research or action
    • [ ] Share your results with the legal community, policymakers, and other stakeholders who may be interested in the corpus
  6. Maintain and update the corpus

    • [ ] Continually review and update the corpus as new laws are enacted or existing ones are amended
    • [ ] Consider adding additional resources or expanding the scope of the corpus to include other areas of law or related disciplines
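
As a rough illustration of the storage, distribution, and analysis steps in the plan, here is a minimal sketch using SQLite and JSON Lines from the Python standard library. The issue does not say which database or export format was chosen, so the `documents` table, its columns, and the function names below are all assumptions made for this example. The frequency count splits on whitespace, which for Vietnamese yields syllables rather than words; a real analysis would likely run a word segmenter first.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical schema: one row per document, metadata columns optional.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id     INTEGER PRIMARY KEY,
    title  TEXT,
    date   TEXT,
    source TEXT,
    text   TEXT NOT NULL
)
"""
COLUMNS = ("id", "title", "date", "source", "text")


def open_corpus(path=":memory:"):
    """Open (or create) the corpus database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_document(conn, text, title=None, date=None, source=None):
    """Insert one document and return its row id."""
    cur = conn.execute(
        "INSERT INTO documents (title, date, source, text) VALUES (?, ?, ?, ?)",
        (title, date, source, text),
    )
    conn.commit()
    return cur.lastrowid


def export_jsonl(conn):
    """Serialize the corpus as JSON Lines, one document per line."""
    rows = conn.execute(
        "SELECT id, title, date, source, text FROM documents ORDER BY id"
    )
    return "\n".join(
        json.dumps(dict(zip(COLUMNS, row)), ensure_ascii=False) for row in rows
    )


def syllable_counts(conn):
    """Count lowercase whitespace-separated tokens across all documents."""
    counts = Counter()
    for (text,) in conn.execute("SELECT text FROM documents"):
        counts.update(text.lower().split())
    return counts
```

Passing a file path to `open_corpus` instead of the default `":memory:"` would persist the corpus on disk, and the JSON Lines output can be written to a file for download.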
