[Dataset Request] kcbert

jeongukjae / tfds-korean

A collection of Korean text datasets, ready to use with TensorFlow Datasets.

https://jeongukjae.github.io/tfds-korean/

Apache License 2.0

[Dataset Request] kcbert #20

Open jeongukjae opened 3 years ago

jeongukjae commented 3 years ago

Dataset Information

Dataset Name:
Preferred code name (e.g. korean_chatbot_qa_data): kcbert
Dataset description:
Homepage: https://github.com/Beomi/KcBERT
Citation:

Additional Context

Adding this one would make it really useful!!

jeongukjae commented 3 years ago

It turns out Kaggle data can be downloaded directly through dl_manager... this should be very easy to implement, haha.

jeongukjae commented 3 years ago

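The implementation is very easy, but parsing takes about two and a half hours... A 16-inch MacBook processes about 10,000 examples per second, and there are roughly 90 million examples in total => about 2.5 hours.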

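    # Option 1: Apache Beam pipeline. `beam` is assumed to be apache_beam,
    # e.g. obtained via `beam = tfds.core.lazy_imports.apache_beam`.
    def _generate_examples(self, path):
        return (
            beam.Create([str(path)])
            | "read_all_lines" >> beam.io.ReadAllFromText()
            # The line itself is the example key, so duplicate lines would
            # yield duplicate keys.
            | "creating_examples" >> beam.Map(lambda line: (line, {"comment": line}))
        )

    # Option 2: plain Python generator that reads the file line by line.
    def _generate_examples(self, path):
        with path.open() as f:
            for index, line in enumerate(f):
                yield f"kcbert-{index}", {"comment": line.strip()}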

Both approaches look workable, but either way they take far too long..
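
As a quick sanity check on the numbers quoted above (about 90 million examples at about 10,000 examples per second), the arithmetic can be sketched as below. The helper name `estimate_hours` and the `workers` parameter are illustrative only, and the multi-worker figure assumes throughput scales linearly with the number of workers, which real pipelines rarely achieve.

```python
def estimate_hours(num_examples: int, examples_per_second: float, workers: int = 1) -> float:
    """Ideal wall-clock hours, assuming throughput scales linearly with workers."""
    seconds = num_examples / (examples_per_second * workers)
    return seconds / 3600


# Figures taken from the comment above: ~90M examples at ~10,000 examples/s.
single = estimate_hours(90_000_000, 10_000)
# Hypothetical 8-way parallel run under the linear-scaling assumption.
eight = estimate_hours(90_000_000, 10_000, workers=8)

print(f"1 worker:  {single:.2f} h")  # 2.50 h
print(f"8 workers: {eight:.2f} h")   # 0.31 h
```

The single-worker result matches the 2.5 hours mentioned above; the 8-worker figure is only an upper bound on what parallelism could buy.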

Should I process it in parallel?
Or should I process it with Beam + Dataflow, store the result on GCS with requester pays enabled, and add a try_gcs option?

jeongukjae commented 3 years ago

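@harrydrippin This one downloads the dataset from Kaggle and processes a text file of about 12 GB. There is only a single file, and each line of the text file is one example. Having to spend about two hours on processing seems too long, so I'm thinking about how to approach it. Do you have any ideas?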
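
One possible direction, sketched here under the assumptions stated in the comment above (a single large UTF-8 text file, one example per line), is to split the file into byte ranges that start and end on line boundaries so that separate processes can parse them independently. This is not what the repository does; the function names (`split_line_aligned`, `iter_examples`, `count_range`, `count_parallel`) are hypothetical, and counting examples stands in for the real parsing and writing work.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def split_line_aligned(path, num_chunks):
    """Split a file into contiguous (start, end) byte ranges on line boundaries."""
    size = os.path.getsize(path)
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            target = size * i // num_chunks
            if target <= boundaries[-1]:
                continue  # a previous boundary already moved past this target
            f.seek(target)
            f.readline()  # advance to the start of the next full line
            pos = f.tell()
            if pos < size:
                boundaries.append(pos)
    boundaries.append(size)
    return list(zip(boundaries[:-1], boundaries[1:]))


def iter_examples(path, start, end):
    """Yield stripped lines whose first byte lies in [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            yield f.readline().decode("utf-8").strip()


def count_range(args):
    """Worker: stand-in for real parsing; counts the examples in one range."""
    path, start, end = args
    return sum(1 for _ in iter_examples(path, start, end))


def count_parallel(path, workers):
    """Process every range in its own worker process and sum the results."""
    tasks = [(path, s, e) for s, e in split_line_aligned(path, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_range, tasks))
```

In a real builder each worker would more likely write its own output shard than return data to the parent process. On platforms that spawn worker processes, `count_parallel` should be called from under an `if __name__ == "__main__":` guard.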

jeongukjae commented 3 years ago

Setting this up on my personal GCP account requires a few things, so I'll try it later, haha. For now I've only created the tfds-korean GCS bucket. I'm looking into requester pays and related settings.