impresso / llm-transcript-postcorrection

A repository for preliminary work on HTR/OCR/ASR post-correction based on GPT models.

data structure #1

Closed: e-maud closed this issue 1 year ago

e-maud commented 1 year ago

As discussed, here is a suggestion for a backbone structure for handling datasets (opening an issue to keep a trace of it); a small sketch of reading the converted files follows the tree.

root/
├── data/
│   ├── datasets/
│   │   ├── ocr/
│   │   │   ├── original/   <= the original data, source of the conversion; let's push only what is needed (e.g. no images, if there are any)
│   │   │   │   ├── dataset1-alias/
│   │   │   │   ├── dataset2-alias/
│   │   │   │   └── ...
│   │   │   └── converted/  <= converted data in the form of `jsonl` files
│   │   │       ├── dataset1-alias/
│   │   │       ├── dataset2-alias/
│   │   │       └── ...
│   │   ├── asr/
│   │   │   ├── original/
│   │   │   └── converted/
│   │   └── htr/
│   │       └── ...
│   ├── input/   <= same content as in `converted` for each dataset type, but organised in a way that fits the processing (?)
│   └── output/  <= I would keep this as a mass noun here (singular)
└── lib/
    └── ...
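
To make the layout concrete, here is a minimal sketch of how code under `lib/` could iterate over one converted dataset. It assumes the tree above and that each converted file is newline-delimited JSON (one record per line); the function name and the record handling are assumptions for illustration, since the actual `jsonl` schema is not specified in this issue.

```python
# Minimal sketch, not the project's actual loader: iterate over the records of
# one converted dataset, assuming data/datasets/<modality>/converted/<alias>/
# contains newline-delimited JSON files (the record schema is not given here).
import json
from pathlib import Path

DATA_ROOT = Path("data/datasets")

def iter_converted_records(modality: str, dataset_alias: str):
    """Yield one dict per non-empty line of every .jsonl file in the dataset folder."""
    dataset_dir = DATA_ROOT / modality / "converted" / dataset_alias
    for jsonl_file in sorted(dataset_dir.glob("*.jsonl")):
        with jsonl_file.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

# Example: peek at the first few records of a (hypothetical) OCR dataset alias.
if __name__ == "__main__":
    for i, record in enumerate(iter_converted_records("ocr", "dataset1-alias")):
        print(record)
        if i >= 2:
            break
```

One nice property of keeping `converted/` as flat per-dataset folders of `jsonl` files is that a loader like this stays identical across ocr/asr/htr; only the modality segment of the path changes.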