impresso / llm-transcript-postcorrection

A repository for preliminary work on HTR/OCR/ASR post-correction based on GPT models.

data structure #1

Closed: e-maud closed this issue 1 year ago

e-maud commented 1 year ago

As discussed, here is a suggestion for a backbone structure for handling datasets (opening an issue to keep a trace of it); a small sketch of reading the converted files follows the tree.

root/
├── data/
│   ├── datasets/
│   │   ├── ocr/
│   │   │   ├── original/   <= the original data, source of the conversion; let's push only what is needed (e.g. no images, if there are any)
│   │   │   │   ├── dataset1-alias/
│   │   │   │   ├── dataset2-alias/
│   │   │   │   └── ...
│   │   │   └── converted/  <= converted data in the form of `jsonl` files
│   │   │       ├── dataset1-alias/
│   │   │       ├── dataset2-alias/
│   │   │       └── ...
│   │   ├── asr/
│   │   │   ├── original/
│   │   │   └── converted/
│   │   └── htr/
│   │       └── ...
│   ├── input/   <= same content as in `converted` for each dataset type, but organised in a way that fits the processing (?)
│   └── output/  <= I would keep this as a mass noun here (singular)
└── lib/
    └── ...
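
To make the layout concrete, here is a minimal sketch of how code under `lib/` could iterate over one converted dataset. It assumes the tree above and that each converted file is newline-delimited JSON (one record per line); the function name and the record handling are assumptions for illustration, since the actual `jsonl` schema is not specified in this issue.

```python
# Minimal sketch, not the project's actual loader: iterate over the records of
# one converted dataset, assuming data/datasets/<modality>/converted/<alias>/
# contains newline-delimited JSON files (the record schema is not given here).
import json
from pathlib import Path

DATA_ROOT = Path("data/datasets")

def iter_converted_records(modality: str, dataset_alias: str):
    """Yield one dict per non-empty line of every .jsonl file in the dataset folder."""
    dataset_dir = DATA_ROOT / modality / "converted" / dataset_alias
    for jsonl_file in sorted(dataset_dir.glob("*.jsonl")):
        with jsonl_file.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

# Example: peek at the first few records of a (hypothetical) OCR dataset alias.
if __name__ == "__main__":
    for i, record in enumerate(iter_converted_records("ocr", "dataset1-alias")):
        print(record)
        if i >= 2:
            break
```

One nice property of keeping `converted/` as flat per-dataset folders of `jsonl` files is that a loader like this stays identical across ocr/asr/htr; only the modality segment of the path changes.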