As discussed, here is a suggestion for a backbone for handling datasets (opening an issue to keep a trace of that).
The distinction (a bit arbitrary) is between `ocr`, `asr`, and perhaps later `htr` datasets.
The proposed organisation is similar to the one used in hipe (since we need to convert...):

    root/
    ├── data/
    │   ├── datasets/
    │   │   ├── ocr/
    │   │   │   ├── original/     <= the original data, source of conversion. Let's push only what's needed (e.g. no images, if there are any).
    │   │   │   │   ├── dataset1-alias/
    │   │   │   │   ├── dataset2-alias/
    │   │   │   │   └── ...
    │   │   │   └── converted/    <= converted data in the form of `jsonl` files
    │   │   │       ├── dataset1-alias/
    │   │   │       ├── dataset2-alias/
    │   │   │       └── ...
    │   │   ├── asr/
    │   │   │   ├── original/
    │   │   │   └── converted/
    │   │   └── htr/
    │   │       ....
    │   ├── input/    <= here I would see the same content as in `converted` for each dataset type, but organised in a way that fits the processing (?)
    │   └── output/   <= I would keep it as a mass noun here (singular)
    └── lib/
        ....

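To make the layout concrete, here is a minimal sketch of helpers that build the per-dataset paths and create the `original/` and `converted/` folders. The function names (`dataset_dir`, `scaffold`) are hypothetical, and the sketch assumes that `datasets/` sits under `data/` as in the tree above; adjust if the nesting ends up different.

```python
from pathlib import Path

# Dataset types from the proposal; `htr` is only a possible later addition.
DATASET_TYPES = ("ocr", "asr", "htr")
STAGES = ("original", "converted")


def dataset_dir(root, dataset_type, stage, alias):
    """Return root/data/datasets/<type>/<stage>/<alias> (assumed nesting)."""
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"unknown dataset type: {dataset_type!r}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return Path(root) / "data" / "datasets" / dataset_type / stage / alias


def scaffold(root, dataset_type, alias):
    """Create the original/ and converted/ folders for one dataset alias."""
    created = []
    for stage in STAGES:
        directory = dataset_dir(root, dataset_type, stage, alias)
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created
```

For example, `scaffold(".", "ocr", "dataset1-alias")` would create both folders for a new OCR dataset without touching anything that already exists.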
The ICDAR dataset has already been moved in.
Should we use a `Makefile` to handle the conversion process? Since there might be a lot of datasets to convert, this could be handy, and it would also document the steps for later.
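As a rough idea of what a `Makefile` rule could call for each dataset, here is a sketch of a conversion step that writes `jsonl` and skips the work when the converted file is already newer than its source (the same check `make` does on timestamps). The names `is_stale`, `convert` and `parse_line` are hypothetical, and the one-record-per-line input is only an assumption for illustration; real datasets will need their own parsers.

```python
import json
from pathlib import Path


def is_stale(source, target):
    """True if target is missing or older than source (make-style check)."""
    source, target = Path(source), Path(target)
    return (not target.exists()) or target.stat().st_mtime < source.stat().st_mtime


def convert(source, target, parse_line):
    """Convert a line-based source file to jsonl; return False if up to date.

    parse_line is a dataset-specific callable (hypothetical) that turns one
    non-empty input line into a JSON-serialisable dict.
    """
    source, target = Path(source), Path(target)
    if not is_stale(source, target):
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    with source.open(encoding="utf-8") as src, target.open("w", encoding="utf-8") as out:
        for line in src:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                out.write(json.dumps(parse_line(line), ensure_ascii=False) + "\n")
    return True
```

A rule per dataset alias could then invoke a small script built on `convert`, so that running the conversions again only redoes what has changed.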