VtlNmnk closed this issue 2 years ago.
@VtlNmnk I used the free version of Label Studio for annotation. The dump from Label Studio is very easy to convert to SDMG-R format.
Use the JSON-min export from Label Studio.
Hi Amit, could you please help me with the steps to convert to SDMG-R format? I am using my own custom document dataset and want to annotate the required key-value pairs and convert them to the format supported by SDMG-R.
@pushpalatha1405 we discussed this over email. If you feel comfortable, you can close the issue
@amitbcp, if you posted here, for everyone, what to do with the data exported from the annotation tool, similar questions would not come up in the future. As for me, I haven't finished labeling my data yet, so I haven't tried converting it.
OK Amit, I am closing the issue.
@VtlNmnk once the data is dumped in JSON-min format from Label Studio, a simple script can convert it to SDMGR format. The Label Studio dump has all the information needed for the conversion.
The most important thing is to configure the labelling interface in Label Studio. A sample configuration that works well for conversion to SDMGR is:
<View>
<Image name="image" value="$ocr" zoomControl="true" rotateControl="true" zoom="true"/>
<RectangleLabels name="label" toName="image" strokeWidth="2">
<Label value="class_1" background="Aqua"/>
<Label value="class_2" background="#D4380D"/>
</RectangleLabels>
<View visibleWhen="region-selected" style="width: 100%; display: block">
<Header value="Write transcription:"/>
<TextArea name="transcription" toName="image" editable="true" perRegion="true" required="true" maxSubmissions="1" rows="5" strokeWidth="2"/>
</View>
</View>
Here we can annotate the OCR text and specify the labels. To add more classes, just repeat the <Label value="..."/> element
as needed for your dataset.
@amitbcp could you please share one labelled object from Label Studio (the JSON for any single annotated field) and its equivalent in SDMGR format? That would be really helpful.
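The conversion script later in this thread looks up a `class_dict` that maps label names to integer indices, but the thread never shows how it was built. As a minimal sketch, assuming you want indices assigned in the order the labels appear in the labelling config, it could be derived from the config itself. The helper name `build_class_dict` and the index order are assumptions for illustration; if you are extending an existing dataset, the indices must match that dataset's own class list instead.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample labelling config from this thread.
LABEL_CONFIG = """
<View>
  <Image name="image" value="$ocr"/>
  <RectangleLabels name="label" toName="image">
    <Label value="class_1" background="Aqua"/>
    <Label value="class_2" background="#D4380D"/>
  </RectangleLabels>
</View>
"""


def build_class_dict(config_xml):
    """Map each <Label value="..."> to an integer index, in document order.

    Hypothetical helper: the index assignment is an assumption, not
    something the thread specifies.
    """
    root = ET.fromstring(config_xml)
    names = [el.get("value") for el in root.iter("Label")]
    return {name: idx for idx, name in enumerate(names)}


class_dict = build_class_dict(LABEL_CONFIG)
print(class_dict)  # {'class_1': 0, 'class_2': 1}
```

Deriving the mapping from the config keeps the annotation tool and the converter in sync when classes are added.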
yes, I figured out how to do this part. I use these settings to supplement the "wild receipts" dataset.
<View>
<Image name="image" value="$ocr" zoom="true"/>
<Labels name="label" toName="image">
<Label value="Ignore" background="#FFA39E"/>
<Label value="Store_name_value" background="#D4380D"/>
<Label value="Store_name_key" background="#FFC069"/>
<Label value="Store_addr_value" background="#AD8B00"/>
<Label value="Store_addr_key" background="#D3F261"/>
<Label value="Tel_value" background="#389E0D"/>
<Label value="Tel_key" background="#5CDBD3"/>
<Label value="Date_value" background="#096DD9"/>
<Label value="Date_key" background="#ADC6FF"/>
<Label value="Time_value" background="#9254DE"/>
<Label value="Time_key" background="#F759AB"/>
<Label value="Prod_item_value" background="#FFA39E"/>
<Label value="Prod_item_key" background="#D4380D"/>
<Label value="Prod_quantity_value" background="#FFC069"/>
<Label value="Prod_quantity_key" background="#AD8B00"/>
<Label value="Prod_price_value" background="#D3F261"/>
<Label value="Prod_price_key" background="#389E0D"/>
<Label value="Subtotal_value" background="#5CDBD3"/>
<Label value="Subtotal_key" background="#096DD9"/>
<Label value="Tax_value" background="#ADC6FF"/>
<Label value="Tax_key" background="#9254DE"/>
<Label value="Tips_value" background="#F759AB"/>
<Label value="Tips_key" background="#FFA39E"/>
<Label value="Total_value" background="#D4380D"/>
<Label value="Total_key" background="#FFC069"/>
<Label value="Others" background="#AD8B00"/>
</Labels>
<Rectangle name="bbox" toName="image" strokeWidth="3"/>
<Polygon name="poly" toName="image" strokeWidth="3"/>
<TextArea name="transcription" toName="image" editable="true" perRegion="true" required="true" maxSubmissions="1" rows="5" placeholder="Recognized Text" displayMode="region-list"/>
</View>
But it is not yet clear to me which script to use for conversion from Json-min to SDMG-R format. It must be one of these converters, right?
@VtlNmnk no, the script is a custom one and is not provided here. Let me share it here tomorrow.
@VtlNmnk Since Label Studio stores coordinates on a different scale than SDMGR (as percentages of the image size rather than pixels), we need to convert those too. Here is the code I used:
import json

# class_dict maps each label name to an integer index. It was not included in
# the original post; the entries below are placeholders for your own classes.
class_dict = {"class_1": 0, "class_2": 1}

with open("manual_ls.json", "r") as f:  # Label Studio JSON-min dump
    ls = json.load(f)

global_tags = []
for dl in ls:  # one entry per annotated image
    filename = "./annotate_dl/" + dl["ocr"].split("=")[-1]
    labels = dl['label']
    transcriptions = dl['transcription']
    annotations = []
    for label, text in zip(labels, transcriptions):
        tag = label['rectanglelabels'][0]
        ocr = text['text'][0]
        ocr = ocr.replace(" ", "").lower()  # strip spaces and lowercase the text
        original_width = label['original_width']
        original_height = label['original_height']
        x, y = label['x'], label['y']
        width, height = label['width'], label['height']
        # Label Studio stores x, y, width, height as percentages; convert to pixels
        x0 = (x * original_width) / 100
        y0 = (y * original_height) / 100
        w = (width * original_width) / 100
        h = (height * original_height) / 100
        x1 = x0 + w
        y1 = y0 + h
        # four corners, clockwise from top-left
        box = [x0, y0, x1, y0, x1, y1, x0, y1]
        annt_dict = {'box': box, 'text': ocr, 'label': class_dict[tag]}  # converting label to int index
        annotations.append(annt_dict)
        global_tags.append((class_dict[tag], tag))  # to calculate statistics later
    manual_dl = {"file_name": filename, "height": original_height, "width": original_width, "annotations": annotations}
    # append one JSON object per image, one per line
    with open('./annotate_dl/manual_ls_exp_syn_v1.txt', 'a') as convert_file:
        convert_file.write(json.dumps(manual_dl))
        convert_file.write("\n")
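To make the mapping concrete, here is a small worked example for a single region. The record below is hypothetical: its field names are inferred from the keys the script above reads (`ocr`, `label`, `transcription`, `rectanglelabels`, `original_width`, and so on), and the values, file path, `class_dict`, and the helper name `to_pixel_box` are made up for illustration. Rotation is ignored, as it is in the script above.

```python
import json


def to_pixel_box(region):
    """Convert one Label Studio rectangle (percent units) to an 8-value pixel box."""
    ow, oh = region["original_width"], region["original_height"]
    x0 = region["x"] * ow / 100
    y0 = region["y"] * oh / 100
    x1 = x0 + region["width"] * ow / 100
    y1 = y0 + region["height"] * oh / 100
    # four corners, clockwise from top-left
    return [x0, y0, x1, y0, x1, y1, x0, y1]


# Hypothetical JSON-min record with a single annotated region.
sample = {
    "ocr": "/data/upload/receipt_001.jpg",
    "label": [{
        "x": 10.0, "y": 20.0, "width": 30.0, "height": 5.0,
        "original_width": 1000, "original_height": 2000,
        "rectanglelabels": ["class_1"],
    }],
    "transcription": [{"text": ["TOTAL"]}],
}
class_dict = {"class_1": 0}  # placeholder label-to-index mapping

region = sample["label"][0]
line = {
    "file_name": sample["ocr"],
    "height": region["original_height"],
    "width": region["original_width"],
    "annotations": [{
        "box": to_pixel_box(region),
        "text": sample["transcription"][0]["text"][0],
        "label": class_dict[region["rectanglelabels"][0]],
    }],
}
print(json.dumps(line))
# box is [100.0, 400.0, 400.0, 400.0, 400.0, 500.0, 100.0, 500.0]
```

Here a region at 10% / 20% of a 1000 x 2000 image, spanning 30% x 5%, becomes the pixel rectangle from (100, 400) to (400, 500).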
Thanks amith for the script.
@pushpalatha1405 @VtlNmnk did it work for you ? In that case we can close it ?
Thanks very much, Amit, for the initial script you sent for creating the annotation format accepted by the MMOCR model. With it I was able to create my custom dataset and custom model and make full use of MMOCR in our project.
You can close the issue.
Hi @amitbcp @VtlNmnk @pushpalatha1405, thanks for the great discussion! Would you summarize this discussion into a tutorial? We are planning a tutorial section in our documentation just similar to what MMDetection did to improve developer experience. If you would make a PR, your contribution can be acknowledged and help more people. :)
Sure, Thong, I will make time and contribute a PR.
@pushpalatha1405 Thanks! Just create a file named make_dataset.md under docs/ in the PR and we will help you organize it.
Yes, Label Studio and the script are working. We can close the issue.
Hi Thong, I have created a file named make_dataset.md (and opened the PR as well). I still need to add the contents. Do you have specific topics I should cover and organize in the file? I will add the contents by the end of this week.
Regards, Pushpalatha M
Why was this not included in a pull request?
Hey! I really appreciate your excellent work! I want to add some of my own examples of annotated receipts to the wildreceipt dataset to train the model with my dataset. Is there any annotation tool available? Or is there a converter from other formats?