Can the built-in model be replaced with a custom model?

lhj5426 commented 1 month ago

I thought I could use my custom model, trained specifically for Japanese comics, by changing the model name, as shown in the picture.

But when I ran the software again, it re-downloaded the original model under the name it had before the change, overwriting my renamed custom model. Could support for custom models be added? This would let me easily use and test the models I have trained myself.

https://github.com/xulihang/ImageTrans-docs/issues/711 I trained a YOLOv8 model specifically for text recognition in Japanese full-color CG comics.

I am also training a new model specifically for Japanese black-and-white and full-color comics. If custom models could be used, it would be very convenient.

lhj5426 commented 1 month ago

Developer, the model you trained was not trained specifically on Japanese comics, so it performs poorly at text localization in Japanese manga.

I enjoy reading comics and have trained my models specifically on images from exhentai.

ogkalu2 commented 1 month ago

Are your custom models YOLO models? How many classes does your bubble detector have?

Detection combines the results from the bubble detector and text segmenter. Anything not detected by both gets filtered out.
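As a rough illustration of the "detected by both" rule described above, here is a minimal Python sketch. It assumes both stages produce axis-aligned boxes (the real text segmenter may well output masks instead), and the names `keep_confirmed` and `min_overlap` are hypothetical, not taken from the app's code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two axis-aligned boxes (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def keep_confirmed(detector_boxes: List[Box],
                   text_boxes: List[Box],
                   min_overlap: float = 0.5) -> List[Box]:
    """Keep detector boxes that cover at least `min_overlap` of some text box.

    Detector boxes with no matching text box are dropped, and text boxes
    that no detector box covers never appear in the output.
    """
    kept = []
    for d in detector_boxes:
        for t in text_boxes:
            t_area = (t[2] - t[0]) * (t[3] - t[1])
            if t_area > 0 and intersection_area(d, t) / t_area >= min_overlap:
                kept.append(d)
                break  # one confirming text box is enough
    return kept
```

With a rule like this, text that only one of the two stages finds is lost, which is consistent with the missed detections discussed in this thread.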

The app has issues with text outside bubbles, but that is not due to a lack of training on manga (most of the dataset is manga). It is due to a lack of consistent annotation for text outside bubbles in the bubble detector's dataset. The bubble detector has 2 classes: text_free and text_bubble.

lhj5426 commented 1 month ago

Of course. I only started learning how to train models with YOLOv8 online in the last two months. I trained from the yolov8x.pt and yolov8s.pt models on a 3090 GPU.

nc: 6 # Number of classes
names: ['balloon','qipao','fangkuai','changfangtiao','kuangwai','other']

I set up 6 classes and trained at the default 640x640 resolution. However, I only used 3 classes in practice: balloon, qipao, and changfangtiao.

I labeled the data manually with this open-source program, day and night, often spending 16 hours a day on it: https://github.com/CVHub520/X-AnyLabeling

I have somewhat achieved filtering of background onomatopoeia in scenes of two people in bed in Japanese manga. Background onomatopoeia is used to enhance the atmosphere and does not need to be recognized or translated. This is why I started training my own model, and I have reached an acceptable level of filtering.

As a beginner, I have not reached 100%, but compared with other similar programs my results are already quite good. I mainly use https://github.com/CVHub520/X-AnyLabeling to view inference predictions from the ONNX model.
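For anyone consuming this model's output, the sketch below shows one way to map predicted class indices back to the names listed above and keep only the three classes used in practice. The prediction tuple layout `(class_id, confidence, box)` and the function name `filter_by_class` are assumptions made for illustration; they are not part of the app or of X-AnyLabeling.

```python
# Class names in the order given in the dataset config above.
NAMES = ['balloon', 'qipao', 'fangkuai', 'changfangtiao', 'kuangwai', 'other']

# The three classes the author reports using in practice.
USED = {'balloon', 'qipao', 'changfangtiao'}


def filter_by_class(predictions, names=NAMES, used=USED):
    """Keep predictions whose class name is in `used`.

    Each prediction is assumed to be a tuple (class_id, confidence, box).
    """
    return [p for p in predictions if names[p[0]] in used]


preds = [
    (0, 0.91, (10, 10, 60, 200)),    # balloon
    (4, 0.55, (300, 20, 340, 90)),   # kuangwai, not used in practice
    (3, 0.78, (80, 400, 400, 440)),  # changfangtiao
]
kept = filter_by_class(preds)
```

Because onomatopoeia was left unlabeled rather than assigned its own class, the filtering described in this comment comes from the training data, not from a step like this; the snippet only handles the unused classes.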

lhj5426 commented 1 month ago

This is the inference result for a Japanese manga scene with two people, after converting your comic-speech-bubble-detector model to ONNX. As you can see, many background onomatopoeic words have been detected.

This image shows how my own trained model filters onomatopoeia in Japanese manga scenes with two people. As you can see, compared with your model, mine is trained in a more targeted way for Japanese text. That is why I asked whether custom models can be used.

lhj5426 commented 1 month ago

When I use the developer's software directly, some of the onomatopoeia still gets recognized.

lhj5426 commented 1 month ago

And there are also cases of missed detections, as shown in the image.

lhj5426 commented 1 month ago

And the results for this type of content are even less ideal. This is the download link for the model I trained from yolov8s.pt and converted to ONNX format, specifically for full-color vertical-text manga. Could this type of model be supported? Since GitHub does not support files larger than 10 MB, I used OneDrive for hosting.

https://alumnialbany-my.sharepoint.com/:f:/g/personal/planetrav_alumni_albany_edu/EnnR5gQvo2lMn162JRJJtrgBw1O0PWOG53NR26mSl2gNsg?e=3kyICe

lhj5426 commented 1 month ago

Alternatively, could the developer optimize the current model for recognizing Japanese manga text?

lhj5426 commented 1 month ago

Even images like these are not recognized completely.

ogkalu2 commented 1 month ago

Since they're YOLO models, it's possible to switch, but I have to make some changes to the code. I have some questions:

When there is no bubble but text exists, what class is the text predicted under? Does this class predict only this kind of text?
How are you filtering background onomatopoeia? Does it get predicted by a class? If so, which class?
How many images did you train your model with?
Which model is better, medium or small?

lhj5426 commented 1 month ago

Nothing like this is recognized. I did not use any tag for it, as shown in the video.

https://github.com/user-attachments/assets/147c9f9a-468f-4874-be08-ca58142eb55d

How do I exclude onomatopoeic words? When labeling manually, I do not tag them at all in images like this; I tag only the text that is needed. After tagging, I train iteratively, over and over, until the onomatopoeic words are filtered out. For example, in an image like this, there is no dialogue tagged for Xu Xiyuan at all. It is the same as for question one: where nothing should be recognized, no tag is written.

The yolov8s.pt model, the one placed on OneDrive, was trained on more than 8,400 images (over 10 GB).

The yolov8x.pt model, which is still in training, uses more than 24,000 images (27 GB).

Regarding the recognition of text positions in comics, I think the S model and the X model perform similarly. The only difference is the training speed and duration. I rent GPU services online for training, which costs 35-68 RMB per day, because I don't own a GPU like the 3090 or 4090.

Currently, training the yolov8x.pt model on over 24,000 images with a 3090 takes from several dozen minutes to an hour per epoch. In a day, I can train around 20 epochs or fewer. I have already trained for more than 150 hours, completing 155 epochs. Since my pocket money is exhausted, I am waiting to continue training next time.

lhj5426 commented 1 month ago

For the pure long strips and those with squares, I use the 'balloon' tag. For speech bubbles, I use the 'qipao' tag. For internal monologue descriptions with jagged bubbles, I also use the 'balloon' tag. Horizontal text is tagged with 'changfangtiao'. I mainly use these three tags to label the images, and I exclude onomatopoeic words during the labeling process.

ogkalu2 commented 1 month ago

Can you upload your difficult images to OneDrive or wherever? I want to do some testing.

ogkalu2 commented 1 month ago

Would it also be possible to share your 8400+ image dataset with me? Also, I need the non-ONNX version of your model.

lhj5426 commented 1 month ago

https://alumnialbany-my.sharepoint.com/:f:/g/personal/planetrav_alumni_albany_edu/Eqj-2jSi979LlJ65sV5wUaIBUvHVon7N5KRi6VWcHM6fJQ?e=Lq11AS

The images, annotation files, and the PT model are all at this link.

lhj5426 commented 1 month ago

All the images are from adult comics. Please be advised: open them with care, preferably when no one is around.

ogkalu2 commented 1 month ago

I have something preliminary working. Can you send a few raw images (i.e., without any annotations) that are difficult for my model so I can test?

lhj5426 commented 1 month ago

Sure, but since the overall results were not satisfactory, I've selected a representative manga where your model didn't perform well. Here is the link for you to download it yourself.

https://e-hentai.org/g/2966365/494a78fbe7

https://hitomi.la/cg/%E3%82%A2%E3%83%80%E3%83%AB%E3%83%88dvd%E3%82%B3%E3%83%BC%E3%83%8A%E3%83%BC%E3%82%92%E8%A6%8B%E3%81%A6%E3%81%9F%E3%82%8D%E3%82%8A%E3%81%A3%E5%A8%98%E3%81%AB%E3%82%A4%E3%82%BF%E3%82%BA%E3%83%A9-%E6%97%A5%E6%9C%AC%E8%AA%9E-2986275.html#1

The text recognition in this manga series has a lot of missed detections. Let's use this as a representative example. 異世界エルフ発情の魔眼 https://hitomi.la/search.html?%E7%95%B0%E4%B8%96%E7%95%8C%E3%82%A8%E3%83%AB%E3%83%95%E7%99%BA%E6%83%85%E3%81%AE%E9%AD%94%E7%9C%BC

ogkalu2 commented 1 month ago

I have pushed changes that merge the results of our models. Just download and run the update; you don't need to do anything else. Since it merges results, some onomatopoeic words will register. Even for manga alone, merging results is, in my opinion, currently much better than using your model by itself. Your model tends to predict text outside bubbles as single lines, which makes rendering much harder (the box is too small). When results are merged, multiple lines are often grouped into one box. For example: Screenshot (264) Screenshot (265)
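To make the grouping effect concrete, here is a minimal Python sketch of one possible way to merge boxes from two detectors so that adjacent single-line boxes collapse into one larger box. This is not the app's actual merging code, which is not shown in this thread; the function names and the `gap` tolerance are hypothetical.

```python
def boxes_touch(a, b, gap=0.0):
    """True if boxes (x1, y1, x2, y2) overlap or lie within `gap` pixels."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2] and
            a[1] - gap <= b[3] and b[1] - gap <= a[3])


def union(a, b):
    """Smallest box that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def merge_boxes(boxes_a, boxes_b, gap=5.0):
    """Pool boxes from both models and fuse any that touch, until stable."""
    merged = list(boxes_a) + list(boxes_b)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if boxes_touch(merged[i], merged[j], gap):
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after every fusion
    return merged


# Two adjacent single-column boxes from one model, plus a wider box and an
# unrelated distant box from the other model.
single_lines = [(10, 0, 20, 100), (22, 0, 32, 100)]
other_model = [(8, 0, 34, 100), (200, 200, 220, 260)]
result = merge_boxes(single_lines, other_model)
```

In this example the two single-column boxes and the wider box fuse into one region, while the distant box is left alone. A larger `gap` groups more aggressively, at the risk of joining text that belongs to separate bubbles.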

lhj5426 commented 1 month ago

Haha, thanks! My model is specifically trained for another software, designed to handle single strips based on the software's features. The software itself has a merging function, and processing single strips makes OCR faster. Regardless, I want to express my gratitude! I'll go update it and give it a try.