Abstract

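Deep learning methods for scene text detection typically rely on bounding box (bbox) regression and make two kinds of predictions: text/non-text classification and location regression. Regression plays a key role in obtaining the bboxes in these methods, but it is not indispensable, because text/non-text prediction can be viewed as a kind of semantic segmentation that already contains full location information. However, text instances in an image often lie very close to each other, which makes them hard to separate through semantic segmentation alone.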

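What is the difference between semantic segmentation and instance segmentation? Semantic segmentation takes the class as the basic unit of segmentation: objects that belong to the same class are shown in the same color on the predicted mask. Instance segmentation takes the individual object as the basic unit: even when two objects belong to the same class, they are shown in different colors on the predicted mask if they are different objects. The figure below shows the difference between the two clearly.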
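As a rough illustration of the distinction above, the sketch below collapses a toy instance mask into a semantic mask. The grid, the `TEXT` constant, and both helper names are hypothetical and only serve the example; they do not come from the paper.

```python
# Toy 3x6 "image" (hypothetical data): 0 is background, 1 and 2 are two
# different text instances that touch each other.
instance_mask = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 2, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

TEXT = 1  # the single foreground class used by the semantic mask


def to_semantic(mask):
    """Collapse instance ids into class ids: every text pixel becomes TEXT."""
    return [[TEXT if v != 0 else 0 for v in row] for row in mask]


def distinct_labels(mask):
    """Return the set of non-background labels present in a mask."""
    return {v for row in mask for v in row if v != 0}


semantic_mask = to_semantic(instance_mask)
```

In the semantic mask the two touching instances share one label, so the boundary between them is lost; the instance mask keeps them apart.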

Introduction

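The task is split into two stages: text detection and text recognition. Detection is also called localization. With the progress of deep learning, several methods have appeared: CTPN, TextBoxes, SegLink, and EAST. Most of them use a fully convolutional network and make at least two kinds of predictions: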

Text/non-text classification: TextBoxes, SegLink, EAST
Location regression: TextBoxes, SegLink, CTPN (offsets from reference boxes) // EAST (absolute locations)

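Text/non-text prediction is used not only as the confidence of the regression result but also as a segmentation score map. It carries location information by itself and can be used to obtain bounding boxes directly, so regression is not indispensable. As semantic segmentation, however, text/non-text prediction sometimes cannot tell neighboring text instances apart; segmentation at the instance level is the better approach. PixelLink is a DNN trained to make two kinds of predictions: text/non-text prediction and link prediction. Pixels predicted as positive are joined into Connected Components (CCs) by the links predicted as positive.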

Related Work

2.1 Semantic & Instance Segmentation

Instance segmentation is harder than semantic segmentation because it must determine not only the category of each pixel but also distinguish between instances. FCIS (Li et al. 2016) extends the idea of position-sensitive prediction in R-FCN (Dai et al. 2016). Mask R-CNN (He et al. 2017a) changes the RoIPooling in Faster R-CNN (Ren et al. 2015) to RoIAlign. These methods do detection and segmentation in the same deep model, so the segmentation result depends heavily on the detection result...

2.2 Segmentation-based Text Detection

These methods predict text/non-text, character classes, and character linking orientations ... In short, their post-processing is time-consuming and their performance is unsatisfactory.

2.3 Regression-based Text Detection

TextBoxes is a text-specific SSD. It adopts kernels of irregular shape and anchors of large aspect ratio.

3. Detecting Text via Instance Segmentation

3.2 Linking Pixels Together

Positive pixels are grouped using positive links, producing a collection of CCs, each representing a detected text instance. The linking process is implemented with a disjoint-set data structure.
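The linking step can be sketched with a small union-find, as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `link_pixels`, the dictionary layout of `link_scores`, and the 0.5 thresholds are hypothetical, and the 8-neighbourhood and the rule "connect two neighbouring positive pixels when at least one of their two link predictions is positive" reflect my reading of the paper rather than anything stated in these notes.

```python
def link_pixels(pixel_scores, link_scores, pixel_thr=0.5, link_thr=0.5):
    """Group positive pixels into connected components via positive links.

    pixel_scores: 2-D list of text scores, one per pixel.
    link_scores:  dict mapping ((r, c), (nr, nc)) to the score of the link
                  that pixel (r, c) predicts towards its neighbour (nr, nc).
    Returns a 2-D list of labels (0 = background, 1..K = components).
    """
    h, w = len(pixel_scores), len(pixel_scores[0])
    positive = {
        (r, c)
        for r in range(h)
        for c in range(w)
        if pixel_scores[r][c] >= pixel_thr
    }
    parent = {p: p for p in positive}  # disjoint-set forest

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for (r, c) in positive:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                q = (r + dr, c + dc)
                if q == (r, c) or q not in positive:
                    continue
                # Assumed rule: one positive link in either direction suffices.
                forward = link_scores.get(((r, c), q), 0.0)
                backward = link_scores.get((q, (r, c)), 0.0)
                if max(forward, backward) >= link_thr:
                    union((r, c), q)

    # Relabel roots as 1..K in row-major order of first appearance.
    labels = [[0] * w for _ in range(h)]
    ids = {}
    for p in sorted(positive):
        root = find(p)
        labels[p[0]][p[1]] = ids.setdefault(root, len(ids) + 1)
    return labels
```

With four positive pixels in a row where only the outer pairs are linked, this yields two components, which is how two adjacent text instances stay separate even though their pixels touch.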

3.3 Extraction of Bounding Boxes

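It is worth mentioning that PixelLink places no restriction on the orientation of scene text. Unlike methods based on location regression, the bounding boxes are obtained directly from the instance segmentation.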

3.4 Post Filtering after Segmentation

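Detections are filtered by width, height, area, and aspect ratio. The thresholds are computed from the training dataset so that about 99% of its text instances pass. For example, 10 is chosen as the threshold on the shorter side because about 99% of the text instances have a shorter side of at least 10.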
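One way to derive such a threshold from training statistics is sketched below. The helper names `lower_threshold` and `post_filter`, the `keep_percent` parameter, and the choice to filter on the shorter side only are assumptions made for illustration; the notes do not say how the authors computed the cutoffs.

```python
def lower_threshold(values, keep_percent=99):
    """Pick a cutoff from the data so that at least keep_percent % of
    the values are greater than or equal to it."""
    ordered = sorted(values)
    # Number of smallest values allowed to fall below the cutoff
    # (integer arithmetic avoids floating-point rounding surprises).
    drop = len(ordered) * (100 - keep_percent) // 100
    return ordered[drop]


def post_filter(boxes, min_short_side):
    """Keep (width, height) boxes whose shorter side reaches the cutoff."""
    return [box for box in boxes if min(box) >= min_short_side]


# Hypothetical shorter-side lengths of 100 training text instances:
# one outlier of 4 pixels, the rest between 10 and 108 pixels.
train_short_sides = [4] + list(range(10, 109))
threshold = lower_threshold(train_short_sides)
kept = post_filter([(50, 12), (30, 8), (9, 40)], threshold)
```

The same helper could be reused for width, height, and area; aspect ratio would probably need an upper bound instead, which this sketch does not cover.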

6. Analysis and Discussion

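Among all methods that use VGGNet as the backbone, PixelLink can be trained faster and with less data, and it also performs better.
The better performance is attributed in part to the difference in the receptive fields required.
Text detection is simpler than general object detection: it relies more on low-level texture features and less on high-level semantic features.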

7 Conclusion and Future Work

Bounding boxes of detected text are extracted directly from the segmentation result, without performing location regression. Because smaller receptive fields are required and easier tasks are to be learned, PixelLink can be trained faster and with less data.


PixelLink: Detecting Scene Text via Instance Segmentation #2