dataset preparation for training & test

hayleyshim commented 4 years ago

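Goal: Prepare a dataset for model training and measure a score on a test dataset
Deliverable (specific): Trim the KITTI dataset from its current size of roughly 30 GB down to a 500 MB to 1 GB subset
Due date: March 5 (Thursday)
Reviewers: preprocessor owners (Deokho, Yura)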
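
One possible way to trim the dataset to a size budget, sketched under assumptions rather than as the agreed method: shuffle the frame ids with a fixed seed and keep every frame that still fits in the budget. The function name `pick_subset`, the per-frame sizes, and the budget value below are hypothetical stand-ins; real sizes would come from summing each frame's files on disk.

```python
import random


def pick_subset(frame_sizes, budget_bytes, seed=0):
    """Randomly pick frame ids whose combined size stays within budget_bytes.

    frame_sizes maps a frame id (e.g. "000123") to its total size in bytes.
    """
    frame_ids = sorted(frame_sizes)
    random.Random(seed).shuffle(frame_ids)  # fixed seed keeps the subset reproducible
    chosen, used = [], 0
    for frame_id in frame_ids:
        size = frame_sizes[frame_id]
        if used + size > budget_bytes:
            continue  # this frame does not fit; a smaller one later might
        chosen.append(frame_id)
        used += size
    return sorted(chosen), used


# Toy sizes standing in for real per-frame totals (hypothetical values).
sizes = {"%06d" % i: 4_000_000 for i in range(10)}
subset, total = pick_subset(sizes, budget_bytes=25_000_000)
print(len(subset), total)
```

Sampling at random instead of taking the first N files avoids keeping only one contiguous stretch of the recordings, assuming neighbouring frame ids are similar scenes.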

jwkanggist commented 4 years ago

Please think concretely about how the data should be split, and discuss with the others where the train / valid / test data each enter the pipeline.

DownyBehind commented 4 years ago

train data: the training dataset to be used on the laptop that Yura will set up for model training
test data: the dataset that will be the final nnstreamer input [gstreamer input]

└── data/KITTI/object
    ├── training    <-- 7481 train data
    |   ├── image_2 <-- for visualization
    |   ├── calib
    |   ├── label_2
    |   ├── velodyne
    └── testing     <-- 7518 test data
        ├── image_2 <-- for visualization
        ├── calib
        ├── velodyne

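The structure above is the KITTI dataset, and as far as I know its total size is over 30 GB. It already comes divided into a training set and a testing set, but I'm not sure whether we need to do a separate data split... Does anyone know?

Also, validation data is not marked separately, so I think we need to check whether the KITTI dataset manages it separately.

In addition, I think we will need evaluation code for the final code [nnstreamer], so if you have time, please give that some thought as well!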

ddeokho commented 4 years ago

I think I read in a book that the split is usually about 7 (training) : 3 (testing), but here testing is the larger one. I'll check again!

hayleyshim commented 4 years ago

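Train Data: training data used to build the analysis model

The training data to be used on the laptop that Yura will set up for model training

Validation Data: validation data used to choose which of several candidate models is the best fit

Where it goes in the pipeline is still undecided

Test Data: result data used to confirm how well the finally selected model performs

The dataset that will be the final nnstreamer input

[gstreamer input]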

[Ref] ‘Squeezenet + KITTI’ Experiment

An Enhanced SqueezeNet Based Network for Real-Time Road-Object Segmentation(2019)

We dataset is come from SqueezeSeg, which download from the KITTI 3D object detection dataset [15] and transformed by spherical. This dataset collected about 1,0000 frames of point cloud images. All images are of the same size and format. Among them 8,000 images are used as the training images, and 2,000 are selected as the test images

SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud(2017)

Our primary dataset is the converted KITTI dataset described above. We split the publicly available raw dataset into a training set with 8,057 frames and a validation set with 2,791 frames.

SqueezeDet : https://github.com/BichenWuUCB/squeezeDet

I plan to spend the rest of today thinking a bit more about the split criteria, and about the train/validation/test datasets as well.

hayleyshim commented 4 years ago

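Current progress

Using the code that Deokho revised (mentioned above), confirmed that the txt files in an arbitrary 0.6 GB set of lidar sensor data can be read. In other words, the gstreamer code Deokho wrote appears to scale even if the number of data files is increased later.
train/validation/test split
- Considering an approach that automatically splits the dataset files into train/validation/test, based on the split code in SqueezeDet, which is implemented with SqueezeNet + the KITTI dataset
- In the code below, the split uses numpy's permutation to divide the data randomly; looking at the permutation function, it does the mixing with shuffle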

[Reference Code - random train_validation split.py]

import numpy as np

image_set_dir = './KITTI/ImageSets'
trainval_file = image_set_dir + '/trainval.txt'
train_file = image_set_dir + '/train.txt'
val_file = image_set_dir + '/val.txt'

# Read one sample index per line
idx = []
with open(trainval_file) as f:
    for line in f:
        idx.append(line.strip())

# Shuffle the indices randomly
idx = np.random.permutation(idx)

# Split in half; integer division keeps the slice bound an int in Python 3
train_idx = sorted(idx[:len(idx) // 2])
val_idx = sorted(idx[len(idx) // 2:])

with open(train_file, 'w') as f:
    for i in train_idx:
        f.write('{}\n'.format(i))

with open(val_file, 'w') as f:
    for i in val_idx:
        f.write('{}\n'.format(i))

print('Training set is saved to ' + train_file)
print('Validation set is saved to ' + val_file)
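
The reference script above only splits trainval.txt in half into train and validation. Since the `testing` folder in the KITTI tree has no `label_2`, scoring needs labelled frames, so one option is to carve all three subsets out of the labelled training ids. The following is a minimal sketch of that idea, not the project's decided method: the function name `split_indices`, the 70/15/15 ratios, and the seed are assumptions chosen for illustration.

```python
import numpy as np


def split_indices(ids, val_ratio=0.15, test_ratio=0.15, seed=42):
    """Shuffle ids once, then cut the shuffled list into train/val/test parts."""
    rng = np.random.default_rng(seed)  # seeded so the split is reproducible
    shuffled = rng.permutation(np.asarray(ids))
    n_val = int(len(shuffled) * val_ratio)
    n_test = int(len(shuffled) * test_ratio)
    n_train = len(shuffled) - n_val - n_test
    train = sorted(shuffled[:n_train].tolist())
    val = sorted(shuffled[n_train:n_train + n_val].tolist())
    test = sorted(shuffled[n_train + n_val:].tolist())
    return train, val, test


# KITTI-style zero-padded frame ids; 7481 matches the labelled training folder.
ids = ["%06d" % i for i in range(7481)]
train, val, test = split_indices(ids)
print(len(train), len(val), len(test))  # 5237 1122 1122
```

Each returned list can be written to train.txt, val.txt, and test.txt in the same one-id-per-line format the reference script uses.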

[Reference Code - permutation function]

def permutation(self, object x):
    if isinstance(x, (int, np.integer)):
        arr = np.arange(x)
        self.shuffle(arr)
        return arr

    arr = np.asarray(x)

    # shuffle has fast-path for 1-d
    if arr.ndim == 1:
        # Return a copy if same memory
        if np.may_share_memory(arr, x):
            arr = np.array(arr)
        self.shuffle(arr)
        return arr

    # Shuffle index array, dtype to ensure fast path
    idx = np.arange(arr.shape[0], dtype=np.intp)
    self.shuffle(idx)
    return arr[idx]
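
A short check of the behaviour described by the function above: `np.random.permutation` returns a shuffled copy and leaves its input unchanged, and an integer argument is treated as `np.arange(n)`. The sample ids and the seed are made up for the demonstration.

```python
import numpy as np

ids = np.array(["000000", "000001", "000002", "000003"])

np.random.seed(0)                      # seeding only to make the demo repeatable
shuffled = np.random.permutation(ids)  # returns a shuffled copy

# The input array is left untouched, and no element is lost or duplicated.
print(ids.tolist())
print(sorted(shuffled.tolist()) == ids.tolist())

# With an integer argument, permutation shuffles np.arange(n).
order = np.random.permutation(5)
print(sorted(order.tolist()))
```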

hayleyshim commented 4 years ago

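*Current plan

Gstreamer input data connected to NNStreamer: feed the txt files of lidar sensor data in as the Gstreamer input data (Deokho's code has been tested)
Model training data: prepare the image/label data for training the SqueezeNet model currently uploaded to GitHub. The part that splits the files into train/valid is already implemented (to be tested once the model training environment is set up)

*Before the meeting on Sunday, March 8, focus on connecting gstreamer and nnstreamer, which is currently the most urgent issue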

jwkanggist commented 4 years ago

Sounds good. It would be good to build the pipeline for the split data with tf.data. If you convert it once to TFRecord and upload it to a GCP bucket, it should be quick to share.
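
As a rough sketch of that suggestion, assuming TensorFlow 2.x: write each sample into a TFRecord file as a `tf.train.Example`, then read it back with `tf.data.TFRecordDataset`. The feature names `points` and `label`, the (x, y, z, intensity) row layout, the helper names, and the random stand-in data are all assumptions for illustration, not the project's actual schema.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def serialize_example(points, label):
    """Pack one sample (float32 array + integer label) into a tf.train.Example."""
    feature = {
        "points": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[points.astype(np.float32).tobytes()])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label)])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()


def parse_example(record):
    """Inverse of serialize_example; rows are assumed to be (x, y, z, intensity)."""
    spec = {
        "points": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, spec)
    points = tf.reshape(tf.io.decode_raw(parsed["points"], tf.float32), [-1, 4])
    return points, parsed["label"]


# Write a few random stand-in samples to a local TFRecord file.
path = os.path.join(tempfile.mkdtemp(), "train.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for label in range(3):
        writer.write(serialize_example(np.random.rand(8, 4), label))

# Read it back as a shuffled, batched tf.data pipeline.
dataset = (tf.data.TFRecordDataset(path)
           .map(parse_example)
           .shuffle(buffer_size=16)
           .batch(2))
for points, labels in dataset:
    print(points.shape, labels.numpy())
```

Plain `batch` works here only because every stand-in sample has the same number of points; real lidar frames of varying length would need padding or a fixed-size projection first. `tf.data.TFRecordDataset` can also read `gs://` paths, so the same pipeline could point at a file in a shared bucket.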

hayleyshim commented 4 years ago

Next steps: build a tf.record + tf.data pipeline for model training

hayleyshim commented 4 years ago

Future plan: build a tf.record + tf.data pipeline

nnstreamer-preprocessor / nnstreamer

dataset preparation for training & test #6