2023/04/14 ~ 2023/04/20

danbi5228 commented 1 year ago

2023/04/20 pm 9:30

danbi5228 commented 1 year ago

assign roles -s 0414 -c 13.2~13.2.2 13.2.3 13.2.4~5

njs03332 commented 1 year ago

| | 0 | 1 | 2 |
|---|---|---|---|
| member | 김유리 | 한단비 | 주선미 |
| chapter | 13.2~13.2.2 | 13.2.3 | 13.2.4~5 |

danbi5228 commented 1 year ago

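13.2.3 TensorFlow Protocol Buffers

Example protocol buffer: the main protocol buffer typically used in TFRecord files; it represents one sample (instance) in a dataset
- It holds a list of named features. Each feature is one of: a list of byte strings, a list of floats, or a list of integers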


# Definition of the Example protocol buffer

syntax = "proto3";

message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }  // [packed = true] is used for repeated numeric fields (more efficient encoding)
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};
message Features { map<string, Feature> feature = 1; };  // dictionary mapping each feature name to its value
message Example { Features features = 1; };

# Create an Example representing the same person as the earlier Person example, then save it to a TFRecord file

import tensorflow as tf

# `from tensorflow.train import ...` can fail in some TensorFlow 2.x versions, so alias the classes instead
BytesList, FloatList, Int64List = tf.train.BytesList, tf.train.FloatList, tf.train.Int64List
Feature, Features, Example = tf.train.Feature, tf.train.Features, tf.train.Example

### As in the definition above, an Example holds a single Features object
person_example = Example(
    features=Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com", b"c@d.com"]))
        }))

### Serialize person_example with SerializeToString() and write the result to a TFRecord file
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())

njs03332 commented 1 year ago

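13.2 The TFRecord Format

The format TensorFlow prefers for storing and reading large amounts of data efficiently
A simple binary format that stores a sequence of binary records of varying sizes

Each record consists of the record length, a CRC checksum to verify that the length is not corrupted, the actual data, and a CRC checksum for the data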
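
To make the record layout above concrete, here is a minimal pure-Python sketch of how one record could be framed and checked. It assumes the layout described in the TensorFlow documentation (a little-endian 64-bit length, then a masked CRC-32C of the length, the data, and a masked CRC-32C of the data). The helper names `crc32c`, `masked_crc32c`, `frame_record`, and `parse_record` are made up for this illustration; in real code you would just use `tf.io.TFRecordWriter`.

```python
import struct


def _make_crc32c_table():
    # Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    # Plain table-driven CRC-32C over the given bytes.
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    # Assumed masking step: rotate right by 15 bits, then add a constant (mod 2**32).
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(data: bytes) -> bytes:
    # length | CRC of length | data | CRC of data
    length = struct.pack("<Q", len(data))
    return (length
            + struct.pack("<I", masked_crc32c(length))
            + data
            + struct.pack("<I", masked_crc32c(data)))


def parse_record(buf: bytes) -> bytes:
    # Check both checksums and return the payload of a single framed record.
    (length,) = struct.unpack_from("<Q", buf, 0)
    (length_crc,) = struct.unpack_from("<I", buf, 8)
    if length_crc != masked_crc32c(buf[:8]):
        raise ValueError("length checksum mismatch")
    data = buf[12:12 + length]
    (data_crc,) = struct.unpack_from("<I", buf, 12 + length)
    if data_crc != masked_crc32c(data):
        raise ValueError("data checksum mismatch")
    return data


record = frame_record(b"hello")
payload = parse_record(record)
```

Each record therefore carries 16 bytes of overhead (8 for the length plus two 4-byte checksums), which is why records of different sizes can be stored back to back and still be split apart reliably.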


import tensorflow as tf

# Records must be written as bytes objects
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"Write the data as bytes objects")
    f.write(b"And this becomes the second record")

# Read the records back
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

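- By default, TFRecordDataset reads the files one at a time, in order
  - Setting `num_parallel_reads` makes it read several files in parallel and interleave their records
  - The same result can be obtained with `list_files()` and `interleave()`, as was done earlier for CSV files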
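
The two options in the list above can be sketched as follows. This is only an illustration: the file names, the record contents, and the choice of three files are assumptions made up for the example, and it writes to a temporary directory so it does not touch real data.

```python
import os
import tempfile

import tensorflow as tf

# Write three small TFRecord files (hypothetical names and contents).
tmp_dir = tempfile.mkdtemp()
filepaths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"part_{i}.tfrecord")
    with tf.io.TFRecordWriter(path) as f:
        for j in range(2):
            f.write(f"file {i}, record {j}".encode("utf-8"))
    filepaths.append(path)

# Option 1: let TFRecordDataset read the files in parallel.
parallel_dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=3)

# Option 2: list the files, then interleave one TFRecordDataset per file.
interleaved_dataset = tf.data.Dataset.list_files(filepaths, shuffle=False).interleave(
    lambda filepath: tf.data.TFRecordDataset(filepath),
    cycle_length=3)

records = [item.numpy() for item in interleaved_dataset]
for record in records:
    print(record)
```

Both datasets yield all six records; with interleaving, consecutive records come from different files instead of one file being exhausted before the next one starts.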

### 13.2.1 Compressed TFRecord Files
- Sometimes TFRecord files need to be compressed (especially when they are read over a network)
  - Use the options argument
```python
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    [...]

# Read the compressed file: the compression type must be specified here too
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
                                  compression_type="GZIP")
```
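
For reference, a complete round trip might look like the sketch below. The record contents and the temporary file location are assumptions chosen for the example, not something from the book.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "my_compressed.tfrecord")

# Write two GZIP-compressed records (made-up contents).
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as f:
    f.write(b"first compressed record")
    f.write(b"second compressed record")

# Read them back, telling the dataset which compression was used.
dataset = tf.data.TFRecordDataset([path], compression_type="GZIP")
records = [item.numpy() for item in dataset]
print(records)
```

The reader does not detect the compression type automatically, so the same `compression_type` has to be passed on both the writing and the reading side.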

givitallugot commented 1 year ago

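13.2.4 Loading and Parsing Example Protocol Buffers

tf.data.TFRecordDataset: reads the serialized Example protocol buffers

tf.io.parse_single_example: parses an Example. This function needs two arguments: (1) a string scalar tensor containing the serialized data, and (2) a description of each feature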


import tensorflow as tf

# Describe each feature: its shape, type, and default value
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}

for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialized_example,
                                                feature_description)


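- Fixed-length features are parsed as regular tensors, while variable-length features are parsed as sparse tensors
- A sparse tensor can be converted to a dense tensor with tf.sparse.to_dense()
- A BytesList can hold any binary data you want, including serialized objects
- Any tensor can be serialized with tf.io.serialize_tensor(), and the resulting byte string can be stored in a BytesList feature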
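
The points above can be tried out with a small sketch like this one. The feature name `"matrix"` and the sample values are assumptions made up for the illustration; `tf.io.parse_tensor()` is used as the counterpart of `tf.io.serialize_tensor()` to get the tensor back.

```python
import tensorflow as tf

# Serialize an arbitrary tensor into a byte string.
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])
serialized_matrix = tf.io.serialize_tensor(matrix)

# Store a variable-length feature and the serialized tensor in one Example.
example = tf.train.Example(features=tf.train.Features(feature={
    "emails": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"a@b.com", b"c@d.com"])),
    "matrix": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[serialized_matrix.numpy()])),
}))

feature_description = {
    "emails": tf.io.VarLenFeature(tf.string),        # parsed as a sparse tensor
    "matrix": tf.io.FixedLenFeature([], tf.string),  # parsed as a regular tensor
}
parsed = tf.io.parse_single_example(example.SerializeToString(),
                                    feature_description)

# Sparse to dense, and byte string back to the original tensor.
emails = tf.sparse.to_dense(parsed["emails"], default_value=b"")
restored_matrix = tf.io.parse_tensor(parsed["matrix"], out_type=tf.float32)
print(emails)
print(restored_matrix)
```

Note that `tf.io.parse_tensor()` needs to be told the dtype via `out_type`, since the byte string alone is just opaque data to the Example.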

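## 13.2.5 Handling Lists of Lists Using the SequenceExample Protocol Buffer
- SequenceExample is designed for samples that are lists of lists (for example, a document made up of sentences, each a list of words), optionally accompanied by contextual data such as author, title, and publication date
- It contains one Features object for the contextual data and a FeatureLists object holding one or more named FeatureList objects
- Each FeatureList contains a list of Feature objects, and each Feature can be a list of byte strings, a list of 64-bit integers, or a list of floats (here, each Feature would represent one sentence or comment)
- Building, serializing, and parsing a SequenceExample is similar to doing so for an Example
- The difference is that it is parsed with tf.io.parse_single_sequence_example()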
```python
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example, context_feature_descriptions,
    sequence_feature_descriptions)

# When a feature list holds variable-length sequences, convert it to a ragged tensor
parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists["content"])
```
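
The snippet above assumes the serialized SequenceExample and the feature descriptions already exist. A self-contained sketch of the whole flow could look like this; the feature names (`"author_id"`, `"content"`), the sample sentences, and the `words_to_feature` helper are all assumptions invented for the illustration.

```python
import tensorflow as tf

# Contextual data: one Features object (hypothetical author id).
context = tf.train.Features(feature={
    "author_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
})

# A list of lists: each sentence is a list of words.
content = [["Hello", "world"], ["Bye"]]


def words_to_feature(words):
    # One Feature per sentence, holding the words as byte strings.
    return tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[word.encode("utf-8") for word in words]))


sequence_example = tf.train.SequenceExample(
    context=context,
    feature_lists=tf.train.FeatureLists(feature_list={
        "content": tf.train.FeatureList(
            feature=[words_to_feature(sentence) for sentence in content]),
    }))
serialized_sequence_example = sequence_example.SerializeToString()

context_feature_descriptions = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
}
sequence_feature_descriptions = {
    "content": tf.io.VarLenFeature(tf.string),
}

parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example, context_feature_descriptions,
    sequence_feature_descriptions)

# Sentences have different lengths, so convert the sparse result to a ragged tensor.
parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists["content"])
print(parsed_context["author_id"])
print(parsed_content)
```

The context features are described and parsed just like in a regular Example, while the feature lists get their own description dictionary.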
