4. Streams and state

This chapter covers

- Applying stateful operations to Kafka Streams
- Using state stores for lookups and remembering previously seen data
- Joining streams for added insight
- How time and timestamps drive Kafka Streams

State is nothing more than the ability to recall information you’ve seen before and connect it to current information.

4.1 Thinking of events

Events that call for a decision:
- Three investors buy the same stock within a short period of time (Figure 4.1)
- The government approves a new drug

4.1.1 Streams need state

State : Sometimes it’s easy to reason about what’s going on, but usually you need some context to make good decisions. When it comes to stream processing, we call that added context state.
Why local state is a fundamental primitive in stream processing
Data in a stream changes more frequently than data in a database.
State is needed so that a stream can be joined with other data and enriched with more information.

4.2 Applying stateful operations to Kafka streams

Currently the rewards are calculated for a single purchase only; this will be changed so that they accumulate.

4.2.1 The transformValues processor

KStream.transformValues: the basic stateful function
- Semantically the same as KStream.mapValues()
  - Difference: it can use a state store, and work can be scheduled at regular intervals through punctuate() (revisited in chapter 6)

4.2.2 Stateful customer rewards

Chapter 3 code: mapValues() converts each Purchase into a RewardAccumulator.

KStream<String, RewardAccumulator> rewardsKStream =  
purchaseKStream.mapValues(purchase -> 
   RewardAccumulator.builder(purchase).build()); 
rewardsKStream.to("rewards", Produced.with(stringSerde,rewardAccumulatorSerde));

public class RewardAccumulator {

    private String customerId;
    private double purchaseTotal;
    private int currentRewardPoints;

  //details left out for clarity
}

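New requirements

The streaming application can now calculate reward points.
In addition, the time between the current purchase and the previous one should be captured.
It should also be possible to check whether the total reward points have reached a threshold.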

public class RewardAccumulator {

    private String customerId;
    private double purchaseTotal;
    private int currentRewardPoints;
    private int daysFromLastPurchase; // added
    private long totalRewardPoints; // added

  //details left out for clarity
}

Specifically, you’ll take two main steps:

1. Initialize the value transformer.
2. Map the Purchase object to a RewardAccumulator using state.

4.2.3 Initializing the value transformer

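public class PurchaseRewardTransformer implements ValueTransformer<Purchase, RewardAccumulator> {

    private KeyValueStore<String, Integer> stateStore; // created in section 4.3.3
    private final String storeName;
    private ProcessorContext context;

    // storeName is final, so it has to be assigned in a constructor
    public PurchaseRewardTransformer(String storeName) {
        this.storeName = storeName;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // look up the store that was registered under storeName
        stateStore = (KeyValueStore<String, Integer>) this.context.getStateStore(storeName);
    }

    // transform() is shown in section 4.2.4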

The punctuate() and close() methods of ValueTransformer are not shown in this listing; they are also covered in chapter 6.

4.2.4. Mapping the Purchase object to a RewardAccumulator using state

public RewardAccumulator transform(Purchase value) {
    RewardAccumulator rewardAccumulator = RewardAccumulator.builder(value).build();

    // Check for points accumulated so far by customer ID.
    Integer accumulatedSoFar = stateStore.get(rewardAccumulator.getCustomerId());

    if (accumulatedSoFar != null) {
        // Sum the points for the current transaction and present the total.
        rewardAccumulator.addRewardPoints(accumulatedSoFar);
    }

    // Save the new total points by customer ID in the local state store.
    stateStore.put(rewardAccumulator.getCustomerId(),
                   rewardAccumulator.getTotalRewardPoints());

    return rewardAccumulator;
}
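
The accumulate-then-store pattern in transform() can be tried without a broker. The following is a minimal sketch in plain Java, not the Kafka Streams API: the class RewardsSketch and its HashMap are hypothetical stand-ins for the transformer and its KeyValueStore, and the points are passed in directly instead of being derived from a Purchase.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for the transform() logic. The HashMap plays the role of
// the KeyValueStore; nothing here is part of the Kafka Streams API.
class RewardsSketch {

    // Hypothetical local state: customer ID -> total points so far.
    private final Map<String, Integer> store = new HashMap<>();

    // Adds the points for one purchase and returns the customer's running total.
    int transform(String customerId, int pointsForThisPurchase) {
        Integer accumulatedSoFar = store.get(customerId);
        int total = pointsForThisPurchase + (accumulatedSoFar == null ? 0 : accumulatedSoFar);
        store.put(customerId, total);
        return total;
    }

    public static void main(String[] args) {
        RewardsSketch rewards = new RewardsSketch();
        System.out.println(rewards.transform("c1", 10)); // 10
        System.out.println(rewards.transform("c1", 5));  // 15
        System.out.println(rewards.transform("c2", 7));  // 7
    }
}
```

The second call for "c1" returns 15 only because the first call left 10 in the store, which is exactly what a stateless mapValues() cannot do.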

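Gathering information per sale for a given customer implies that all transactions for that customer are on the same partition. (Why does this come up so suddenly?)
Because records are looked up in the state store by ID, it is important that transactions with the same customer ID land on the same partition.
In other words, if the same customerId is spread across different partitions, the same customer has to be looked up in multiple state stores. (This does not mean that each partition has its own state store: partitions are assigned to StreamTasks, and each StreamTask has its own state store.)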
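
To make the partitioning requirement concrete, here is a toy model in plain Java. It is a sketch under simplifying assumptions: one store per partition, and the class LocalStoreSketch and its method accumulate() are hypothetical names, not Kafka Streams types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: a record only ever reaches the local store that belongs to its partition.
class LocalStoreSketch {

    // Applies each (customerId, points) pair to the store of the given partition
    // and returns one store per partition.
    static List<Map<String, Integer>> accumulate(String[] customerIds, int[] points,
                                                 int[] partitions, int numPartitions) {
        List<Map<String, Integer>> stores = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            stores.add(new HashMap<>());
        }
        for (int i = 0; i < customerIds.length; i++) {
            stores.get(partitions[i]).merge(customerIds[i], points[i], Integer::sum);
        }
        return stores;
    }

    public static void main(String[] args) {
        String[] ids = {"c1", "c1"};
        int[] pts = {10, 5};
        // Same customer spread over two partitions: each store holds a partial total.
        System.out.println(accumulate(ids, pts, new int[] {0, 1}, 2));
        // Same customer on one partition: a single store holds the full total.
        System.out.println(accumulate(ids, pts, new int[] {1, 1}, 2));
    }
}
```

In the first call no single store knows that the customer has 15 points, which is why the records need to be repartitioned by customer ID before the stateful step.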

Solution: repartition by customerId

Repartitioning the data

Change the key and publish the records to a new topic.
Instead of changing the key, the partition can also be chosen with a StreamPartitioner.

Repartitioning in Kafka Streams

KStream.through()
No intermediate steps are needed; calling through() is enough.
through() can simplify the topology when a sink and a source would otherwise be created for the same topic.
In general the DefaultPartitioner is sufficient; for custom partitioning, use a StreamPartitioner.

RewardsStreamPartitioner streamPartitioner =
 new RewardsStreamPartitioner();

KStream<String, Purchase> transByCustomerStream =
 purchaseKStream.through("customer_transactions",
                               Produced.with(stringSerde,
                                             purchaseSerde,
                                             streamPartitioner));

Using a StreamPartitioner

public class RewardsStreamPartitioner implements StreamPartitioner<String, Purchase> {

    @Override
    public Integer partition(String key, Purchase value, int numPartitions) {
        // hashCode() can be negative; floorMod keeps the result in [0, numPartitions)
        return Math.floorMod(value.getCustomerId().hashCode(), numPartitions);
    }
}
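
One detail worth checking in any hash-based partitioner: Java's % operator takes the sign of the dividend, so hashCode() % numPartitions is negative whenever the hash is negative. The sketch below compares the two computations on a fixed negative value; PartitionSketch and its method names are made up for the illustration.

```java
// Shows why hash % numPartitions is not a safe partition number in Java.
class PartitionSketch {

    // Naive version: the sign of the result follows the sign of the hash.
    static int naivePartition(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    // Math.floorMod returns a value in [0, numPartitions) for a positive divisor.
    static int safePartition(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(naivePartition(-7, 3)); // -1
        System.out.println(safePartition(-7, 3));  // 2
    }
}
```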

> WARNING
Repartitioning duplicates data and adds processing overhead.
map(), transform(), and flatMap() can trigger automatic repartitioning,
so prefer mapValues(), transformValues(), and flatMapValues() whenever possible,
and use repartitioning logic sparingly.

4.2.5. Updating the rewards processor

KStream<String, RewardAccumulator> statefulRewardAccumulator =
 transByCustomerStream.transformValues(() -> new PurchaseRewardTransformer(rewardsStateStoreName), 
        rewardsStateStoreName); // ValueTransformerSupplier
statefulRewardAccumulator.to("rewards",
                              Produced.with(stringSerde,
                                       rewardAccumulatorSerde));

Keys and state are closely tied together; this section has shown a state store in use.

4.3. Using state stores for lookups and previously seen data

4.3.1 Data locality

Data locality is critical for performance.
Key lookups are very fast, but the latency of a remote store can become a bottleneck.

4.3.2 Failure recovery and fault tolerance

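The local in-memory store is backed up to an internal topic.
Each processor has its own local store, so a failure in one does not affect the others.
Backing up a state store does come at a cost, though.
- Mitigation: the KafkaProducer sends records in batches, and records are cached by default.
- Kafka Streams writes records to the store only when the cache is flushed.
- As a result, only the latest record for a given key is persisted; chapter 5 covers this in more detail.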
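
The effect of caching on the backup cost can be illustrated with a small model. This is a sketch only: CacheFlushSketch, its LinkedHashMap cache, and the list standing in for the changelog are assumptions made for the example and do not mirror the actual Kafka Streams internals.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy write cache: updates to the same key overwrite each other in the cache,
// so only the latest value per key is written out on flush.
class CacheFlushSketch {

    private final Map<String, Integer> cache = new LinkedHashMap<>();
    private final List<String> changelog = new ArrayList<>();

    // Buffers an update; nothing is written to the changelog yet.
    void put(String key, int value) {
        cache.put(key, value);
    }

    // Writes one entry per cached key to the changelog and empties the cache.
    void flush() {
        for (Map.Entry<String, Integer> entry : cache.entrySet()) {
            changelog.add(entry.getKey() + "=" + entry.getValue());
        }
        cache.clear();
    }

    List<String> changelog() {
        return changelog;
    }

    public static void main(String[] args) {
        CacheFlushSketch store = new CacheFlushSketch();
        store.put("c1", 10);
        store.put("c1", 15);
        store.put("c2", 7);
        store.flush();
        System.out.println(store.changelog()); // [c1=15, c2=7]
    }
}
```

Three updates result in two changelog writes, because the intermediate value for "c1" never leaves the cache.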

4.3.3. Using state stores in Kafka Streams

String rewardsStateStoreName = "rewardsPointsStore";
KeyValueBytesStoreSupplier storeSupplier = Stores.inMemoryKeyValueStore(rewardsStateStoreName); //in-memory k/v store.

StoreBuilder<KeyValueStore<String, Integer>> storeBuilder =
 Stores.keyValueStoreBuilder(storeSupplier,
                                Serdes.String(),
                                Serdes.Integer());

builder.addStateStore(storeBuilder); // builder is the StreamsBuilder

4.3.4. Additional key/value store suppliers

Persistent state stores use RocksDB for local storage.

4.3.5. StateStore fault tolerance

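The section only states that nothing is lost when a failure occurs, because the existing contents of the state store have already been saved (together with offsets).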

4.3.6. Configuring changelog topics

withLoggingEnabled(Map<String, String> config): the settings for the state store's changelog topic are managed here.
Kafka Streams handles changelog topic creation for you.
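
As an illustration of what such a configuration map might contain, here is a hedged sketch. The keys are standard Kafka topic-level settings; the values and the class name ChangelogConfigSketch are assumptions chosen for the example, and the call into Kafka Streams appears only as a comment so that the block runs without the library.

```java
import java.util.HashMap;
import java.util.Map;

// Builds a topic-level configuration map for a changelog topic.
class ChangelogConfigSketch {

    static Map<String, String> changelogConfig() {
        Map<String, String> config = new HashMap<>();
        config.put("retention.ms", "172800000");        // 2 days (example value)
        config.put("retention.bytes", "10000000000");   // example size limit
        config.put("cleanup.policy", "compact,delete"); // compact, and also delete old segments
        return config;
    }

    public static void main(String[] args) {
        // With Kafka Streams on the classpath, the map would be passed to the store builder:
        // storeBuilder.withLoggingEnabled(changelogConfig());
        System.out.println(changelogConfig());
    }
}
```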

4.4. Joining streams for added insight

ZMart + Coffee shop
Customers who buy electronics and coffee get a coffee coupon.

4.4.1. Data setup

Predicate<String, Purchase> coffeePurchase = (key, purchase) ->
 purchase.getDepartment().equalsIgnoreCase("coffee");

Predicate<String, Purchase> electronicPurchase = (key, purchase) ->
 purchase.getDepartment().equalsIgnoreCase("electronics");

final int COFFEE_PURCHASE = 0;
final int ELECTRONICS_PURCHASE = 1;

KStream<String, Purchase>[] branchedTransactions =
 transactionStream.branch(coffeePurchase, electronicPurchase); //create branch

The stream is split into two branches so that they can be joined.

4.4.2. Generating keys containing customer IDs to perform joins

KStream<String, Purchase>[] branchesStream =
  transactionStream.selectKey((k,v)->
  v.getCustomerId()).branch(coffeePurchase, electronicPurchase);

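customerId becomes the key; it plays the role of a foreign key for the join.

When a method that can change the key (selectKey, map, or transform) is called, Kafka Streams sets an internal boolean flag to true to mark the stream as needing repartitioning. With the flag set, repartitioning is handled automatically when a join, reduce, or aggregation follows. In the code above, selectKey triggers it.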
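
The flag mechanism can be pictured with a toy model. This is not how KStream is implemented; FlagSketch and its methods are hypothetical names that only mimic the described behaviour: key-changing operations set the flag, value-only operations carry it along, and a join acts on it.

```java
// Toy model of a "repartition required" flag travelling through a chain of operations.
class FlagSketch {

    final boolean repartitionRequired;

    FlagSketch(boolean repartitionRequired) {
        this.repartitionRequired = repartitionRequired;
    }

    // Key-changing operation: the resulting stream is marked as needing a repartition.
    FlagSketch selectKey() {
        return new FlagSketch(true);
    }

    // Value-only operation: the key is untouched, so the flag is carried over unchanged.
    FlagSketch mapValues() {
        return new FlagSketch(repartitionRequired);
    }

    // Describes what a join on this stream would have to do first.
    String join() {
        return repartitionRequired ? "repartition, then join" : "join";
    }

    public static void main(String[] args) {
        FlagSketch source = new FlagSketch(false);
        System.out.println(source.mapValues().join());             // join
        System.out.println(source.selectKey().mapValues().join()); // repartition, then join
    }
}
```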

4.4.3. Constructing the join

Joining purchase records

public class PurchaseJoiner implements ValueJoiner<Purchase, Purchase, CorrelatedPurchase> {

    @Override
    public CorrelatedPurchase apply(Purchase purchase, Purchase otherPurchase) {

        CorrelatedPurchase.Builder builder = CorrelatedPurchase.newBuilder();

        Date purchaseDate = purchase != null ? purchase.getPurchaseDate() : null;
        Double price = purchase != null ? purchase.getPrice() : 0.0;
        String itemPurchased = purchase != null ? purchase.getItemPurchased() : null;

        Date otherPurchaseDate = otherPurchase != null ? otherPurchase.getPurchaseDate() : null;
        Double otherPrice = otherPurchase != null ? otherPurchase.getPrice() : 0.0;
        String otherItemPurchased = otherPurchase != null ? otherPurchase.getItemPurchased() : null;

        List<String> purchasedItems = new ArrayList<>();

        if (itemPurchased != null) {
            purchasedItems.add(itemPurchased);
        }

        if (otherItemPurchased != null) {
            purchasedItems.add(otherItemPurchased);
        }

        String customerId = purchase != null ? purchase.getCustomerId() : null;
        String otherCustomerId = otherPurchase != null ? otherPurchase.getCustomerId() : null;

        builder.withCustomerId(customerId != null ? customerId : otherCustomerId)
                .withFirstPurchaseDate(purchaseDate)
                .withSecondPurchaseDate(otherPurchaseDate)
                .withItemsPurchased(purchasedItems)
                .withTotalAmount(price + otherPrice);

        return builder.build();
    }
}

The joiner creates a CorrelatedPurchase object.
Thanks to the null checks, the same joiner can be used for inner, outer, and left-outer joins.

Implementing the join

        KStream<String, Purchase> coffeeStream = branchesStream[COFFEE_PURCHASE];
        KStream<String, Purchase> electronicsStream = branchesStream[ELECTRONICS_PURCHASE];

        ValueJoiner<Purchase, Purchase, CorrelatedPurchase> purchaseJoiner = new PurchaseJoiner(); // init value joiner
        JoinWindows twentyMinuteWindow =  JoinWindows.of(60 * 1000 * 20);

        KStream<String, CorrelatedPurchase> joinedKStream = coffeeStream.join(electronicsStream,
                                                                              purchaseJoiner,
                                                                              twentyMinuteWindow,
                                                                              Joined.with(stringSerde,
                                                                                          purchaseSerde,
                                                                                          purchaseSerde)); // construct join

        joinedKStream.print(Printed.<String, CorrelatedPurchase>toSysOut().withLabel("joined KStream"));

JoinWindows: the maximum time difference between two records for them to be joined.
The timestamp used here is the timestamp of the actual transaction, not the Kafka record timestamp.
For that, set StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG to TransactionTimestampExtractor.class.
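
The window check itself is simple arithmetic on the two timestamps. The sketch below is a plain-Java approximation, assuming that the window is symmetric and that both bounds are inclusive; JoinWindowSketch and withinWindow() are illustrative names, not part of the Kafka Streams API.

```java
// Plain-Java approximation of the join-window check on two record timestamps.
class JoinWindowSketch {

    static final long TWENTY_MINUTES_MS = 60 * 1000 * 20;

    // True when the two timestamps are at most windowMs apart, in either order.
    static boolean withinWindow(long timestamp, long otherTimestamp, long windowMs) {
        return Math.abs(timestamp - otherTimestamp) <= windowMs;
    }

    public static void main(String[] args) {
        long coffee = 1_000_000L;
        // Electronics purchase 19 minutes after the coffee: joinable.
        System.out.println(withinWindow(coffee, coffee + 19 * 60 * 1000, TWENTY_MINUTES_MS));
        // Electronics purchase 21 minutes before the coffee: not joinable.
        System.out.println(withinWindow(coffee, coffee - 21 * 60 * 1000, TWENTY_MINUTES_MS));
    }
}
```

Because the comparison uses the absolute difference, it does not matter which of the two purchases happened first.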

Co-partitioning

Before joining, make sure both sides have the same number of partitions; otherwise a TopologyBuilderException is thrown.
Also check that the keys are of the same type.

4.4.4. Other join options

Outer Join

coffeeStream.outerJoin(electronicsStream,..)

Left-outer join

coffeeStream.leftJoin(electronicsStream..)
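
The difference between the three join types comes down to which side is allowed to be missing. The following sketch models only that aspect and ignores windowing and the timing of emitted results; JoinTypesSketch, JoinType, and the item strings are hypothetical, and joinItems() loosely mirrors the null handling of PurchaseJoiner.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of inner, left-outer, and outer joins for one key.
class JoinTypesSketch {

    enum JoinType { INNER, LEFT, OUTER }

    // Null-tolerant joiner: collects whichever sides are present.
    static List<String> joinItems(String item, String otherItem) {
        List<String> items = new ArrayList<>();
        if (item != null) {
            items.add(item);
        }
        if (otherItem != null) {
            items.add(otherItem);
        }
        return items;
    }

    // Returns the joined items, or null when this join type emits nothing.
    static List<String> join(JoinType type, String item, String otherItem) {
        boolean emit;
        switch (type) {
            case INNER: emit = item != null && otherItem != null; break;
            case LEFT:  emit = item != null; break;
            default:    emit = item != null || otherItem != null; break;
        }
        return emit ? joinItems(item, otherItem) : null;
    }

    public static void main(String[] args) {
        System.out.println(join(JoinType.INNER, "coffee", null)); // null
        System.out.println(join(JoinType.LEFT, "coffee", null));  // [coffee]
        System.out.println(join(JoinType.OUTER, null, "laptop")); // [laptop]
    }
}
```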

4.5 Timestamps in Kafka Streams

Roles of timestamps in Kafka Streams:
1. Joining streams
2. Updating a changelog (KTable API)
3. Deciding when the Processor.punctuate() method is triggered (Processor API)

Kinds of timestamps in stream processing:
1. Event time: the time of the record itself (using the metadata timestamp is the efficient option)
2. Ingestion time: the time the record enters Kafka
3. Processing time: the time the record first passes through the processing pipeline

A custom extractor can be implemented through the TimestampExtractor interface.

4.5.1 Provided TimestampExtractor implementations

/* @see FailOnInvalidTimestamp
 * @see LogAndSkipOnInvalidTimestamp
 * @see UsePreviousTimeOnInvalidTimestamp
 * @see WallclockTimestampExtractor
 */
@InterfaceStability.Evolving
abstract class ExtractRecordMetadataTimestamp implements TimestampExtractor {

    /**
     * Extracts the embedded metadata timestamp from the given {@link ConsumerRecord}.
     *
     * @param record a data record
     * @param previousTimestamp the latest extracted valid timestamp of the current record's partition (could be -1 if unknown)
     * @return the embedded metadata timestamp of the given {@link ConsumerRecord}
     */
    @Override
    public long extract(final ConsumerRecord<Object, Object> record, final long previousTimestamp) {
        final long timestamp = record.timestamp();

        if (timestamp < 0) {
            return onInvalidTimestamp(record, timestamp, previousTimestamp);
        }

        return timestamp;
    }

    /**
     * Called if no valid timestamp is embedded in the record meta data.
     *
     * @param record a data record
     * @param recordTimestamp the timestamp extractor from the record
     * @param previousTimestamp the latest extracted valid timestamp of the current record's partition (could be -1 if unknown)
     * @return a new timestamp for the record (if negative, record will not be processed but dropped silently)
     */
    public abstract long onInvalidTimestamp(final ConsumerRecord<Object, Object> record,
                                            final long recordTimestamp,
                                            final long previousTimestamp);
}
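
The three subclasses named in the @see list differ only in what onInvalidTimestamp() does. The sketch below models that in plain Java, as I understand their documented behaviour: fail, pass the invalid value through so the record is dropped, or fall back to the previous timestamp. The names InvalidTimestampSketch and Policy are made up, and IllegalStateException stands in for the exception type the real classes throw.

```java
// Plain-Java model of the extract()/onInvalidTimestamp() split shown above.
class InvalidTimestampSketch {

    // Hypothetical stand-in for the abstract onInvalidTimestamp() hook.
    interface Policy {
        long onInvalid(long recordTimestamp, long previousTimestamp);
    }

    // Models FailOnInvalidTimestamp: refuse to continue.
    static final Policy FAIL = (ts, prev) -> {
        throw new IllegalStateException("invalid timestamp " + ts);
    };

    // Models LogAndSkipOnInvalidTimestamp: return the negative value so the record is dropped.
    static final Policy LOG_AND_SKIP = (ts, prev) -> ts;

    // Models UsePreviousTimeOnInvalidTimestamp: reuse the last valid timestamp if there is one.
    static final Policy USE_PREVIOUS = (ts, prev) -> {
        if (prev < 0) {
            throw new IllegalStateException("no previous timestamp available");
        }
        return prev;
    };

    // Valid timestamps pass through; negative ones are handed to the policy.
    static long extract(long recordTimestamp, long previousTimestamp, Policy policy) {
        if (recordTimestamp < 0) {
            return policy.onInvalid(recordTimestamp, previousTimestamp);
        }
        return recordTimestamp;
    }

    public static void main(String[] args) {
        System.out.println(extract(42L, -1L, FAIL));          // 42
        System.out.println(extract(-1L, 100L, LOG_AND_SKIP)); // -1
        System.out.println(extract(-1L, 100L, USE_PREVIOUS)); // 100
    }
}
```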

4.5.2. WallclockTimestampExtractor

Returns System.currentTimeMillis(), so records are stamped with processing time.

4.5.3. Custom TimestampExtractor

public class TransactionTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        Purchase purchasePurchaseTransaction = (Purchase) record.value();
        return purchasePurchaseTransaction.getPurchaseDate().getTime();
    }
}

4.5.4. Specifying a TimestampExtractor

props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, TransactionTimestampExtractor.class);

Summary

State
Abstractions for stateful transformations and joins
State Store
Timestamp
