[Feat] 식당 정보 필터링 api 쿼리 개선

johan1103 commented 1 year ago

💡 개요

식당 정보를 요청하는 api에 사용되는 필터 쿼리문입니다.. 시간 측정을 해 본 결과 execution time과 fetch time이 각각 4s 527, 47ms가 소요되었습니다. OLTP 환경의 서비스에서 너무나도 느린 속도입니다. 해당 데이터베이스 적절한 인덱스를 설정해 줌으로서 MySQL 데이터베이스의 응답 속도를 개선하고자 합니다.

fetch time vs execution time

참고 링크 https://enterone.tistory.com/216 https://blogingming.tistory.com/entry/%EC%8B%A4%ED%96%89%EC%8B%9C%EA%B0%84-Execution-time-vs-%ED%8C%A8%EC%B9%98%EC%8B%9C%EA%B0%84-Fetch-time Chat GPT

fetch time 네트워크를 통해 서버에서 전체 결과 집합을 가져오는 데 걸리는 시간을 측정합니다.
execution time MySQL 서버가 네트워크를 통해 데이터를 보내고 받는 데 소요되는 시간을 제외하고 쿼리를 실행하고 결과 집합을 클라이언트에 반환하는 데 걸리는 시간을 측정합니다.

네트워크 시간을 고려하지 않은 execution time을 기준으로 쿼리 응답 속도를 개선하고자 합니다.

🤩 기능 설명

execution api 응답 속도 개선

개선할 쿼리

select r.title_image_url,r.name,r.score,rm.price,rm.name,ft.name, ST_Distance_Sphere(POINT(rl.longitude,rl.latitude),
    POINT(126.64639071503901,37.45152351651424)) as distance from restaurant r join food_type ft
        on r.food_type_id = ft.id join restaurant_menu rm on r.id = rm.restaurant_id join restaurant_location rl
            on r.id=rl.restaurant_id
            where rm.order_number = 1
              and rm.price <= 20000
              and ft.name = '한식'
              and ST_Distance_Sphere(POINT(126.65739071503901,37.45152351651424),
                  POINT(rl.longitude,rl.latitude))<200
            order by r.score DESC limit 20 offset 0

📖 참고 사항

참고 레퍼런스 Real mysql 8.0 1권 https://velog.io/@sihyung92/query-tunning-1-execution-plan

key word

인덱스
실행계획

TODO

restaurant_location 속성이 latitude & longitude가 삭제되고 location이 추가됐기 때문에 JPA 엔티티 정보를 수정해주어야 합니다.

[ ] #30

johan1103 commented 1 year ago

실행 계획

현재 해당 SQL의 실행계획은 다음과 같습니다.

현재 SQL

select r.title_image_url,r.name,r.score,rm.price,rm.name,ft.name, ST_Distance_Sphere(POINT(rl.longitude,rl.latitude),
    POINT(126.64639071503901,37.45152351651424)) as distance from restaurant r join food_type ft
        on r.food_type_id = ft.id join restaurant_menu rm on r.id = rm.restaurant_id join restaurant_location rl
            on r.id=rl.restaurant_id
            where rm.order_number = 1
              and rm.price <= 20000
              and ft.name = '한식'
              and ST_Distance_Sphere(POINT(126.65739071503901,37.45152351651424),
                  POINT(rl.longitude,rl.latitude))<200
            order by r.score DESC limit 20 offset 0;

실행 계획

johan1103 commented 1 year ago

개선 방안

✏️ order_number & price 인덱스

가게의 대표 메뉴 가격 정보를 가져오기 위해 서비스의 필터 조건이 어떻든 항상 restaurant_menu 테이블에서 order_number=1인 정보만 필터링해서 가져옵니다. 필터 조건에 order_number=1인 메뉴의 가격 제한이 들어가있기 때문에 멀티 칼럼 인덱스를 생성해주면 디스크에서 필요한 레코드를 빠른속도로 미리 필터링해서 가져올 수 있습니다.

가격정보는 범위 조건(<)이지만 order_number 조건은 동등조건(=)입니다. 동등 조건을 멀티 칼럼 인덱스에서 우선순위로 두게 되면 인덱스 레인지 스캔을 할 때 범위 조건 정보보다 더 효율적으로 원하는 칼럼들만 가져올 수 있습니다. 다시 말해, 가져오는 레코드 정보들 중에서 실제로 원하는 조건에 맞는 레코드를 가져오는 비율이 높은 경우가 많습니다.

따라서 order_number를 인덱스의 첫번째 정렬 조건으로 선택했습니다.

실행 계획

🚀 Table

🚀 Analyze

-> Limit: 20 row(s)  (actual time=948.878..948.894 rows=19 loops=1)
    -> Sort: r.score DESC, limit input to 20 row(s) per chunk  (actual time=948.878..948.891 rows=19 loops=1)
        -> Stream results  (cost=254510.04 rows=16746) (actual time=42.064..948.752 rows=19 loops=1)
            -> Nested loop inner join  (cost=254510.04 rows=16746) (actual time=42.055..948.655 rows=19 loops=1)
                -> Nested loop inner join  (cost=225254.56 rows=16746) (actual time=11.323..752.906 rows=69323 loops=1)
                    -> Inner hash join (no condition)  (cost=141292.13 rows=50239) (actual time=0.176..510.912 rows=95755 loops=1)
                        -> Filter: (rm.restaurant_id is not null)  (cost=70645.79 rows=100477) (actual time=0.057..504.309 rows=95755 loops=1)
                            -> Index range scan on rm using idx_rm_order over (order_number = 1 AND NULL < price <= 20000), with index condition: ((rm.order_number = 1) and (rm.price <= 20000))  (cost=70645.79 rows=200954) (actual time=0.057..500.464 rows=95755 loops=1)
                        -> Hash
                            -> Filter: (ft.`name` = '한식')  (cost=0.55 rows=1) (actual time=0.080..0.086 rows=1 loops=1)
                                -> Table scan on ft  (cost=0.55 rows=3) (actual time=0.072..0.076 rows=4 loops=1)
                    -> Filter: (r.food_type_id = ft.id)  (cost=0.39 rows=0.3) (actual time=0.002..0.002 rows=1 loops=95755)
                        -> Single-row index lookup on r using PRIMARY (id=rm.restaurant_id)  (cost=0.39 rows=1) (actual time=0.002..0.002 rows=1 loops=95755)
                -> Filter: (st_distance_sphere(<cache>(point(126.65739071503901,37.45152351651424)),point(rl.longitude,rl.latitude)) < 200)  (cost=0.41 rows=1) (actual time=0.003..0.003 rows=0 loops=69323)
                    -> Single-row index lookup on rl using PRIMARY (restaurant_id=rm.restaurant_id)  (cost=0.41 rows=1) (actual time=0.002..0.002 rows=1 loops=69323)

드라이빙 테이블은 food_type 테이블이지만 다음 드리븐 테이블인 restaurant_menu 테이블과 조인할 때 hash join을 사용하고 인덱스를 활용해 검색을 시작하는 테이블은 restaurant_menu 테이블임을 Table 형태의 실행계획에서 확인할 수 있습니다.

이후 Join 하는 테이블들은 전부 FK 혹은 PK 인덱스를 활용해서 Nested loop join으로 정보들을 가져온 후에 where조건에 맞는 레코드들을 걸러내는 것을 두 실행계획에서 알 수 있습니다.

그 결과 시간을 3.835248 초에서 1.85초로 단축할 수 있었습니다.

johan1103 commented 1 year ago

✏️ Location index

쿼리문에서 조인하는 여러개의 테이블에 인덱스가 각각 생성되어 있다면, 옵티마이저는 레코드 정보들을 가장 효율적으로 가져올 수 있는 인덱스의 테이블을 드라이빙 테이블로 선택합니다.(Hash join을 안 쓸 때)

order_number 와 price에 인덱스를 생성해 본 이유는 restaurant의 food_type의 카디널리티보다 order_number의 카디널리티가 훨씬 높았고 위치조건을 제외하면 order_number와 price 인덱스로 레코드를 걸러내는 것이 인덱스를 활용한 최대한의 필터링 방법이라고 생각 했었기 때문입니다.

쿼리 튜닝을 위해 Real_mysql을 읽던 와중 R-Tree로 위치정보에 대한 index를 생성하는 법을 알아냈고, restaurant_location 테이블의 위도 경도 속성을 삭제하고 location이라는 새로운 속성을 추가해서 location으로 인덱스로 생성하고자 했습니다.

🎯 location index 사용 이유

인덱스를 생성했음에도 옵티마이저가 인덱스를 사용하지 않는 이유는 대표적으로 두가지가 있을 수 있습니다.

테이블의 레코드 수가 너무 적어서 풀 테이블 스캔이 더 빠를 때
인덱스로 where 조건문(혹은 다른 조건)에 맞는 레코드의 수를 줄였더라도 총 레코드 수의 25%정도보다 많을 때

두 경우 다 인덱스 레인지 스캔을 사용해서 레코드를 읽어오는 방법보다 풀 테이블 스캔으로 레코드를 읽어오는 방법이 더 빠르기 때문에 옵티마이저는 인덱스를 사용하지 않습니다.

위치 정보를 인덱싱하는 경우 위 두가지 조건으로부터 항상 안전한것을 보장할 수 있습니다.

테이블의 레코드 수가 너무 적을 때

현재 restaurant_location에 있는 레코드 수는 총 19만여개로 이미 인덱스로 효과적인 필터링이 가능하면 옵티마이저가 인덱스를 선택할 수 있는 환경입니다.
인덱스로 레코드 수를 필터링해서 줄이더라도 전체보다 25%정보보다 많을 때

현재 저의 서비스는 최대 검색 거리를 10km로 제한하고 있고, 검색 거리를 필터 정보에 넣지 않은 경우 Default로 1km가 검색됩니다. 최악의 경우에 인덱스가 사용되지 않으려면 특정 위치의 반경 10km 이내에 식당의 수가 전국 전체 식당수의 1/4보다 많아야 합니다.

현재 위치정보의 레코드 수는 19만개로 1/4보다 많기 위해서는 약 4만여개의 식당이 10km안에 있어야 하는데 이는 현실적으로 불가능하다고 생각했습니다.

무엇보다 위치정보를 가장 효과적인 인덱스라고 생각한 이유는 해당 인덱스가 어떤 필터 정보보다 필터링 효과가 클 것이라고 생각했기 때문입니다.

전국에서 특정 위치의 반경 1km 식당만 필터링할 수 있다면 19만개 중에서 아무리 많아봐야 1000여개로 필터링 효과를 극대화 할 수 있기 때문에 이러한 위치정보를 인덱스 레인지 스캔할 수 있다면 가장 효과적인 인덱스 정보가 될 것이라고 생각했습니다.

🎨 수정 항목

restaurant_location 속성 변경

point자료형을 가진 위치정보로 인덱스를 생성해주어야 하기 때문에 기존의 위도 경도 정보를 테이블에 가지고 있는 것이 아닌, 위도 경도로 만들어진 하나의 위치정보로 테이블 속성을 수정했습니다.
```
ALTER TABLE restaurant_location MODIFY location POINT NOT NULL SRID 4326;
```
인덱스로 생성하기 위해서는 location의 속성은 NOT NULL이어야 합니다.

수정 결과 (resturant_location columns)
ST_Distance_Sphere

ST_Distance_Sphere는 Mysql에서 위도 경도정보를 토대로 거리를 계산해주는 function입니다. 해당 함수의 결과를 상수랑 비교하는 형태는 인덱스를 활용할 수 없는 형태입니다.

그래서 차선책으로 MBR을 이용한 ST_Whithin()혹은 ST_Contains()함수를 이용하면 인덱스를 활용할 수 있습니다. 해당 함수들은 두개의 파라미터로 각각 Geometry정보를 받고 ST_Within의 경우에는 두번째 파라미터 도형안에 첫번째 파라미터 도형이 들어가는지 여부를 파악하는 기능을 가진 함수입니다. (ST_Contains는 파라미터 순서만 반대)

해당 함수들은 값을 인덱스를 사용하지 못하는 함수들 처럼 파라미터 값을 연산을 통해서 처리하는 방식이 아닌 단순 파라미터 값 간의 비교이기 때문에 인덱스를 사용할 수 있습니다.

따라서 쿼리문의 거리 비교 함수를 ST_Within함수로 수정했습니다.

getDistanceMBR

쿼리를 작성할 때, 특정 위도 경도 반경내의 restaurant 위치 정보를 검색하기 위해서는 특정 위도 경도 기준으로 검색하고자 하는 반경 거리를 감싸는 MBR을 생성해야만 ST_Within함수를 활용해서 인덱스 검색을 할 수 있습니다.

따라서 주어진 위도 경도로 MBR을 생성해내는 함수를 작성해주었습니다. (function name : getDistanceMBR)

CREATE function getDistanceMBR(p_origin POINT, p_distanceKm DOUBLE) returns POLYGON
DETERMINISTIC
SQL SECURITY INVOKER
    BEGIN
        DECLARE v_originLat DOUBLE DEFAULT 0.0;
        DECLARE v_originLon DOUBLE DEFAULT 0.0;

        DECLARE v_deltaLon DOUBLE DEFAULT 0.0;
        DECLARE v_Lat1 DOUBLE DEFAULT 0.0;
        DECLARE v_Lon1 DOUBLE DEFAULT 0.0;
        DECLARE v_Lat2 DOUBLE DEFAULT 0.0;
        DECLARE v_Lon2 DOUBLE DEFAULT 0.0;

        SET v_originLat = ST_X(p_origin); /* = ST_Latitude(p_origin) for SRID=4326*/
        SET v_originLon = ST_Y(p_origin); /* = ST_Longitude(p_origin) for SRID=4326*/

        SET v_deltaLon = p_distanceKm / ABS(COS(RADIANS(v_originLat))*111.32);
        SET v_Lon1 = v_originLon-v_deltaLon;
        SET v_Lon2 = v_originLon+v_deltaLon;
        SET v_Lat1 = v_originLat - (p_distanceKm/111.32);
        SET v_Lat2 = v_originLat + (p_distanceKm/111.32);

        SET @mbr = ST_AsText(ST_Envelope(ST_GeomFromText(CONCAT("LINESTRING(",v_Lat1," ",
            v_Lon1,",",v_Lat2," ", v_Lon2,")"))));
        RETURN ST_PolygonFromText(@mbr, ST_SRID(p_origin));
    end;

🎨 수정 쿼리문

select r.title_image_url,r.name,r.score,rm.price,rm.name,ft.name,r.rid,
       ROUND(ST_Distance_Sphere(rl.location,ST_PointFromText('POINT(37.5636254 126.9080488)',4326)))
           as distance from restaurant r
               join food_type ft on r.food_type_id = ft.id
               join restaurant_menu rm on r.id = rm.restaurant_id
               join restaurant_location rl on r.id=rl.restaurant_id
            where rm.order_number = 1
              and rm.price <= 15000
              and ft.name = '중식'
              and ST_Within(location, getDistanceMBR(ST_PointFromText('POINT(37.5636254 126.9080488)',
                  4326),1))
            order by r.score desc limit 20 offset 0;

인덱스 생성

location 단일 컬럼을 가진 공간 인덱스를 생성해주었습니다.

create spatial index idx_rl_location on restaurant_location (location);

해당 인덱스는 B-Tree가 아닌 SPATIAL 인덱스로서, 자료구조로 2차원 B-Tree인 R-Tree를 사용합니다.

📋 실행 계획

-> Limit: 20 row(s)  (actual time=7.238..7.240 rows=10 loops=1)
    -> Sort: r.score DESC, limit input to 20 row(s) per chunk  (actual time=7.237..7.239 rows=10 loops=1)
        -> Stream results  (cost=16.66 rows=2) (actual time=0.464..7.216 rows=10 loops=1)
            -> Nested loop inner join  (cost=16.66 rows=2) (actual time=0.444..7.050 rows=10 loops=1)
                -> Nested loop inner join  (cost=8.31 rows=2) (actual time=0.392..6.184 rows=15 loops=1)
                    -> Filter: st_within(rl.location,<cache>(getDistanceMBR(st_pointfromtext('POINT(37.5636254 126.9080488)',4326),1)))  (cost=6.50 rows=5) (actual time=0.368..5.127 rows=276 loops=1)
                        -> Inner hash join (no condition)  (cost=6.50 rows=5) (actual time=0.258..1.632 rows=276 loops=1)
                            -> Index range scan on rl using idx_rl_location over (location unprintable_geometry_value)  (cost=5.95 rows=5) (actual time=0.222..1.508 rows=276 loops=1)
                            -> Hash
                                -> Filter: (ft.`name` = '중식')  (cost=0.55 rows=1) (actual time=0.024..0.027 rows=1 loops=1)
                                    -> Table scan on ft  (cost=0.55 rows=3) (actual time=0.019..0.023 rows=4 loops=1)
                    -> Filter: (r.food_type_id = ft.id)  (cost=0.27 rows=0.3) (actual time=0.004..0.004 rows=0 loops=276)
                        -> Single-row index lookup on r using PRIMARY (id=rl.restaurant_id)  (cost=0.27 rows=1) (actual time=0.003..0.003 rows=1 loops=276)
                -> Filter: ((rm.order_number = 1) and (rm.price <= 15000))  (cost=4.01 rows=1) (actual time=0.034..0.057 rows=1 loops=15)
                    -> Index lookup on rm using FKcf3h9qggpuiewy6h21s2j4e1h (restaurant_id=rl.restaurant_id)  (cost=4.01 rows=11) (actual time=0.032..0.056 rows=12 loops=15)

드라이빙 테이블로 food_type을 사용하고 있지만, 인덱스를 활용해서 컬럼들을 걸러내는 테이블은 restaurant_location 테이블임을 확인할 수 있습니다. restaurant_location 테이블은 location 인덱스인 idx_rl_location으로 인덱스 레인지 스캔을 사용해 레코드들을 필터링 하고 있는 것을 확인할 수 있습니다.

🎯 쿼리 실행 시간 측정

결과적으로 쿼리 실행시간을 15ms로 대폭 단축한 결과를 확인할 수 있었습니다.

vcho1958 commented 1 year ago

믾이 배워갑니당

posomo / saltit-server