Open n0099 opened 4 months ago
tbmi_metadata_embedded_exif
结构:
CREATE TABLE `tbmi_metadata_embedded_exif` (
`imageId` int unsigned NOT NULL,
`orientation` text,
`imageDescription` text,
`userComment` blob,
`artist` text,
`xpAuthor` blob,
`copyright` text,
`imageUniqueId` text,
`bodySerialNumber` text,
`make` text,
`model` text,
`software` text,
`customRendered` smallint DEFAULT NULL,
`dateTime` datetime DEFAULT NULL,
`dateTimeOffset` text,
`dateTimeDigitized` datetime DEFAULT NULL,
`dateTimeDigitizedOffset` text,
`dateTimeOriginal` datetime DEFAULT NULL,
`dateTimeOriginalOffset` text,
`offsetTime` text,
`offsetTimeDigitized` text,
`offsetTimeOriginal` text,
`gpsDateTime` datetime DEFAULT NULL,
`gpsCoordinate` point DEFAULT NULL,
`gpsImgDirection` float DEFAULT NULL,
`gpsImgDirectionRef` text,
`rawBytes` blob,
`xxHash3` binary(8) NOT NULL,
PRIMARY KEY (`imageId`)
) ENGINE=InnoDB CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
由pgloader在pgsql端按照2.3.
中mysql://
源下默认的CASTING
所创建:
create table tbm.tbmi_metadata_embedded_exif (
"imageId" bigint primary key not null,
orientation text,
"imageDescription" text,
"userComment" text,
artist text,
"xpAuthor" text,
copyright text,
"imageUniqueId" text,
"bodySerialNumber" text,
make text,
model text,
software text,
"customRendered" smallint,
"dateTime" timestamp with time zone,
"dateTimeOffset" text,
"dateTimeDigitized" timestamp with time zone,
"dateTimeDigitizedOffset" text,
"dateTimeOriginal" timestamp with time zone,
"dateTimeOriginalOffset" text,
"offsetTime" text,
"offsetTimeDigitized" text,
"offsetTimeOriginal" text,
"gpsDateTime" timestamp with time zone,
"gpsCoordinate" point,
"gpsImgDirection" double precision,
"gpsImgDirectionRef" text,
"rawBytes" bytea,
"xxHash3" bytea not null
);
10.1. 使用5.1.3.
中的symmetric difference FULL OUTER JOIN row(table)
SELECT COUNT(*) FROM ONLY mysql."tbmi_metadata_embedded_exif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_embedded_exif" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;
可得7.1.
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
10.1.1. 执行5.1.1.
SELECT COUNT(*) FROM mysql."tbmi_metadata_embedded_exif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_embedded_exif" b
USING ("imageId", "orientation", "imageDescription", "userComment", "artist", "xpAuthor", "copyright", "imageUniqueId", "bodySerialNumber", "make", "model", "software", "customRendered", "dateTime", "dateTimeOffset", "dateTimeDigitized", "dateTimeDigitizedOffset", "dateTimeOriginal", "dateTimeOriginalOffset", "offsetTime", "offsetTimeDigitized", "offsetTimeOriginal", "gpsDateTime", "gpsCoordinate", "gpsImgDirection", "gpsImgDirectionRef", "rawBytes", "xxHash3")
WHERE (a."imageId" IS NULL AND a."orientation" IS NULL AND a."imageDescription" IS NULL AND a."userComment" IS NULL AND a."artist" IS NULL AND a."xpAuthor" IS NULL AND a."copyright" IS NULL AND a."imageUniqueId" IS NULL AND a."bodySerialNumber" IS NULL AND a."make" IS NULL AND a."model" IS NULL AND a."software" IS NULL AND a."customRendered" IS NULL AND a."dateTime" IS NULL AND a."dateTimeOffset" IS NULL AND a."dateTimeDigitized" IS NULL AND a."dateTimeDigitizedOffset" IS NULL AND a."dateTimeOriginal" IS NULL AND a."dateTimeOriginalOffset" IS NULL AND a."offsetTime" IS NULL AND a."offsetTimeDigitized" IS NULL AND a."offsetTimeOriginal" IS NULL AND a."gpsDateTime" IS NULL AND a."gpsCoordinate" IS NULL AND a."gpsImgDirection" IS NULL AND a."gpsImgDirectionRef" IS NULL AND a."rawBytes" IS NULL AND a."xxHash3" IS NULL)
OR (b."imageId" IS NULL AND b."orientation" IS NULL AND b."imageDescription" IS NULL AND b."userComment" IS NULL AND b."artist" IS NULL AND b."xpAuthor" IS NULL AND b."copyright" IS NULL AND b."imageUniqueId" IS NULL AND b."bodySerialNumber" IS NULL AND b."make" IS NULL AND b."model" IS NULL AND b."software" IS NULL AND b."customRendered" IS NULL AND b."dateTime" IS NULL AND b."dateTimeOffset" IS NULL AND b."dateTimeDigitized" IS NULL AND b."dateTimeDigitizedOffset" IS NULL AND b."dateTimeOriginal" IS NULL AND b."dateTimeOriginalOffset" IS NULL AND b."offsetTime" IS NULL AND b."offsetTimeDigitized" IS NULL AND b."offsetTimeOriginal" IS NULL AND b."gpsDateTime" IS NULL AND b."gpsCoordinate" IS NULL AND b."gpsImgDirection" IS NULL AND b."gpsImgDirectionRef" IS NULL AND b."rawBytes" IS NULL AND b."xxHash3" IS NULL);
会有
ERROR: operator does not exist: point = point
Hint: No operator matches the given name and argument types. You might need to add explicit type casts.
10.1.2. 按照 https://dba.stackexchange.com/questions/252066/how-to-formulate-equality-predicate-on-point-column-in-postgresql/252112#252112 使用5.1.2.1.
SELECT COUNT(*) FROM mysql."tbmi_metadata_embedded_exif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_embedded_exif" b ON
(a."imageId" = b."imageId") AND
(a."orientation" = b."orientation" OR (a."orientation" IS NULL AND b."orientation" IS NULL)) AND
(a."imageDescription" = b."imageDescription" OR (a."imageDescription" IS NULL AND b."imageDescription" IS NULL)) AND
(a."userComment" = b."userComment" OR (a."userComment" IS NULL AND b."userComment" IS NULL)) AND
(a."artist" = b."artist" OR (a."artist" IS NULL AND b."artist" IS NULL)) AND
(a."xpAuthor" = b."xpAuthor" OR (a."xpAuthor" IS NULL AND b."xpAuthor" IS NULL)) AND
(a."copyright" = b."copyright" OR (a."copyright" IS NULL AND b."copyright" IS NULL)) AND
(a."imageUniqueId" = b."imageUniqueId" OR (a."imageUniqueId" IS NULL AND b."imageUniqueId" IS NULL)) AND
(a."bodySerialNumber" = b."bodySerialNumber" OR (a."bodySerialNumber" IS NULL AND b."bodySerialNumber" IS NULL)) AND
(a."make" = b."make" OR (a."make" IS NULL AND b."make" IS NULL)) AND
(a."model" = b."model" OR (a."model" IS NULL AND b."model" IS NULL)) AND
(a."software" = b."software" OR (a."software" IS NULL AND b."software" IS NULL)) AND
(a."customRendered" = b."customRendered" OR (a."customRendered" IS NULL AND b."customRendered" IS NULL)) AND
(a."dateTime" = b."dateTime" OR (a."dateTime" IS NULL AND b."dateTime" IS NULL)) AND
(a."dateTimeOffset" = b."dateTimeOffset" OR (a."dateTimeOffset" IS NULL AND b."dateTimeOffset" IS NULL)) AND
(a."dateTimeDigitized" = b."dateTimeDigitized" OR (a."dateTimeDigitized" IS NULL AND b."dateTimeDigitized" IS NULL)) AND
(a."dateTimeDigitizedOffset" = b."dateTimeDigitizedOffset" OR (a."dateTimeDigitizedOffset" IS NULL AND b."dateTimeDigitizedOffset" IS NULL)) AND
(a."dateTimeOriginal" = b."dateTimeOriginal" OR (a."dateTimeOriginal" IS NULL AND b."dateTimeOriginal" IS NULL)) AND
(a."dateTimeOriginalOffset" = b."dateTimeOriginalOffset" OR (a."dateTimeOriginalOffset" IS NULL AND b."dateTimeOriginalOffset" IS NULL)) AND
(a."offsetTime" = b."offsetTime" OR (a."offsetTime" IS NULL AND b."offsetTime" IS NULL)) AND
(a."offsetTimeDigitized" = b."offsetTimeDigitized" OR (a."offsetTimeDigitized" IS NULL AND b."offsetTimeDigitized" IS NULL)) AND
(a."offsetTimeOriginal" = b."offsetTimeOriginal" OR (a."offsetTimeOriginal" IS NULL AND b."offsetTimeOriginal" IS NULL)) AND
(a."gpsDateTime" = b."gpsDateTime" OR (a."gpsDateTime" IS NULL AND b."gpsDateTime" IS NULL)) AND
(a."gpsCoordinate" ~= b."gpsCoordinate" OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND
(a."gpsImgDirection" = b."gpsImgDirection" OR (a."gpsImgDirection" IS NULL AND b."gpsImgDirection" IS NULL)) AND
(a."gpsImgDirectionRef" = b."gpsImgDirectionRef" OR (a."gpsImgDirectionRef" IS NULL AND b."gpsImgDirectionRef" IS NULL)) AND
(a."rawBytes" = b."rawBytes" OR (a."rawBytes" IS NULL AND b."rawBytes" IS NULL)) AND
(a."xxHash3" = b."xxHash3")
WHERE (a."imageId" IS NULL AND a."orientation" IS NULL AND a."imageDescription" IS NULL AND a."userComment" IS NULL AND a."artist" IS NULL AND a."xpAuthor" IS NULL AND a."copyright" IS NULL AND a."imageUniqueId" IS NULL AND a."bodySerialNumber" IS NULL AND a."make" IS NULL AND a."model" IS NULL AND a."software" IS NULL AND a."customRendered" IS NULL AND a."dateTime" IS NULL AND a."dateTimeOffset" IS NULL AND a."dateTimeDigitized" IS NULL AND a."dateTimeDigitizedOffset" IS NULL AND a."dateTimeOriginal" IS NULL AND a."dateTimeOriginalOffset" IS NULL AND a."offsetTime" IS NULL AND a."offsetTimeDigitized" IS NULL AND a."offsetTimeOriginal" IS NULL AND a."gpsDateTime" IS NULL AND a."gpsCoordinate" IS NULL AND a."gpsImgDirection" IS NULL AND a."gpsImgDirectionRef" IS NULL AND a."rawBytes" IS NULL AND a."xxHash3" IS NULL)
OR (b."imageId" IS NULL AND b."orientation" IS NULL AND b."imageDescription" IS NULL AND b."userComment" IS NULL AND b."artist" IS NULL AND b."xpAuthor" IS NULL AND b."copyright" IS NULL AND b."imageUniqueId" IS NULL AND b."bodySerialNumber" IS NULL AND b."make" IS NULL AND b."model" IS NULL AND b."software" IS NULL AND b."customRendered" IS NULL AND b."dateTime" IS NULL AND b."dateTimeOffset" IS NULL AND b."dateTimeDigitized" IS NULL AND b."dateTimeDigitizedOffset" IS NULL AND b."dateTimeOriginal" IS NULL AND b."dateTimeOriginalOffset" IS NULL AND b."offsetTime" IS NULL AND b."offsetTimeDigitized" IS NULL AND b."offsetTimeOriginal" IS NULL AND b."gpsDateTime" IS NULL AND b."gpsCoordinate" IS NULL AND b."gpsImgDirection" IS NULL AND b."gpsImgDirectionRef" IS NULL AND b."rawBytes" IS NULL AND b."xxHash3" IS NULL);
可得
ERROR: invalid input syntax for type point: "�"
其分别是 https://codepoints.net/U+FFFE 的utf8和0x10
https://codepoints.net/U+0010 (github将其便乘了两个U+FFFE
)
10.1.3. 即便尝试::text
也仍然相同
- (a."gpsCoordinate" ~= b."gpsCoordinate" OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND
+ (a."gpsCoordinate"::text = b."gpsCoordinate"::text OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND
事实核查:截止2024年4月,mysql_fdw仍然不支持转换任何gis类型 https://github.com/EnterpriseDB/mysql_fdw/issues/260
10.2.mysql创建视图
(类似4.3.
中的表继承
,当然pgsql也有着isosql
中的视图)
CREATE OR REPLACE VIEW tbmi_metadata_embedded_exif_fdw
AS SELECT *,
ST_Latitude(gpsCoordinate) gpsCoordinateLat,
ST_Longitude(gpsCoordinate) gpsCoordinateLng
FROM tbmi_metadata_embedded_exif;
SELECT imageId, gpsCoordinate, gpsCoordinateLat, gpsCoordinateLng
FROM tbmi_metadata_embedded_exif_fdw
WHERE gpsCoordinate IS NOT NULL;
清理:
DROP VIEW tbmi_metadata_embedded_exif_fdw;
10.2.1. 按照4.3.
DROP FOREIGN TABLE mysql."tbmi_metadata_embedded_exif";
CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmi_metadata_embedded_exif"(
"gpsCoordinateLat" float8,
"gpsCoordinateLng" float8)
INHERITS (tbm."tbmi_metadata_embedded_exif")
SERVER mysql
OPTIONS (dbname 'tbm', table_name 'tbmi_metadata_embedded_exif_fdw');
- (a."gpsCoordinate" ~= b."gpsCoordinate" OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND
+ ((a."gpsCoordinateLat" = b."gpsCoordinate"[0] AND a."gpsCoordinateLng" = b."gpsCoordinate"[1])
+ OR (a."gpsCoordinate" IS NULL AND b."gpsCoordinate" IS NULL)) AND
-WHERE (a."imageId" IS NULL AND a."orientation" IS NULL AND a."imageDescription" IS NULL AND a."userComment" IS NULL AND a."artist" IS NULL AND a."xpAuthor" IS NULL AND a."copyright" IS NULL AND a."imageUniqueId" IS NULL AND a."bodySerialNumber" IS NULL AND a."make" IS NULL AND a."model" IS NULL AND a."software" IS NULL AND a."customRendered" IS NULL AND a."dateTime" IS NULL AND a."dateTimeOffset" IS NULL AND a."dateTimeDigitized" IS NULL AND a."dateTimeDigitizedOffset" IS NULL AND a."dateTimeOriginal" IS NULL AND a."dateTimeOriginalOffset" IS NULL AND a."offsetTime" IS NULL AND a."offsetTimeDigitized" IS NULL AND a."offsetTimeOriginal" IS NULL AND a."gpsDateTime" IS NULL AND a."gpsCoordinate" IS NULL AND a."gpsImgDirection" IS NULL AND a."gpsImgDirectionRef" IS NULL AND a."rawBytes" IS NULL AND a."xxHash3" IS NULL)
+WHERE (a."imageId" IS NULL AND a."orientation" IS NULL AND a."imageDescription" IS NULL AND a."userComment" IS NULL AND a."artist" IS NULL AND a."xpAuthor" IS NULL AND a."copyright" IS NULL AND a."imageUniqueId" IS NULL AND a."bodySerialNumber" IS NULL AND a."make" IS NULL AND a."model" IS NULL AND a."software" IS NULL AND a."customRendered" IS NULL AND a."dateTime" IS NULL AND a."dateTimeOffset" IS NULL AND a."dateTimeDigitized" IS NULL AND a."dateTimeDigitizedOffset" IS NULL AND a."dateTimeOriginal" IS NULL AND a."dateTimeOriginalOffset" IS NULL AND a."offsetTime" IS NULL AND a."offsetTimeDigitized" IS NULL AND a."offsetTimeOriginal" IS NULL AND a."gpsDateTime" IS NULL AND a."gpsCoordinate" IS NULL AND a."gpsCoordinateLat" IS NULL AND a."gpsCoordinateLng" IS NULL AND a."gpsImgDirection" IS NULL AND a."gpsImgDirectionRef" IS NULL AND a."rawBytes" IS NULL AND a."xxHash3" IS NULL)
10.2.2. 然而
ERROR: failed to execute the MySQL query:
Invalid GIS data provided to function st_latitude.
https://dev.mysql.com/doc/refman/8.0/en/gis-point-property-functions.html
- If any geometry argument is not a syntactically well-formed geometry, an ER_GIS_INVALID_DATA error occurs.
- If a longitude or latitude value is out of range, an error occurs:
- If a longitude value is not in the range (−180, 180], an ER_LONGITUDE_OUT_OF_RANGE error occurs.
- If a latitude value is not in the range [−90, 90], an ER_LATITUDE_OUT_OF_RANGE error occurs.
https://dev.mysql.com/doc/mysql-errors/8.0/en/server-error-reference.html#error_er_gis_invalid_data
Error number: 3037; Symbol: ER_GIS_INVALID_DATA; SQLSTATE: 22023 Message: Invalid GIS data provided to function %s. A spatial function was called with an argument not recognized as a valid geometry value.
10.2.3. 但即便直接在mysql端绕过视图换成ST_X()
ST_Y()
也一样
SELECT COUNT(ST_X(gpsCoordinate)) FROM tbmi_metadata_embedded_exif WHERE gpsCoordinate IS NOT NULL;
[3037] Data truncation: Invalid GIS data provided to function st_x.
10.2.4. 但这又正常
SELECT COUNT(*) FROM (
SELECT gpsCoordinate, ST_X(gpsCoordinate) FROM tbmi_metadata_embedded_exif WHERE gpsCoordinate IS NOT NULL
) t;
-- 10026
10.3. 换wkt https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry 表达
CREATE OR REPLACE VIEW tbmi_metadata_embedded_exif_fdw
AS SELECT *,
ST_AsWKT(gpsCoordinate) gpsCoordinateWKT,
ST_AsWKB(gpsCoordinate) gpsCoordinateWKB
FROM tbmi_metadata_embedded_exif;
SELECT COUNT(gpsCoordinate), COUNT(gpsCoordinateWKT) FROM tbmi_metadata_embedded_exif_fdw WHERE gpsCoordinate IS NOT NULL;
COUNT(gpsCoordinate) |
COUNT(gpsCoordinateWKT) |
---|---|
10026 | 10016 |
可得有10行wkt为NULL
10.3.1. 直接查看mysql内部存储gis类型
所用的二进制encoding
SELECT COUNT(gpsCoordinate), gpsCoordinate, COUNT(gpsCoordinateWKT), gpsCoordinateWKT
FROM tbmi_metadata_embedded_exif_fdw
WHERE gpsCoordinate IS NOT NULL AND gpsCoordinateWKT IS NULL
GROUP BY gpsCoordinate, gpsCoordinateWKT;
COUNT(gpsCoordinate) |
gpsCoordinate | COUNT(gpsCoordinateWKT) |
gpsCoordinateWKT |
---|---|---|---|
9 | 0xE61000000101000000000000000000F8FF000000000000F8FF | 0 | null |
1 | 0xE61000000101000000000000000000F8FF98B868DB43A13E40 | 0 | null |
使用 https://rodic.fr/wp-content/uploads/2015/11/geom_converter.html 无法解析wkb之0xE61000000101000000000000000000F8FF000000000000F8FF
而 https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary 对此早有预言
The first byte indicates the byte order for the data:
- 00 : big endian
- 01 : little endian
手动修改第一个字节0xE6
为0x00
后的001000000101000000000000000000F8FF000000000000F8FF
得到POINT(7.291122019556398e-304 1.352623777034994e-309)
暂且蒙古 https://github.com/n0099/open-tbm/blob/252710bf61ca63cc78eeb6373b4c564e8dbb0217/c%23/imagePipeline/src/Consumer/MetadataConsumer.cs#L200 是如何产生这10个神必丶的
10.3.2. 但实际上mysql内部存储gis类型
所用的二进制encoding
与wkb并不一致
SELECT gpsCoordinate, gpsCoordinateWKB, gpsCoordinateWKT FROM tbmi_metadata_embedded_exif_fdw WHERE gpsCoordinate != gpsCoordinateWKB;
gpsCoordinate | gpsCoordinateWKB | gpsCoordinateWKT |
---|---|---|
0xE61000000101000000FD4DC6BCFF565C400BCD751A69193740 |
0x01010000000BCD751A69193740FD4DC6BCFF565C40 |
POINT(23.099259999999997 113.35935897222221) |
0xE61000000101000000FD4DC6BCFF565C400BCD751A69193740 |
0x01010000000BCD751A69193740FD4DC6BCFF565C40 |
POINT(23.099259999999997 113.35935897222221) |
0xE61000000101000000E605D847A70C5E40B88E71C5C5293E40 |
0x0101000000B88E71C5C5293E40E605D847A70C5E40 |
POINT(30.163173999999998 120.19771) |
0xE6100000010100000012F4FF7FB67861406A96FF1F91D94140 |
0x01010000006A96FF1F91D9414012F4FF7FB6786140 |
POINT(35.69974136333333 139.77227783194445) |
0xE6100000010100000016F1FF1F9679614082CBFFFFF2D04140 |
0x010100000082CBFFFFF2D0414016F1FF1F96796140 |
POINT(35.63241577138889 139.79957580555555) |
0xE61000000101000000B4EEFF9F9A79614005FDFF9FF7D04140 |
0x010100000005FDFF9FF7D04140B4EEFF9F9A796140 |
POINT(35.63255691527778 139.80012512194446) |
... |
10.3.3. 而如果只看wkb则所有行都能转为wkb
SELECT COUNT(gpsCoordinate), COUNT(gpsCoordinateWKB) FROM tbmi_metadata_embedded_exif_fdw WHERE gpsCoordinate IS NOT NULL;
COUNT(gpsCoordinate) |
COUNT(gpsCoordinateWKB) |
---|---|
10026 | 10026 |
10.3.4. 扩展10.3.1.
可得
SELECT COUNT(gpsCoordinate), gpsCoordinate,
COUNT(gpsCoordinateWKT), gpsCoordinateWKT,
COUNT(gpsCoordinateWKB), gpsCoordinateWKB,
ST_PointFromWKB(gpsCoordinateWKB) a
FROM tbmi_metadata_embedded_exif_fdw
WHERE gpsCoordinate IS NOT NULL AND gpsCoordinateWKT IS NULL
GROUP BY gpsCoordinate, gpsCoordinateWKT, gpsCoordinateWKB;
COUNT(gpsCoordinate) |
gpsCoordinate | COUNT(gpsCoordinateWKT) |
gpsCoordinateWKT | COUNT(gpsCoordinateWKB) |
gpsCoordinateWKB | a |
---|---|---|---|---|---|---|
9 | 0xE61000000101000000000000000000F8FF000000000000F8FF | 0 | null | 9 | 0x0101000000000000000000F8FF000000000000F8FF | 0x000000000101000000000000000000F8FF000000000000F8FF |
1 | 0xE61000000101000000000000000000F8FF98B868DB43A13E40 | 0 | null | 1 | 0x010100000098B868DB43A13E40000000000000F8FF | 0x00000000010100000098B868DB43A13E40000000000000F8FF |
0x0101000000000000000000F8FF000000000000F8FF
是POINT(NaN NaN)
0x010100000098B868DB43A13E40000000000000F8FF
是POINT(30.629941666666667 NaN)
值得注意的是重新ST_PointFromWKB()
使得头4字节0xE610
便乘了0x0000
暂且怀疑是 https://github.com/PomeloFoundation/Pomelo.EntityFrameworkCore.MySql/blob/156585fd44ddad7e2594ef99176effac39f92a3c/test/EFCore.MySql.FunctionalTests/MySqlMigrationsSqlGeneratorTest.cs#L133 亦或其幕后的 https://github.com/search?q=repo%3ANetTopologySuite%2FNetTopologySuite+%22E610%22&type=code 及其深层transitive依赖
在INSERT
时(因而在.net层
而非mysql层
完成了转换所以不一致)就写入了mysql内部存储gis类型
所用的二进制encoding
而非wkb/t
10.3.5. 而在INSERT INTO
时并不会检查point
的x
y
是否为NaN
https://bugs.mysql.com/bug.php?id=80123
SELECT COUNT(gpsCoordinate), gpsCoordinate, ST_IsValid(gpsCoordinate)
FROM tbmi_metadata_embedded_exif
WHERE gpsCoordinate IS NOT NULL AND ST_AsWKT(gpsCoordinate) IS NULL
GROUP BY gpsCoordinate;
[3037] Data truncation: Invalid GIS data provided to function st_isvalid.
10.4. 由于pgsql本身不提供gis用途的ST_*()
函数如 https://postgis.net/docs/ST_PointFromWKB.html 只提供了类型供存储因而需要手动安装所谓的后地理信息系统
https://en.wikipedia.org/wiki/PostGIS 扩展 https://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS3UbuntuPGSQLApt
sudo apt install -y postgresql-16-postgis-3
CREATE EXTENSION IF NOT EXISTS postgis;
清理:
DROP EXTENSION IF EXISTS postgis;
sudo apt purge -y postgresql-16-postgis-3 &&
sudo apt autoremove --purge
10.5. 如同9.2.
mysql端
CREATE TABLE tbmi_metadata_embedded_exif_wkb LIKE tbmi_metadata_embedded_exif;
INSERT INTO tbmi_metadata_embedded_exif_wkb SELECT * FROM tbmi_metadata_embedded_exif;
ALTER TABLE tbmi_metadata_embedded_exif_wkb ADD COLUMN gpsCoordinateWKB BLOB NULL AFTER gpsCoordinate;
UPDATE tbmi_metadata_embedded_exif_wkb SET gpsCoordinateWKB = ST_AsWKB(gpsCoordinate) WHERE gpsCoordinate IS NOT NULL;
SELECT COUNT(gpsCoordinate), COUNT(gpsCoordinateWKB) FROM tbmi_metadata_embedded_exif_wkb;
-- expecting the same count
SELECT COUNT(*) FROM tbmi_metadata_embedded_exif_wkb
WHERE (gpsCoordinate IS NULL AND gpsCoordinateWKB IS NOT NULL)
OR (gpsCoordinateWKB IS NULL AND gpsCoordinate IS NOT NULL);
-- expecting 0 rows returned, then:
-- ALTER TABLE tbmi_metadata_embedded_exif_wkb DROP COLUMN gpsCoordinate;
清理:
DROP TABLE tbmi_metadata_embedded_exif_wkb;
10.5.1. pgsql端准备重新导入
DROP TABLE tbm."tbmi_metadata_embedded_exif" CASCADE;
类比7.1.5.
导入mysql表tbmi_metadata_embedded_exif_wkb
CREATE FOREIGN TABLE IF NOT EXISTS mysql."tbmi_metadata_embedded_exif_wkb"()
INHERITS (tbm."tbmi_metadata_embedded_exif_wkb")
SERVER mysql
OPTIONS (dbname 'tbm', table_name 'tbmi_metadata_embedded_exif_wkb');
10.5.2. 由于根据10.
该表中有4个从datetime
mysql类型列转换来的timestamptz
pgsql类型列因而需要先7.2.1.
再重新执行10.1.
校验
"dateTime" timestamp with time zone,
"dateTimeDigitized" timestamp with time zone,
"dateTimeOriginal" timestamp with time zone,
"gpsDateTime" timestamp with time zone,
10.5.3. 10.5.1.
后有pgsql表结构
create table tbm.tbmi_metadata_embedded_exif_wkb (
"imageId" bigint primary key not null,
orientation text,
"imageDescription" text,
"userComment" bytea,
artist text,
"xpAuthor" bytea,
copyright text,
"imageUniqueId" text,
"bodySerialNumber" text,
make text,
model text,
software text,
"customRendered" smallint,
"dateTime" timestamp with time zone,
"dateTimeOffset" text,
"dateTimeDigitized" timestamp with time zone,
"dateTimeDigitizedOffset" text,
"dateTimeOriginal" timestamp with time zone,
"dateTimeOriginalOffset" text,
"offsetTime" text,
"offsetTimeDigitized" text,
"offsetTimeOriginal" text,
"gpsDateTime" timestamp with time zone,
"gpsCoordinateWKB" bytea,
"gpsImgDirection" double precision,
"gpsImgDirectionRef" text,
"rawBytes" bytea,
"xxHash3" bytea not null
);
由于pgsql长期以来无法像mysql那样一键移动 https://stackoverflow.com/questions/6805426/how-to-move-columns-in-a-mysql-table 列顺序 https://stackoverflow.com/questions/285733/how-do-i-alter-the-position-of-a-column-in-a-postgresql-database-table http://wiki.postgresql.org/wiki/Alter_column_position 因而需要手动创建
-create table tbm.tbmi_metadata_embedded_exif_wkb (
+create table tbm.tbmi_metadata_embedded_exif (
- "gpsCoordinateWKB" bytea,
+ "gpsCoordinate" point,
10.5.4. 如同10.5.
pgsql端
DROP FOREIGN TABLE mysql."tbmi_metadata_embedded_exif_wkb"; -- 避免`4.3.2.`
ALTER TABLE tbm."tbmi_metadata_embedded_exif_wkb" ADD COLUMN "gpsCoordinate" POINT;
UPDATE tbm."tbmi_metadata_embedded_exif_wkb" SET "gpsCoordinate" = st_pointfromwkb("gpsCoordinateWKB")::point WHERE "gpsCoordinateWKB" IS NOT NULL;
SELECT COUNT("gpsCoordinate"), COUNT("gpsCoordinateWKB") FROM tbm."tbmi_metadata_embedded_exif_wkb";
-- expecting the same count
SELECT COUNT(*) FROM tbm."tbmi_metadata_embedded_exif_wkb"
WHERE ("gpsCoordinate" IS NULL AND "gpsCoordinateWKB" IS NOT NULL)
OR ("gpsCoordinateWKB" IS NULL AND "gpsCoordinate" IS NOT NULL);
-- expecting 0 rows returned, then:
ALTER TABLE tbm."tbmi_metadata_embedded_exif_wkb" DROP COLUMN "gpsCoordinateWKB";
INSERT INTO tbm."tbmi_metadata_embedded_exif"
("imageId", orientation, "imageDescription", "userComment", artist, "xpAuthor", copyright, "imageUniqueId", "bodySerialNumber", make, model, software, "customRendered", "dateTime", "dateTimeOffset", "dateTimeDigitized", "dateTimeDigitizedOffset", "dateTimeOriginal", "dateTimeOriginalOffset", "offsetTime", "offsetTimeDigitized", "offsetTimeOriginal", "gpsDateTime", "gpsCoordinate", "gpsImgDirection", "gpsImgDirectionRef", "rawBytes", "xxHash3")
SELECT "imageId", orientation, "imageDescription", "userComment", artist, "xpAuthor", copyright, "imageUniqueId", "bodySerialNumber", make, model, software, "customRendered", "dateTime", "dateTimeOffset", "dateTimeDigitized", "dateTimeDigitizedOffset", "dateTimeOriginal", "dateTimeOriginalOffset", "offsetTime", "offsetTimeDigitized", "offsetTimeOriginal", "gpsDateTime", "gpsCoordinate", "gpsImgDirection", "gpsImgDirectionRef", "rawBytes", "xxHash3"
FROM tbm."tbmi_metadata_embedded_exif_wkb";
10.5.5. 按10.1.
校验
-SELECT COUNT(*) FROM mysql."tbmi_metadata_embedded_exif" a FULL OUTER JOIN ONLY tbm."tbmi_metadata_embedded_exif" b ON
+SELECT COUNT(*) FROM tbm."tbmi_metadata_embedded_exif_wkb" a FULL OUTER JOIN tbm."tbmi_metadata_embedded_exif" b ON
清理:
DROP TABLE tbmi_metadata_embedded_exif_wkb;
VACUUM FULL tbmi_metadata_embedded_exif;
10.6. 然而10.3.1.
中的那10行中除POINT(30.629941666666667 NaN)
外的⑨行仍然无法st_pointfromwkb()
变为NULL
且10.3.5.
会对POINT(30.629941666666667 NaN)
外的⑨行呈现假阴性 https://trac.osgeo.org/postgis/ticket/5718
SELECT "gpsCoordinate",
"gpsCoordinateWKB",
st_isvalid(st_pointfromwkb("gpsCoordinateWKB"))
FROM tbmi_metadata_embedded_exif_wkb
WHERE ("gpsCoordinate" IS NULL AND "gpsCoordinateWKB" IS NOT NULL)
OR ("gpsCoordinateWKB" IS NULL AND "gpsCoordinate" IS NOT NULL);
暂且理解为ST_IsValid()
的实现不同
何谓10.4.
所装ST_*()
函数背后的超时空
精神 https://stackoverflow.com/questions/7234679/what-is-st-in-postgis https://gis.stackexchange.com/questions/286808/what-does-the-st-prefix-stand-for-e-g-in-st-polygonize
10.6.1. 理论上可以将point
类型列拆分存储为两个IEEE754类型x
y
列因为pgsql不像mysql https://stackoverflow.com/questions/41936403/mysql-ieee-floating-point-nan-positiveinfinity-negativeinfinity ,支持 https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-FLOAT 存储IEEE754中3大臭名昭著浪费空间 https://stackoverflow.com/questions/19800415/why-does-ieee-754-reserve-so-many-nan-values https://stackoverflow.com/questions/18118408/what-is-the-difference-between-quiet-nan-and-signaling-nan 的特殊值 https://en.wikipedia.org/wiki/NaN $+\infty$ $-\infty$
10.6.2. 最终选择了在mysql表中将所有10行都
UPDATE tbmi_metadata_embedded_exif
SET gpsCoordinate = NULL
WHERE gpsCoordinate IS NOT NULL AND ST_AsWKT(gpsCoordinate) IS NULL;
10 rows affected in 10 s 455 ms
SELECT imageId, gpsCoordinate, st_isvalid(gpsCoordinate)
FROM tbmi_metadata_embedded_exif
WHERE gpsCoordinate IS NOT NULL AND st_isvalid(gpsCoordinate) != 1
ORDER BY gpsCoordinate;
0 rows retrieved in 831 ms (execution: 778 ms, fetching: 53 ms)
后重新执行10.5.
至10.5.5.
3.
可见pgloader在pgsql端按照2.3.
中mysql://
源下默认的CASTING
会将AUTO_INCREMENT
的PKint
及其变种长度类型列转换为serial
类型列 https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL
- `id` int unsigned NOT NULL AUTO_INCREMENT,
+ id bigint primary key not null default nextval('tbmc_f97650_thread_id_seq'::regclass),
11.1. 而其早已被彻底批倒批臭 https://stackoverflow.com/questions/55300370/postgresql-serial-vs-identity
并在pgsql 10中实现了isosql
中的identify
可使用 https://www.2ndquadrant.com/en/blog/postgresql-10-identity-columns/ 中的 https://en.wikipedia.org/wiki/PL/pgSQL 3.3.
命令式存储过程
一键转换
CREATE OR REPLACE FUNCTION upgrade_serial_to_identity(tbl regclass, col name)
RETURNS void
LANGUAGE plpgsql
AS $$
DECLARE
colnum smallint;
seqid oid;
count int;
BEGIN
-- find column number
SELECT attnum INTO colnum FROM pg_attribute WHERE attrelid = tbl AND attname = col;
IF NOT FOUND THEN
RAISE EXCEPTION 'column does not exist';
END IF;
-- find sequence
SELECT INTO seqid objid
FROM pg_depend
WHERE (refclassid, refobjid, refobjsubid) = ('pg_class'::regclass, tbl, colnum)
AND classid = 'pg_class'::regclass AND objsubid = 0
AND deptype = 'a';
GET DIAGNOSTICS count = ROW_COUNT;
IF count < 1 THEN
RAISE EXCEPTION 'no linked sequence found';
ELSIF count > 1 THEN
RAISE EXCEPTION 'more than one linked sequence found';
END IF;
-- drop the default
EXECUTE 'ALTER TABLE ' || tbl || ' ALTER COLUMN ' || quote_ident(col) || ' DROP DEFAULT';
-- change the dependency between column and sequence to internal
UPDATE pg_depend
SET deptype = 'i'
WHERE (classid, objid, objsubid) = ('pg_class'::regclass, seqid, 0)
AND deptype = 'a';
-- mark the column as identity column
UPDATE pg_attribute
SET attidentity = 'd'
WHERE attrelid = tbl
AND attname = col;
END;
$$;
11.2. mysql端一键生成针对便乘serial
的AUTO_INCREMENT
调用上述转换的pgsql
SELECT CONCAT(
'SELECT upgrade_serial_to_identity(\'tbm."',
TABLE_NAME,
'"\', \'',
COLUMN_NAME,
'\');'
)
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'tbm' AND EXTRA LIKE '%auto_increment%'
AND TABLE_NAME REGEXP '^tbm(_|i_(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))'
ORDER BY TABLE_NAME;
REGEXP部分同2.1.
7.2.1. 由于
timestamptz
实际上并没有额外存储时区UTC offset
stackoverflow.com/questions/5876218/difference-between-timestamps-with-without-time-zone-in-postgresql 因而其并不能解决 z.n0099.net/#narrow/near/86397 old.reddit.com/r/PostgreSQL/comments/xpygbh/when_would_i_ever_use_timestamp_over_timestamptz news.ycombinator.com/item?id=20212671 中争论的带时区引用未来指定datetime
问题 w3.org/International/wiki/WorkingWithTimeZones#Past_and_Future_EventsThe SQL data types 'date', 'time', and 'timestamp' are field based time values which are intended to be zone offset independent: they are actually, technically, floating time values! The data type 'timestamp with time zone' is the zone offset-dependent equivalent of 'timestamp' in SQL.
w3.org/International/wiki/WorkingWithTimeZones#Floating_Time
- Floating Time Some observed time values are not related to a specific moment in incremental time. Instead, they need to be combined with local information to determine a range of acceptable incremental time values. We refer to these sorts of time values as "floating times". Floating times are not attached and should never be attached to a particular time zone. Examples of floating time events include a user’s birth date, an employee’s hire or termination date, or a list of company holidays. For example, suppose that January 1st is considered a holiday in your application. The day "January 1" has a floating-time status as a "holiday". That "day" can begin as early as midnight GMT-14:00 and end as late as midnight of January 2 GMT+12:00, depending on local time. This covers an incremental time range of fifty hours.
https://news.ycombinator.com/item?id=19500640 https://news.ycombinator.com/item?id=30659164 https://www.roji.org/storing-timezones-in-the-db https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard#TIMESTAMP_WITH_TIME_ZONE
9.2. 实际上所谓的
opencv_imghash
Starry-OvO/aiotieba#63 (comment) 之averageHash
本质上更类似于 en.wikipedia.org/wiki/Nominal_number 因而不应该使用具有
可运算的正整数
语义之类型存储而应直接视作斑点二进制
en.wikipedia.org/wiki/Binary_large_objectALTER TABLE tbmi_hash ADD COLUMN averageHash_2 BINARY(8) NOT NULL AFTER averageHash; UPDATE tbmi_hash SET averageHash_2 = UNHEX(LPAD(HEX(averageHash), 16, '00')); -- https://stackoverflow.com/questions/58611436/how-do-you-convert-an-integer-into-its-condensed-binary-equivalent-e-g-binary3/58657797#comment138021380_58657797 SELECT COUNT(*) FROM tbmi_hash WHERE averageHash_2 != UNHEX(LPAD(HEX(averageHash), 16, '00')); -- expecting 0 rows returned, then: -- ALTER TABLE tbmi_hash DROP COLUMN averageHash, RENAME COLUMN averageHash_2 TO averageHash;
9.2.1. 事实上
https://github.com/n0099/open-tbm/blob/8eca9443fe19ac7b8f569d1b42dfd1d902250360/c%23/imagePipeline/src/Consumer/HashConsumer.cs#L18-L19
已经使得原byte[]
被反转地解释为了小端序
sharplab.io
using System;
using System.Linq;
byte[] big = [0, 1, 2, 3, 4, 5, 6, 7];
var small = big.Reverse().ToArray();
Console.WriteLine(Convert.ToHexString(big)); // 0001020304050607
Console.WriteLine(BitConverter.ToUInt64(big, 0)); // 506097522914230528
Console.WriteLine();
Console.WriteLine(Convert.ToHexString(small)); // 0706050403020100
Console.WriteLine(BitConverter.ToUInt64(small, 0)); // 283686952306183
SELECT UNHEX(LPAD(HEX(506097522914230528), 16, '00')),
UNHEX(LPAD(HEX(283686952306183), 16, '00'));
UNHEX(LPAD(HEX(506097522914230528), 16, '00')) | UNHEX(LPAD(HEX(283686952306183), 16, '00')) |
---|---|
0x0706050403020100 |
0x0001020304050607 |
而其他地方更常用的xxhash3 https://z.n0099.net/#narrow/near/88989 反而的确是大端序 https://source.dot.net/#System.IO.Hashing/System/IO/Hashing/XxHash3.cs,19a80b2f50453190,references https://github.com/dotnet/runtime/blob/92ca5f3227ada3f1620cd9b0ea631491303e4148/src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs#L73
SELECT t.b, text_reverse, decode(text_reverse, 'hex') AS bytea_reverse
FROM (SELECT '\x0001020304050607'::bytea b) t
LEFT JOIN LATERAL (
SELECT string_agg(to_hex(get_byte(b, x)), '') AS text_reverse
FROM generate_series(octet_length(t.b) - 1, 0, -1) x
) x ON true;
b | text_reverse | bytea_reverse |
---|---|---|
0x0001020304050607 |
76543210 |
0x76543210 |
CREATE OR REPLACE FUNCTION reverse(bytea) RETURNS bytea AS $reverse$
SELECT string_agg(byte,''::bytea)
FROM (
SELECT substr($1,i,1) byte
FROM generate_series(length($1),1,-1) i) s
$reverse$ LANGUAGE sql;
SELECT reverse('\x0001020304050607'::bytea);
DROP FUNCTION reverse(bytea);
reverse |
---|
0x0706050403020100 |
mysql端
UPDATE tbmi_hash SET pHash = REVERSE(pHash), averageHash = REVERSE(averageHash);
-- 14,322,348 rows affected in 20 m 6 s 174 ms
pgsql端
UPDATE tbmi_hash SET "pHash" = reverse("pHash"), "averageHash" = reverse("averageHash");
-- 14,322,348 rows affected in 28 m 56 s 671 ms
2.5. 视奸正在导入什么表 https://stackoverflow.com/questions/35319597/how-to-stop-kill-a-query-in-postgresql
SELECT * FROM pg_stat_activity WHERE application_name = 'pgloader';
可根据
表大小
猜测进度 stackoverflow.com/questions/21738408/postgresql-list-and-order-tables-by-size/21738505#21738505SELECT table_schema, table_name, pg_relation_size('"'||table_schema||'"."'||table_name||'"')::float8 / 1024 / 1024 "MiB" FROM information_schema.tables WHERE table_schema = 'tbm' ORDER BY 3 DESC;
https://www.postgresql.org/docs/current/functions-admin.html https://stackoverflow.com/questions/41991380/whats-the-difference-between-pg-table-size-pg-relation-size-pg-total-relatio
WITH t AS (
SELECT table_schema, table_name,
('"' || table_schema || '"."' || table_name ||'"')::regclass regclass
FROM information_schema.tables
WHERE table_schema = 'tbm'
)
SELECT table_schema, table_name,
pg_relation_filepath(regclass),
pg_indexes_size(regclass) / 1024 / 1024 i,
pg_table_size(regclass) / 1024 / 1024 t,
pg_relation_size(regclass) / 1024 / 1024 r,
pg_total_relation_size(regclass) / 1024 / 1024 tr
FROM t
ORDER BY tr DESC;
tbmc_f97650_reply
结构
create table tbmc_f97650_reply (
id bigint primary key not null default nextval('tbmc_f97650_reply_id_seq'::regclass),
tid bigint not null,
pid bigint not null,
floor bigint not null,
"authorUid" bigint not null,
"subReplyCount" bigint,
"postedAt" bigint not null,
"isFold" smallint,
"agreeCount" integer,
"disagreeCount" integer,
geolocation bytea,
"signatureId" bigint,
"createdAt" bigint not null,
"updatedAt" bigint,
"lastSeenAt" bigint
);
create unique index idx_23148_pid on tbmc_f97650_reply using btree (pid);
create index "idx_23148_postTime" on tbmc_f97650_reply using btree ("postedAt");
create index "idx_23148_authorUid" on tbmc_f97650_reply using btree ("authorUid");
create index idx_23148_tid on tbmc_f97650_reply using btree (tid);
12.1. 其中业务无关自增PK
之id
属于 https://en.wikipedia.org/wiki/Surrogate_key
此前在mysql中使用此类id代理键
是由于mysql innodb基于clustered index=primary index=PK https://dba.stackexchange.com/questions/195288/what-is-the-formal-definition-of-a-primary-index-and-primary-key 因而INSERT
不连续的PK值会使其不断重新平衡b+tree以保持PK序
与行在表中的物理存储序
同步 https://stackoverflow.com/questions/1251636/what-do-clustered-and-non-clustered-index-actually-mean
而众所周知pgsql如同老mysql myisam使用heap table因而无需不断重新平衡b+tree
https://use-the-index-luke.com/sql/clustering/index-organized-clustered-index
这也是为什么有3.2.
之迫真假随机序因为行物理存储序
在无序的heap而非良序stack aka clusteredindex中本就是随机的 https://www.postgresql.org/docs/current/indexes-index-only-scans.html
但正义执行 https://www.postgresql.org/docs/current/sql-cluster.html 可以在指定索引列
中新的不连续值又INSERT
/UPDATE
写入前让行物理存储序
与指定索引列
同步为一致的良序
12.2. 事实核查:正义削除该id代理键
再加上嵌入于表名中用于水平分表
https://en.wikipedia.org/wiki/Shard_(database_architecture) 的fid
可节省 $1-(1754/1978) \approx 11.3\%$ 的表大小
CREATE TABLE tbmc_reply (fid int NOT NULL, LIKE tbmc_f97650_reply INCLUDING ALL);
ALTER TABLE tbmc_reply DROP COLUMN id, ADD PRIMARY KEY (pid);
DROP INDEX tbmc_reply_pid_idx; -- 重复的UNIQUE(pid)
INSERT INTO tbmc_reply
(fid, tid, pid, floor, "authorUid", "subReplyCount", "postedAt", "isFold", "agreeCount", "disagreeCount", geolocation, "signatureId", "createdAt", "updatedAt", "lastSeenAt")
SELECT 97650 AS fid, tid, pid, floor, "authorUid", "subReplyCount", "postedAt", "isFold", "agreeCount", "disagreeCount", geolocation, "signatureId", "createdAt", "updatedAt", "lastSeenAt"
FROM tbmc_f97650_reply;
VACUUM FULL tbmc_reply;
12.3. 而进一步削除fid
梦回此前的水平分表之fid嵌入表名
ALTER TABLE tbmc_reply DROP COLUMN fid;
VACUUM FULL tbmc_reply;
也只会进一步节省 $1-(1754/1675) \approx 4.7\%$ 约80MiB
11.2. mysql端一键生成针对便乘
serial
的AUTO_INCREMENT
调用上述转换的pgsql
11.3.
SELECT *
FROM information_schema.columns
WHERE table_schema = 'tbm'
AND column_default IS NOT NULL;
varchar(n)
转text
SELECT 'ALTER TABLE tbm."' || table_name || '" ALTER COLUMN "' || column_name || '" TYPE text;'
FROM information_schema.columns
WHERE table_schema = 'tbm' AND data_type = 'character varying';
而索引中的列长度似乎没有被pgloader迁移过来
uniquify index names, preserve index names
MySQL index names are unique per-table whereas in PostgreSQL index names have to be unique per-schema. The default for pgloader is to change the index name by prefixing it with idx_OID where OID is the internal numeric identifier of the table the index is built against.
In somes cases like when the DDL are entirely left to a framework it might be sensible for pgloader to refrain from handling index unique names, that is achieved by using the preserve index names option.
The default is to uniquify index names.
Even when using the option preserve index names, MySQL primary key indexes named “PRIMARY” will get their names uniquified. Failing to do so would prevent the primary keys to be created again in PostgreSQL where the index names must be unique per schema.
一键~生成~eval重命名索引把所有索引重命名为$表名_$字段1_$字段2_...
再把PK索引重命名为_pkey
后缀 https://stackoverflow.com/questions/65069778/postgresql-rename-table-cascade-renaming-to-objects-like-indices-constraints/78485773#78485773
-- https://stackoverflow.com/questions/65069778/postgresql-rename-table-cascade-renaming-to-objects-like-indices-constraints/78485773#78485773
DO $$DECLARE r record;
BEGIN
FOR r IN
SELECT 'ALTER INDEX "' || schema_name || '"."' || index_name ||
'" RENAME TO "' || new_index_name || '";' sql, *
FROM (
SELECT schema_name, table_name, index_name,
table_name ||
CASE constraint_type
-- only uncomment below line when using postgresql<12 https://stackoverflow.com/questions/4107915/postgresql-default-constraint-names#comment109084417_4108266
-- WHEN 'f' THEN '_' || columns[1] -- https://stackoverflow.com/questions/4107915/postgresql-default-constraint-names#comment127573826_4108266
WHEN 'p' THEN '' -- https://stackoverflow.com/questions/4107915/postgresql-default-constraint-names#comment127553786_4108266
ELSE '_' || string_agg(column_name, '_' ORDER BY colum_index)
END || '_' ||
-- https://stackoverflow.com/questions/15417167/case-when-null-evaluates-to-false
CASE COALESCE(constraint_type, 'NULL')
-- https://stackoverflow.com/questions/4107915/postgresql-default-constraint-names/4108266#4108266
-- https://gist.github.com/popravich/d6816ef1653329fb1745
-- https://stackoverflow.com/questions/8674562/postgresql-is-it-possible-to-provide-custom-name-for-primary-key-or-unique/8674640#8674640
WHEN 'c' THEN 'check'
WHEN 'f' THEN 'fkey' -- TODO: foreign key constraint only exists on the referencing table
WHEN 'p' THEN 'pkey'
WHEN 'u' THEN 'key'
WHEN 'x' THEN 'excl'
WHEN 'NULL' THEN 'idx'
END new_index_name
FROM (
SELECT n.nspname schema_name, t.relname table_name, i.relname index_name, c.contype constraint_type, a.attname column_name, a.attnum colum_index
FROM pg_index x
-- https://stackoverflow.com/questions/37329561/how-to-list-indexes-created-for-table-in-postgres/37330092#37330092
JOIN pg_class t ON t.oid = x.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
JOIN pg_class i ON i.oid = x.indexrelid
-- https://stackoverflow.com/questions/55447819/how-to-get-the-list-of-column-names-for-all-indexes-on-a-table-in-postgresql/55448707#55448707
JOIN pg_attribute a ON a.attrelid = x.indrelid AND a.attnum = ANY(x.indkey)
-- https://www.postgresql.org/docs/current/catalog-pg-constraint.html
LEFT JOIN pg_constraint c ON c.conindid = x.indexrelid AND c.conrelid = t.oid
WHERE n.nspname = 'your_schema'
) t
GROUP BY schema_name, table_name, index_name, constraint_type
) t
WHERE index_name != new_index_name
LOOP
-- https://stackoverflow.com/questions/1348126/postgresql-modify-owner-on-all-tables-simultaneously-in-postgresql/37259655#37259655
EXECUTE r.sql;
END LOOP;
END$$;
12.2. 事实核查:正义削除该
id代理键
再加上嵌入于表名中用于水平分表
https://en.wikipedia.org/wiki/Shard_(database_architecture) 的fid
可节省 1−(1754/1978)≈11.3% 的`表大小
12.4. 一键~生成~eval为主题帖
回复帖
楼中楼
表削除id代理键
并重设PK
DO $$DECLARE r record;
BEGIN
FOR r IN
SELECT 'ALTER TABLE "' || table_schema || '"."' || table_name ||
'" DROP COLUMN id, ADD PRIMARY KEY USING INDEX "' || table_name || '_tid";' c
FROM information_schema.tables t
WHERE table_schema = 'tbm' AND table_name ~ '^tbmc_f\d+_thread$'
UNION
SELECT 'ALTER TABLE "' || table_schema || '"."' || table_name ||
'" DROP COLUMN id, ADD PRIMARY KEY USING INDEX "' || table_name || '_pid";' c
FROM information_schema.tables t
WHERE table_schema = 'tbm' AND table_name ~ '^tbmc_f\d+_reply$'
UNION
SELECT 'ALTER TABLE "' || table_schema || '"."' || table_name ||
'" DROP COLUMN id, ADD PRIMARY KEY USING INDEX "' || table_name || '_spid";' c
FROM information_schema.tables t
WHERE table_schema = 'tbm' AND table_name ~ '^tbmc_f\d+_subReply$'
LOOP
EXECUTE r.c;
END LOOP;
END$$;
7.1. 使用
5.1.3.
中的symmetric difference FULL OUTER JOIN row(table)
SELECT COUNT(*) FROM mysql."tbm_bilibiliVote" a FULL OUTER JOIN ONLY tbm."tbm_bilibiliVote" b ON row(a) = row(b) WHERE a IS NULL OR b IS NULL;
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
https://www.postgresql.org/message-id/flat/20060313104028.GB6714%40svana.org https://wiki.postgresql.org/wiki/PostgreSQL_vs_SQL_Standard#FULL_OUTER_JOIN_conditions
PostgreSQL currently limits such conditions to ones that are implementable without constructing explicit UNION operations; that is, the condition must be hashable, mergeable, or constant. Adding support for arbitrary full joins (which would likely require constructing a UNION of the inner join and the two anti-joins) seems like a lot of work for minimal gain.
https://dba.stackexchange.com/questions/146294/why-does-postgresql-perform-a-seq-scan-when-comparing-a-numeric-value-with-a-big/339617#339617 https://stackoverflow.com/questions/59941708/does-postgres-use-index-on-bigint-columns/78499596#78499596
https://code.jeremyevans.net/2022-11-01-forcing-sequential-scans-on-postgresql.html https://lobste.rs/s/jgl9w2/forcing_sequential_scans_on_postgresql https://pganalyze.com/blog/5mins-postgres-large-integers-causing-sequential-scan-instead-of-using-index https://www.postgresql.org/message-id/flat/001101c404a3%24fb34d320%24fd08a8c0%40steve
https://github.com/n0099/open-tbm/issues/48#issuecomment-2062907512
11.1. 而其早已被彻底批倒批臭 stackoverflow.com/questions/55300370/postgresql-serial-vs-identity 并在pgsql 10中实现了
isosql
中的identify
可使用 2ndquadrant.com/en/blog/postgresql-10-identity-columns 中的 en.wikipedia.org/wiki/PL/pgSQL3.3.
命令式存储过程
一键转换
环境
pgloader
@3.6.10
mysql_fdw
@2.9.1
库名假定为tbm
使用某11年历史的commonlisp程序 https://github.com/dimitri/pgloader 可以在多个数据源(其中有着mysql和pgsql)之间导入导出
1.1 由于 https://github.com/dimitri/pgloader/issues/1211 ,从postgre社区apt source安装的
pgloader 3.6.10-2.pgdg22.04+1
无法连接mysqld1.1.1. 建议从源码自行编译
3.6.10
1.1.2. 清理:
1.2. 由于 https://github.com/dimitri/pgloader/issues/782 ,pgloader只能通过仍在使用
mysql 8.0.34
中deprecated的mysql_native_password
用户登录mysqld1.2.1. 建议ad-hoc地临时创建一个最小权限(只
GRANT
库/表级权限)用户用于pgloader而非修改现有用户的default auth plugin
schema.load
有WITH quote identifiers, schema only
INCLUDING ONLY TABLE NAMES MATCHING ~/^tbm(|i(?!imageInReply)|c_f(78579|2265748|17019292|25459979|27546680))/ ;
2.3. 分开导入
表结构
和表数据
是因为pgloader默认行为是按照指定和mysql://
源下默认的CASTING
创建目的表结构
(除了索引),再COPY
导入表数据
,最后并行{表索引数量}
个CREATE INDEX
并且不等待其完成而是并行地开始下一个表的过程: https://pgloader.readthedocs.io/en/latest/batches.html由于未知原因在
3.6.af8c3c1
中WITH max parallel create index 1
并没有实际作用仍然会执行上述流程所以不得不分开在WITH schema only
的.load
中先完成CREATE INDEX
https://github.com/dimitri/pgloader/issues/1576 以减少io争用2.3.1. 出于类似的原因也建议先禁用
autovacuum
清理:待pgloader完成后再手动
VACUMM FULL
https://dba.stackexchange.com/questions/130496/is-it-worth-it-to-run-vacuum-on-a-table-that-only-receives-inserts/130514#1305142.4.
WITH quote identifiers
是为了方便后续校验两端表数据
是否一致时可以假定列名相同,您可以之后再重命名为postgre社区习惯的snake_case
以减少满屏幕""
(相当于mysql的``
)2.5. 视奸正在导入什么表 https://stackoverflow.com/questions/35319597/how-to-stop-kill-a-query-in-postgresql
可根据
表大小
猜测进度 https://stackoverflow.com/questions/21738408/postgresql-list-and-order-tables-by-size/21738505#21738505tbmc_f97650_thread
结构 mysql端:由pgloader在pgsql端按照
2.3.
中mysql://
源下默认的CASTING
所创建:3.1. 可以通过将表中
每列值
concat起来hash后再aggreate地对每行hash
再hash来得出整个表的hash
mermaid.livemysql
blob类型字段必须套
LOWER(HEX())
因为mysql默认的blob2text
输出类似0xDEADBEAF
形式而pgsql是deadbeaf
pgsql3.2. 然而在大表上两者最终都会耗尽内存而失败,并且需要
LOWER(HEX())
GROUP_CONCAT()
时按照PK或UK排序行否则在pgsql中默认迫真假随机
序一键生成
3.3. 理论上可以通过赋予纯SQL(无需臭名昭著的命令式
存储过程
或外部程序辅助)图灵完备性 https://wiki.postgresql.org/wiki/Mandelbrot_set 导致其踏入了同一条Turing Tarpit
之河流的rCTE
来LIMIT 10000
地每次hash的行并对每10k行aggreate之hash
再aggreate地hash从而减少每行hash所aggreate的字符串长度 mermaid.live某so纯路人
指出可以使用postgre社区牛逼哄哄
的某知名企业EnterpriseDB
所开发的某pgsql扩展 https://github.com/EnterpriseDB/mysql_fdw 通过foreign data wrapper来在pgsql中将外部数据当做迫真物化视图
查询4.1. 安装:
清理:
4.2. 安装该pgsql扩展
清理:
4.3. 在pgsql中创建fdw表
使用
table inheritance
是为了避免在2.
已经pgloader schema.load
后再写一遍fdw表结构
,但这会导致4.3.1. 被继承表
tbmc_f97650_thread
无法再直接UPDATE
/DELETE
返回[HV005] ERROR: system attribute "tableoid" can't be fetched from remote relation
而必须加ONLY
https://github.com/EnterpriseDB/mysql_fdw/issues/300 但仍然可以INSERT
4.3.1.1. 这个bug可能会在未来版本的
mysql_fdw
中修复但在这里并不重要,因为这种迫真假锁全表
正好避免了在校验期间UPDATE
/DELETE
(但防不了INSERT
)(默认事务隔离级别是READ COMMITTED
而非mysql默认的REPEATABLE READ
https://github.com/n0099/open-tbm/issues/32#issuecomment-1404661123 当然您也可以改)了pgsql表导致假阴性
4.3.2. 由于
table inheritance
的本质:也会出现
中的所有结果:也就是被继承表并集了所有继承表(在这里通过查询mysql),相当于
tbmi_ocr_box_Latn
结构: mysql端:由pgloader在pgsql端按照
2.3.
中mysql://
源下默认的CASTING
所创建:5.1. 根据
某so纯路人
symmdifftable有4种写法:5.1.1. https://stackoverflow.com/questions/15330403/find-difference-between-two-big-tables-in-postgresql/15333054#15333054
5.1.1.1.
一键生成
5.1.1.2. 也可以使用
NATURAL FULL OUTER JOIN ONLY
https://www.postgresql.org/docs/current/queries-table-expressions.html 省略USING (...)
5.1.2. 但
5.1.1.
和5.1.1.2.
并不能用于有列类型允许NULL
之存在的表如7.
中的tbm_bilibiliVote
表有着因为根据sql特色之 https://en.wikipedia.org/wiki/Three-valued_logic
NULL = NULL
是UNKNOWN
而UNKNOWN
转bool
是false
于是不会被ON
/WHERE
等clause视作相同 https://stackoverflow.com/questions/14366004/sql-server-join-missing-null-values5.1.2.1. 因而需要将
5.1.1.
中的USING (...columns)
展开成对所有列=
operator以及额外的OR (IS NULL AND IS NULL)
5.1.2.1.1. 不对
NOT NULL
的列做额外判断是为了避免7.1.
中的5.1.2.2.
一键生成
5.1.3. https://stackoverflow.com/questions/15330403/find-difference-between-two-big-tables-in-postgresql/49381589#49381589
5.1.3.1. 其中
row(t)
是 https://www.postgresql.org/docs/current/sql-expressions.html#SQL-SYNTAX-ROW-CONSTRUCTORS5.1.3.2. 这也避免了
5.1.2.
5.1.4. https://stackoverflow.com/questions/6337871/how-can-i-speed-up-a-diff-between-tables
由于
EXCEPT
不是symmdiff之FULL OUTER JOIN
而是SELECT a.* FROM a LEFT OUTER JOIN b
所以只能执行两遍2x耗时正如同没有FULL OUTER JOIN
可用的mysql人 https://stackoverflow.com/questions/4796872/how-can-i-do-a-full-outer-join-in-mysql5.1.4.1. 由于未知bug
UNION ALL
后始终为0行即便单独查询有行5.1.4.2. 这也避免了
5.1.2.
5.1.5. 典型的误用
NOT EXISTS
因为
SELECT * FROM mysql."tbmi_ocr_box_Latn"
在此显然返回非0行
(因而只需要读一行mysql."tbmi_ocr_box_Latn"
几百ms后就会立即返回)从而使得该predicate在非空表上恒真5.1.5.1. 由于未知bug
UNION ALL
后始终为0行
即便单独查询有非0行
5.1.6. 除
5.1.5.
外的3种执行期间均需pgsql消耗1x表大小
的内存(总比3.2.
好)和3x表大小
的临时表
存储5.1.6.1. 可以修改
临时表空间
在fs上的默认路径/var/lib/postgresql/16/main/base/pgsql_tmp
https://dba.stackexchange.com/questions/170661/in-postgres-how-do-i-adjust-the-pgsql-tmp-setting/170665#1706655.2. 总行数对比
compare.sql
有在mysql中codegen出基于5.1.3.
、4.3.
和5.2.
校验用的一键生成
pgsqlREGEXP部分同
2.1.
6.1. 一键bash
tbm_bilibiliVote
结构 https://github.com/n0099/bilibiliVote mysql端:由pgloader在pgsql端按照
2.3.
中mysql://
源下默认的CASTING
所创建:7.1. 使用
5.1.3.
中的symmetric difference FULL OUTER JOIN row(table)
可得 https://stackoverflow.com/questions/47405732/why-does-postgresql-throw-full-join-is-only-supported-with-merge-joinable-or-ha
7.1.1. 使用
5.1.5.
中的non symmetric difference EXCEPT
可得 https://stackoverflow.com/questions/48420438/could-not-identify-an-equality-operator-for-type-json-when-using-distinct
7.1.2. 执行
5.1.2.1.
会有 https://stackoverflow.com/questions/32843213/operator-does-not-exist-json-json
7.1.3.
7.
中可见列replyContent
是json
类型这恰好等于
5.2.
之和
7.1.4. 但即便是
json::text
也仍然相同
7.1.5. 可以在mysql端将该
json
列转text
后再重新按照
2.2.
仅导入该表(INCLUDING ONLY TABLE NAMES MATCHING 'tbm_bilibiliVote'
)7.2. 但即便两端的列
replyContent
重新导入后均为text
类型也仍然有所有行
不同 实际上是由于列postTime
mysql无时区类型datetime
由2.3.
中的pgloadermysql://
默认的CASTING
转为pgsql
有时区类型timestamptz
时由pgloader根据系统时区
UTC+8
而非pgsql时区
UTC+0
转换为UTC
7.2.1. 由于
timestamptz
实际上并没有额外存储时区UTC offset
https://stackoverflow.com/questions/5876218/difference-between-timestamps-with-without-time-zone-in-postgresql 因而其并不能解决 https://z.n0099.net/#narrow/near/86397 https://old.reddit.com/r/PostgreSQL/comments/xpygbh/when_would_i_ever_use_timestamp_over_timestamptz/ https://news.ycombinator.com/item?id=20212671 中争论的带时区引用未来指定datetime
问题 https://www.w3.org/International/wiki/WorkingWithTimeZones#Past_and_Future_Eventshttps://www.w3.org/International/wiki/WorkingWithTimeZones#Floating_Time
再叠加mysql_fdw不论当前
pgsql时区
https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-TIMEZONES 是什么都不会在将从mysql读到的datetime
转换为timestamptz
时根据时区重新计算时间而是只修改offset(也就是使用跟mysql类型datetime
语义相同的pgsql类型timestamp
的语义)因而只需要修改
pgsql时区
便校验一致
7.2.2. 如果执行了
7.1.5.
需再在pgsql端将该列类型改回json
或jsonb
tbmi_metadata_gif
结构:由pgloader在pgsql端按照
2.3.
中mysql://
源下默认的CASTING
所创建:8.1. 按照
7.1.3.
有
8.1.1. 但按照
7.1.4.
就无差异8.1.2. 这是由于
jsonb
类型不同于本质text
的json
,会将json解析后存储为pgsql内置类型之集,再由于text
类型受c人最爱的
https://en.wikipedia.org/wiki/Null-terminated_string 影响不支持存储256个字节
中的唯一一个0x00
导致jsonb
也无法存储\u0000
转义后的0x00
进其内部的text
https://github.com/turbot/steampipe-postgres-fdw/issues/118 https://www.postgresql.org/docs/current/datatype-json.html8.2. 该表该列中共有如下
\u0000
https://codepoints.net/U+0000[" ", "Created by TechSmith\u0000"]
["tvc\u0000"]
["Author:\t\tGod\_job\_dave\u0000\u0000\u0000???\u0001\u0000\u0000\u0000\u0000P??\u0001\u0000"]
["Optimized by Ulead SmartSaver!\u0000"]
["CREATOR: gd-jpeg v1.0 \(using IJG JPEG v62\), quality = 75\n\u0000"]
8.2.1. 使用如下3层如同
6.
中codegen迫真元编程
自我复制的quine病毒
可在mysql端检查所有含有0x00
的text
类型及其各个长度变种 https://stackoverflow.com/questions/13932750/tinytext-text-mediumtext-and-longtext-maximum-storage-sizes 列并生成ALTER
列类型为BLOB
(pgloader在pgsql端按照2.3.
中mysql://
源下默认的CASTING
会转换为可存储所有256个字节
的bytea
)和SELECT
查阅的一键生成
mysql:2.1.
ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `userComment` BLOB NOT NULL;
|SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`userComment` AS BINARY) LIKE CONCAT('%', 0x00, '%');
ALTER TABLE tbm.`tbmi_metadata_embedded_exif` MODIFY `xpAuthor` BLOB NOT NULL;
|SELECT * FROM tbm.`tbmi_metadata_embedded_exif` WHERE CAST(`xpAuthor` AS BINARY) LIKE CONCAT('%', 0x00, '%');
8.3. 值得注意的是
2.2.
导入 https://github.com/dimitri/pgloader/issues/1573 不会有截断0x00
后字节的WARNING
且6.
校验 https://github.com/EnterpriseDB/mysql_fdw/issues/299 过程中也无法发现以0x00
结尾text
中的最后一个0x00
字节消失tbmi_hash
结构:由pgloader在pgsql端按照
2.3.
中mysql://
源下默认的CASTING
所创建:9.1. 由于pgsql没有mysql特有(
isosql
不要求)带unsigned
修饰的的int类型 https://stackoverflow.com/questions/20810134/why-unsigned-integer-is-not-available-in-postgresql/59802732#59802732 正如同m$ft精神mvp
最爱的中小实体企业c#工控上位机人上壬
https://z.n0099.net/#narrow/near/94726 除非使用扩展 https://github.com/petere/pguint 因而导致该列上限值从 $2^{64} = 18446744073709551615$ 降到 $2^{63} = 9223372036854775807$
9.2. 实际上所谓的
opencv_imghash
https://github.com/Starry-OvO/aiotieba/pull/63#discussion_r1056941392 之averageHash
https://github.com/n0099/open-tbm/blob/252710bf61ca63cc78eeb6373b4c564e8dbb0217/c%23/imagePipeline/src/Consumer/HashConsumer.cs#L19 本质上更类似于 https://en.wikipedia.org/wiki/Nominal_number 因而不应该使用具有可运算的正整数
语义之类型存储而应直接视作斑点二进制
https://en.wikipedia.org/wiki/Binary_large_object9.3.
一键生成
找出所有存在值 $> 2^{63}$ 的bigint unsigned
列的mysqlREGEXP部分同
2.1.
9.4.
类比
7.1.5.
重新导入该表Comment is too long (maximum is 65536 characters)