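This document is a specification of the FileGDB format derived by reverse engineering. It is still being refined.

It covers the .gdbtable, .gdbtablx, .gdbindexes, .atx, .spx and .freelist files.

Unless stated otherwise, this document applies to FileGDB v10 and earlier versions.

Original document: https://github.com/rouault/dump_gdbtable/wiki/FGDB-Spec

Online base conversion tool: https://tool.oschina.net/hexconvert/

Online table of ASCII codes, decimal values and characters: https://zhuanlan.zhihu.com/p/408357733?ivk_sa=1024320u

On byte order: https://www.ruanyifeng.com/blog/2016/11/byte-order.html

Shapefile format: https://www.esri.com/content/dam/esrisites/sitecore-archive/Files/Pdfs/library/whitepapers/pdfs/shapefile.pdf

GeoJSON format: https://www.rfc-editor.org/rfc/rfc7946#section-3.1.7

Features in a GDB are organized in a way that is somewhat similar to a shapefile.

Before working on this, you need a clear understanding of data types, number bases and character encodings.

Conventions: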

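ubyte: unsigned byte
int16: little-endian 16-bit integer
int32: little-endian 32-bit integer
uint32: little-endian unsigned 32-bit integer
int64: little-endian 64-bit integer
float64: little-endian 64-bit IEEE754 floating point number
utf16: string in little-endian UTF-16 encoding
string: (UTF-8 ?) string
varuint: variable-length encoded unsigned integer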

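In this document, row and feature are synonyms.

A bit is a single 0 or 1, i.e. one binary digit. A byte is 8 bits.

GDB file structure

Files are named a[number in lowercase hex].[extension]. a00000001 is the first file and a00000002 the second; numbers may be skipped.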

FileGDB v10

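In FileGDB v10, the first 8 files (a00000001 to a00000008) are fixed built-in files reserved for the metadata of the database itself. The files that follow (a00000009, a0000000a, ...) hold the actual feature data.

(Translator's note: the GDB is treated as a whole. The list of files of the database, the database configuration, the coordinate system of each table, the metadata of each table, GDB_ItemRelationships, GDB_ItemRelationshipTypes, GDB_ItemTypes and so on are split out and stored separately. Only the tables from a00000009 onwards hold actual user data, and none of those files stores the coordinate system or the metadata; they only hold the field definitions and the rows. It is not yet clear what GDB_ItemRelationships, GDB_ItemRelationshipTypes and GDB_ItemTypes are for; they may be needed for topology checks.)

a00000001, also called GDB_SystemCatalog, contains the list of all the files of the database, itself included. A table recorded here may be missing on disk, for example a00000008. The FID of a record gives the file name.

For example, the record with FID 37 (FID numbering here starts at 1) is stored in file a00000025 (translator's note: decimal 37 is 25 in hexadecimal). This catalog table may contain deleted rows, so there can be gaps in the FID numbering.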
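The FID-to-file-name rule can be sketched in Go as follows. This is only an illustration of the naming convention described above; the helper name tableFileName is hypothetical and not part of any library.

```go
package main

import "fmt"

// tableFileName builds the .gdbtable file name for a GDB_SystemCatalog FID:
// the letter "a" followed by the FID written as 8 lowercase hex digits.
// (Hypothetical helper, named here for illustration only.)
func tableFileName(fid uint32) string {
	return fmt.Sprintf("a%08x.gdbtable", fid)
}
```

For instance, tableFileName(37) gives a00000025.gdbtable, and tableFileName(10) gives a0000000a.gdbtable.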

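No FID attribute column has been found so far in a00000001.gdbtable. Its ID column is actually of type objectid, so it cannot be used directly for comparison.

This table also contains the Name and FileFormat fields. FileFormat is 0 most of the time, and 2 for a few of the built-in reserved tables.

a00000002, called GDB_DBTune, mainly contains configuration information for the database.
a00000003, also called GDB_SpatialRefs, mainly stores coordinate system information. The coordinate system is stored in the SRTEXT field as WKT, in the ESRI flavor of WKT. The other fields are: FalseX, FalseY, XYUnits, FalseZ, ZUnits, FalseM, MUnits, XYTolerance, ZTolerance, MTolerance.

All rows are unique, so if 3 feature classes share the same coordinate system but one of them has a different ZTolerance, there will be two records.

a00000004, also called GDB_Items, contains the metadata of the layers, expressed as XML. Its fields are: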

UUID (UUID) : UUID
Type (UUID) : item type
Name (string) : item/layer name. Matches the Name field of the GDB_SystemCatalog
PhysicalName (string) : item/layer name in upper case characters.
Path (string) : "\mylayername" for top-level layers or "\myfeaturedataset\mylayername" for layers attached to a feature dataset "myfeaturedataset"
DatasetSubType1 (int32) : 1 for user tables (TBC)
DatasetSubType2 (int32) : layer geometry type. 1 for point layer, 2 for multipoint layers, 3 for linestring layers, 4 for polygon layers
DatasetInfo1 (string) : "SHAPE" for user tables (TBC)
DatasetInfo2 (string) : NULL for user tables (TBC)
URL (string) : empty string (TBC)
Definition (XML) : DEFeatureClassInfo XML element. Contains an XML version of the information that can be obtained by parsing the header of a table : fields, SRS, ...
Documentation (XML) : metadata XML element
ItemInfo (XML) : NULL for user tables (TBC)
Properties (int32) : 1 for user tables (TBC)
Defaults (binary) : absent for user tables (TBC)
Shape (geometry) : 5 point polygon listing the corner of the bounding box of the layer reprojected into EPSG:4326 (even if the layer SRS is not EPSG:4326). Or missing if the layer SRS is undefined.

Some special records:

The first record is reserved for a kind of root item ( Name = "", Path = "" ).
The second record is reserved for a Name = "Workspace" item, Path = "", Definition containing a DEWorkspace XML element
When there are feature datasets, they also appear as records : e.g. Name = "featuredataset", PhysicalName = "FEATUREDATASET", Path = "\FEATUREDATASET", Definition containing a DEFeatureDataset XML element

a00000005, a00000006 and a00000007 each hold one of GDB_ItemRelationships, GDB_ItemRelationshipTypes or GDB_ItemTypes (the order may vary depending on the dataset).
a00000008 is listed in the catalog but may not actually exist on disk.

FileGDB v9

Not documented here yet.

.gdbtable file specification

A .gdbtable file describes the fields and contains the row data.

It consists of three parts: header, fields and rows.

Header (40 bytes)

int32: == 3 - version of the format?
int32: number of (valid) rows
int32: maximum of row sizes and size of field description section
int32: == 5 - unknown role. Constant among the files
4 bytes: varying values - unknown role. Seems to be 0x00 0x00 0x00 0x00 for FGDB 10 files, but not for earlier versions
4 bytes: 0x00 0x00 0x00 0x00 - unknown role. Constant among the files
int64: file size in bytes
int64: offset in bytes at which the field description section begins (often 40 in FGDB 10). Note: datasets with 5 significant bytes (ie beyond 4GB) have been found per https://trac.osgeo.org/gdal/ticket/6830.
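As a rough sketch, the 40-byte header above could be decoded in Go like this. The struct and function names are hypothetical, and the two fields of unknown role at offsets 16 to 23 are simply skipped; this is an illustration of the layout, not a reference implementation.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// TableHeader holds the .gdbtable header fields whose meaning is known.
// (Hypothetical type, named here for illustration only.)
type TableHeader struct {
	Version         int32 // expected to be 3
	ValidRows       int32 // number of (valid) rows
	MaxRowSize      int32 // maximum of row sizes and field description section size
	FileSize        int64 // file size in bytes
	FieldDescOffset int64 // where the field description section begins
}

// parseTableHeader decodes the first 40 bytes of a .gdbtable file.
func parseTableHeader(b []byte) (TableHeader, error) {
	if len(b) < 40 {
		return TableHeader{}, errors.New("gdbtable header needs 40 bytes")
	}
	le := binary.LittleEndian
	return TableHeader{
		Version:    int32(le.Uint32(b[0:4])),
		ValidRows:  int32(le.Uint32(b[4:8])),
		MaxRowSize: int32(le.Uint32(b[8:12])),
		// b[12:16] is the constant 5; b[16:24] has an unknown role.
		FileSize:        int64(le.Uint64(b[24:32])),
		FieldDescOffset: int64(le.Uint64(b[32:40])),
	}, nil
}
```

A caller would typically read the first 40 bytes of the file, pass them to parseTableHeader, and then seek to FieldDescOffset to read the field section.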

Field section

Fixed part

int32: size of header in bytes (this field excluded)
int32: version of the file. 3 for FGDB 9.X files and 4 for FGDB 10.X files. No other known values.
uint32: layer flags, including geometry type:
1. bits 0 - 7: (i.e. flag & 0xff) geometry type:
   0 = none
   1 = point
   2 = multipoint
   3 = (multi)polyline
   4 = (multi)polygon
   5 = rectangle (envelope)
   6 = "path"
   7 = mixed/any geometry type
   9 = multipatch
   11 = ring
   13 = line
   14 = circular arc
   15 = bezier curves
   16 = elliptic curves
   17 = geometry collection (any types)
   18 = triangle strip
   19 = triangle fan
   20 = ray
   21 = sphere
   22 = TIN
2. bit 8: string encoding. Set for UTF-8 encoded strings. If not set, UTF-16 strings are used (affects feature strings and field default values)
3. bit 9: (or bits 10 or 12) likely an indicator of whether the database uses "high precision storage" or not. Always 1 in all encountered files, and according to the ESRI docs, it hasn't been possible to make low precision gdbs since 9.2
4. bit 10: possibly storage type, see bit 9
5. bit 11: unknown
6. bit 12: possibly storage type, see bit 9
7. bit 30: geometry has M values
8. bit 31: geometry has Z values
int16: number of fields (including geometry field and implicit OBJECTID field)
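The bit layout of the layer flags can be illustrated with a small Go sketch. The type and function names are hypothetical; only the bits with a documented meaning (geometry type, string encoding, M and Z) are decoded, and the bits of uncertain role (9 to 12) are ignored.

```go
package main

// LayerFlags is a decoded view of the uint32 layer flags.
// (Hypothetical type, named here for illustration only.)
type LayerFlags struct {
	GeometryType uint8 // bits 0 - 7
	UTF8Strings  bool  // bit 8: set for UTF-8, unset for UTF-16
	HasM         bool  // bit 30
	HasZ         bool  // bit 31
}

// decodeLayerFlags extracts the documented bits from the layer flags.
func decodeLayerFlags(flags uint32) LayerFlags {
	return LayerFlags{
		GeometryType: uint8(flags & 0xff),
		UTF8Strings:  flags&(1<<8) != 0,
		HasM:         flags&(1<<30) != 0,
		HasZ:         flags&(1<<31) != 0,
	}
}
```

With a made-up flags value of 0x80000304, this yields geometry type 4 (polygon), UTF-8 strings, no M values and Z values present.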

Repeated part (one per field)

This is followed by the field descriptions (repeated as many times as there are fields)

ubyte: number of UTF-16 characters (not bytes) of the name of the field
utf16: name of the field
ubyte: number of UTF-16 characters (not bytes) of the alias of the field. Might be 0
utf16: alias of the field (omitted if the previous value is 0)
ubyte: field type ( 0 = int16, 1 = int32, 2 = float32, 3 = float64, 4 = string, 5 = datetime, 6 = objectid, 7 = geometry, 8 = binary, 9=raster, 10/11 = UUID, 12 = XML )
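Reading a field name or alias (a ubyte character count followed by that many UTF-16 code units) might look like the sketch below. The function name readUTF16Name is hypothetical; the returned byte count lets the caller move on to the next item of the field description.

```go
package main

import (
	"encoding/binary"
	"errors"
	"unicode/utf16"
)

// readUTF16Name reads a ubyte character count followed by that many
// little-endian UTF-16 code units. It returns the decoded string and the
// number of bytes consumed. (Hypothetical helper for illustration.)
func readUTF16Name(b []byte) (string, int, error) {
	if len(b) < 1 {
		return "", 0, errors.New("missing length byte")
	}
	n := int(b[0])
	end := 1 + 2*n // the count is in characters, so 2 bytes each
	if len(b) < end {
		return "", 0, errors.New("truncated UTF-16 string")
	}
	units := make([]uint16, n)
	for i := range units {
		units[i] = binary.LittleEndian.Uint16(b[1+2*i:])
	}
	return string(utf16.Decode(units)), end, nil
}
```

Calling it twice in a row (name, then alias) and then reading one more byte gives the field type; a count of 0 simply yields an empty alias and consumes one byte.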

The bytes that follow in a field description depend on the field type.

field type = 4 (string),

int32: maximum length of string
ubyte: flag
varuint: ldf = length of the default value in bytes. If (flag&4) != 0, it is followed by ldf bytes holding the default value

field type = 6 (objectid),

ubyte: unknown role = 4
ubyte: unknown role = 2

field type = 7 (geometry),
ubyte: unknown role = 0
ubyte: flag = 6 or 7. If lsb is 1, the field can be null.
int16: length (in bytes) of the WKT string describing the SRS.
string: WKT string describing the SRS, or {B286C06B-0879-11D2-AACA-00C04FA33C20} for no SRS (which corresponds to the COM CLSID for the ESRI UnknownCoordinateSystem class, see http://desktop.arcgis.com/en/arcobjects/latest/net/webframe.htm#UnknownCoordinateSystem.htm).
ubyte: flags. Combination of values:

(1<<0) seems to be systematically set (only bit set for system table a00000004.gdbtable)
(1<<1) indicates has_m = true
(1<<2) indicates has_z = true
float64: xorigin (X value of the coordinate origin)
float64: yorigin (Y value of the coordinate origin)
float64: xyscale (scale factor for X/Y coordinates)
float64: morigin (present only if has_m = True)
float64: mscale (present only if has_m = True)
float64: zorigin (present only if has_z = True)
float64: zscale (present only if has_z = True)
float64: xytolerance
float64: mtolerance (present only if has_m = True)
float64: ztolerance (present only if has_z = True)
float64: xmin of layer extent (might be NaN)
float64: ymin of layer extent (might be NaN)
float64: xmax of layer extent (might be NaN)
float64: ymax of layer extent (might be NaN)

If geometry has z values (bit 31 of layer geometry type flags):

float64: zmin of layer extent (might be NaN)
float64: zmax of layer extent (might be NaN)

If geometry has m values (bit 30 of layer geometry type flags):

float64: mmin of layer extent (might be NaN)
float64: mmax of layer extent (might be NaN)

Then, values relating to the spatial index for the field:

a byte always at 0 (possibly an indicator of existence of spatial index or its type?)
a uint32 whose value is 1, 2 or 3, indicating the number of spatial grid sizes (see e.g. http://desktop.arcgis.com/en/arcmap/10.3/tools/data-management-toolbox/add-spatial-index.htm for more details about spatial grid sizes)
for each grid size, float64: spatial index grid resolution at this level (referenced as grid_size[] in later section describing .spx files). ESRI software enforces grid_size[1] >= 3 * grid_size[0] and grid_size[2] >= 3 * grid_size[1]

field type = 8 (binary),

ubyte: unknown role
ubyte: flag

field type = 9 (raster),

ubyte: unknown role
ubyte: flag. If lsb is 1, the field can be null.
ubyte: number of UTF-16 characters (not bytes) of the following string
utf16: string whose value seems to be "Raster Column"
int16: length (in bytes) of the WKT string describing the SRS.
string: WKT string describing the SRS Or {B286C06B-0879-11D2-AACA-00C04FA33C20} for no SRS .
ubyte: flags. Value is generally 1 (has_z = has_m = false, generally for system table a00000004.gdbtable), 5 (has_z = true, has_m = false) or 7 (has_z = has_m = true). If 0, none of the following float64 values is present: the next one is the ubyte raster_type.
float64: xorigin
float64: yorigin
float64: xyscale
float64: morigin (present only if has_m = True)
float64: mscale (present only if has_m = True)
float64: zorigin (present only if has_z = True)
float64: zscale (present only if has_z = True)
float64: xytolerance
float64: mtolerance (present only if has_m = True)
float64: ztolerance (present only if has_z = True)
ubyte: raster_type (0=if raster is stored externally, 1=if raster is managed within filegdb, 2=if raster is inlined)

field type = 10, 11 (UUID)

ubyte: width : 38
ubyte: flag

field type = 12 (XML)

ubyte: width : 0
ubyte: flag

Other field types,

ubyte: width in bytes (e.g. 2 for int16, 4 for int32, 4 for float32, 8 for float64, 8 for datetime)
ubyte: flag
ubyte: ldf = length of the default value in bytes. If (flag&4) != 0, it is followed by ldf bytes holding the default value

If the lsb of the flag field (when present) is set to 1, the field can be null in a record.

Rows

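The rows section does not necessarily start right after the last field description. It usually starts a few bytes later, but not in a predictable way.

Note:

For FGDB layers created with the ESRI FGDB SDK API, there are 4 bytes between the end of the field description section and the beginning of the rows section: 0xDE 0xAD 0xBE 0xEF

The rows section is a sequence of X rows (where X is the total number of features found in the .gdbtablx, which may differ from the number of valid rows found in the .gdbtable header)

Row description

int32: length in bytes of the row blob (this field excluded)
ceil(number_nullable_fields / 8) * ubyte: flags marking which fields are null. number_nullable_fields is the number of fields that can be null (ArcGIS shows which fields are nullable). objectid cannot be null, so it is not counted; the shape field can be null, so it is counted. Divide the number of nullable fields by 8 and round up to get the number of bytes reserved for these flags. The details are as follows.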

Null fields flags

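The null information is stored in n bytes, with n = ceil(number_nullable_fields / 8), and each nullable field is controlled by one bit. For example, 11111100 means that the first two nullable fields are not null. 1 means the field has no value and 0 means the field has a value. Bits are assigned starting from the rightmost (least significant) bit, and the first field is usually the shape (spatial data) field. When there are enough fields to need two or more bytes, each byte is read the same way: the ninth nullable field is the least significant bit of the second byte. When inspecting the file with a hex editor such as FlexHEX, the flags appear as hexadecimal bytes (FC for 11111100), not as individual bits.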

Each bit of the flags field encode for the presence or absence of the field content, for a nullable field, for the row. The flag is set to 1 if the field is missing/null (1 is used as well for spare bits), or 0 if the field is present/non-null. The flag for the first field, in the order of the fields of the field description section (typically the geometry), is the least significant bit of the first byte of the flags field.

There are no bits reserved for non-nullable fields.

If all fields are non-nullable, the flag field is absent.

Note: there's no explicit data for OBJECTID and no reserved flag bit for it.

For each non-null field, the field content is appended in the order of the fields of the field description section.
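The null-flag rules above can be sketched in Go as follows. The function names are hypothetical; the index i counts nullable fields only, in the order of the field description section.

```go
package main

// nullFlagBytes returns how many flag bytes precede the field contents
// of a row: ceil(nullableFields / 8), and 0 when no field is nullable.
// (Hypothetical helper for illustration.)
func nullFlagBytes(nullableFields int) int {
	return (nullableFields + 7) / 8
}

// isNull reports whether the i-th nullable field (0-based) is null.
// Field i uses bit i%8 of byte i/8, the least significant bit coming first;
// a bit set to 1 means the field is missing/null.
func isNull(flags []byte, i int) bool {
	return flags[i/8]&(1<<(uint(i)%8)) != 0
}
```

With the flags byte 0xFC (binary 11111100), isNull returns false for the first two nullable fields and true for the third.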

Values of string fields are encoded in UTF-8 (this is not stated in the English version of the document).

.gdbtablx file specification

A .gdbtablx file contains the offsets of the rows of the .gdbtable.

Header (16 bytes)

4 bytes: 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ?
int32: n1024BlocksPresent = number of blocks of offsets for 1024 features that are effectively present in that file (ie sparse blocks are not counted in that number).
int32: number_of_rows : number of rows, included deleted rows
int32: size_offset = number of bytes to encode each feature offset. Must be 4 (.gdbtable up to 4GB), 5 (.gdbtable up to 1TB) or 6 (.gdbtable up to 256TB)

Offset section

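For example, the bytes 6D 02 00 00 00 form a little-endian integer. It is written here as five bytes, which would correspond to size_offset = 5; read as an int32, only the first four bytes are used. Its value is 0x26D in hexadecimal, i.e. 621 in decimal, which matches the actual offset in the .gdbtable.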
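Since size_offset can be 4, 5 or 6, a fixed-width integer read is not enough. A minimal Go sketch of decoding one offset of arbitrary width (the function name readOffset is hypothetical):

```go
package main

// readOffset decodes one little-endian row offset stored on sizeOffset
// bytes (4, 5 or 6 according to the .gdbtablx header).
// (Hypothetical helper for illustration; b must hold at least sizeOffset bytes.)
func readOffset(b []byte, sizeOffset int) uint64 {
	var v uint64
	for i := 0; i < sizeOffset; i++ {
		// Byte i carries bits 8*i .. 8*i+7 of the value.
		v |= uint64(b[i]) << (8 * uint(i))
	}
	return v
}
```

Applied to the bytes 6D 02 00 00 00 with a size of 5, this returns 621.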

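.atx file specification

The mapping between the logical table names and the names of the files that actually hold the data is stored in a00000001.TablesByName.atx.

An .atx file is an index on one field of the .gdbtable. Basically, it lists the values taken by that field in the .gdbtable, in ascending order, together with the associated FIDs. An .atx file is organized in pages of 4096 bytes, with a hierarchy that depends on the size of the field values and the number of features in the .gdbtable. The first page is 1, so page N is located at offset (N-1)*4096.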

The reading of .atx files must start with its trailing section.

Trailing section (22 bytes)

This section is at the end of the file, so it can be read directly as the last 22 bytes of the file.

byte: size in bytes of an indexed field value, referred to below as size_value. It depends on the type of the indexed field. For int16: 2. For int32: 4. For float32: 4. For float64: 8. For string: a variable number that is a multiple of 2 (string values are encoded as UTF-16 characters, so 2 bytes per character), at most 160 bytes (80 characters). For datetime: 8. For UUID: 38 (the string representation is 38 bytes. See above). Indexing of binary or XML fields has not been studied (if it is possible at all!)
byte: unknown role
int32: unknown role. Apparently always/often 1.
uint32: index depth >= 1. If it is 1 the first page directly references features. Otherwise the first page reference pages that reference pages referencing features (depth = 2), or pages that reference pages that reference pages that reference features (depth = 3), and so on...
uint32: number of features referenced in the file. In other words, the number of features that have a non-null value for the field being indexed. Must not be greater than the number of valid features of the .gdbtable. It has been observed that (with FileGDB SDK 1.3) this value is not reliable for an index that has been built while features are inserted, if the values inserted are not in increasing order.
int32: unknown role. Apparently always/often 0.
int32: unknown role. Apparently always/often 1.

The maximum number of features (or sub-pages references) in a page is : nMaxPerPages = (4096 - 12) / (4 + size_value)

The offset at which field values are found in a page is : nOffsetFirstValInPage = 12 + nMaxPerPages * 4
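The two formulas above and the page offset rule translate directly to integer arithmetic. A small Go sketch, with hypothetical function names and the division understood as integer (truncating) division:

```go
package main

// pageSize is the fixed size of an .atx page in bytes.
const pageSize = 4096

// maxPerPage is nMaxPerPages = (4096 - 12) / (4 + size_value).
func maxPerPage(sizeValue int) int {
	return (pageSize - 12) / (4 + sizeValue)
}

// offsetFirstValInPage is nOffsetFirstValInPage = 12 + nMaxPerPages * 4.
func offsetFirstValInPage(sizeValue int) int {
	return 12 + maxPerPage(sizeValue)*4
}

// pageOffset returns the file offset of page n (pages are numbered from 1).
func pageOffset(n int) int64 {
	return int64(n-1) * pageSize
}
```

For an int32 index (size_value = 4) this gives 510 entries per page with values starting at offset 2052; for a float64 index (size_value = 8), 340 entries with values starting at offset 1372.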

Page referencing features (4096 bytes)

For a given field value, if found in several features, the features are sorted by ascending ID. The structure of such a page is header section (12 bytes), followed by FID numbers (maximum of 4 * nMaxPerPages bytes), a few potential padding bytes, and finally field values (maximum of size_value * nMaxPerPages bytes)

Header section structure (offset 0 in the page) :

uint32: ID of the next page at the same depth, or 0 for last page. Not strictly needed to use the index (under the assumption that if index_depth == 1, there is a single feature page, and for higher index depth, all feature-referencing pages are referenced from page-referencing pages. Such an assumption seems to match how indices are generated, and is a good practice for efficient hierarchical indexing)
uint32: number of features referenced in the page (nFeatures). Not greater than nMaxPerPages
uint32: unknown role. Apparently always/often 0.

FID section structure (offset 12 in the page) :

uint32: FID of the first feature referenced in the page
...
uint32: FID of the (nFeatures)th feature referenced in the page.

Padding section of zeroes (size: nOffsetFirstValInPage - 12 - 4 * nFeatures)

Values section structure (offset nOffsetFirstValInPage in the page):

type depending on the field (int16/int32/float32/float64/datetime as float64/string as UTF16 characters/UUID): value of field for the first feature referenced in the page
...
type: value of field for the (nFeatures)th feature referenced in the page.

Page referencing other pages (4096 bytes)

The structure of such a page is header section (8 bytes), followed by sub-pages numbers (maximum of 4 * (1 + nMaxPerPages) bytes), a few potential padding bytes, and finally field values (maximum of size_value * nMaxPerPages bytes)

Header section structure (offset 0 in the page) :

uint32: ID of the next page at the same depth, or 0 for last page. Not strictly needed to use the index (under the assumption that such a page is always referenced from a page upper in the hierarchy if there are several at that depth. Such an assumption seems to match how indices are generated, and is a good practice for efficient hierarchical indexing)
uint32: number of sub-pages referenced in the page (nSubPages). Not greater than nMaxPerPages

Sub-pages number section (offset 8 in the page):

uint32: ID of the first sub-page referenced in the page
...
uint32: ID of the (nSubPages)th sub-page referenced in the page.
uint32: ID of the (nSubPages+1)th sub-page referenced in the page (note: there is no matching value for that last sub-page number in the values section)

Padding section of zeroes (size: nOffsetFirstValInPage - 8 - 4 * (nSubPages+1))

Values section structure (offset nOffsetFirstValInPage in the page):

type depending on the field (int16/int32/float32/float64/datetime as float64/string as UTF16 characters/UUID): maximum value of field taken in the features referenced by the sub-page (and its potential sub-sub-pages) for the first sub-page referenced in the page
...
type: maximum value of field taken in the features referenced by the sub-page (and its potential sub-sub-pages) for the (nSubPages)th sub-page referenced in the page

Spatial information

First, seek to the SINFO offset taken from the TOC

float64: x min (Extent of layer)
float64: y min
float64: x max
float64: y max
float64: unknown -- maybe resolution?

If z or m is present, there appear to be two more pairs of doubles, probably the z/m min/max, but the order is not yet known

varuint: number of UTF-16 characters (not bytes) of the WKT definition of the table's SRS
utf16: WKT definition of table's SRS

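The phrase "number of bytes of the string as a varuint" can be hard to follow. It means: take the length of the string in bytes once encoded (for ASCII text, one byte per character) and store that length as a varuint. A varuint is a variable-length unsigned integer, not a variable number of uints: each byte carries 7 bits of the value, least significant group first, and the high bit of a byte is set when another byte follows.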
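To make the varuint encoding concrete, here is a minimal Go decoder. It assumes the usual base-128 scheme (7 data bits per byte, least significant group first, high bit set when another byte follows); the function name readVarUint is hypothetical.

```go
package main

import "errors"

// readVarUint decodes a variable-length unsigned integer from the start of b.
// It returns the value and the number of bytes consumed.
// (Hypothetical helper; assumes the base-128 encoding described in the lead-in.)
func readVarUint(b []byte) (uint64, int, error) {
	var v uint64
	for i := 0; i < len(b) && i < 10; i++ {
		// The low 7 bits of byte i are bits 7*i .. 7*i+6 of the value.
		v |= uint64(b[i]&0x7f) << (7 * uint(i))
		if b[i]&0x80 == 0 {
			// High bit clear: this was the last byte.
			return v, i + 1, nil
		}
	}
	return 0, 0, errors.New("truncated or oversized varuint")
}
```

For example, the single byte 05 decodes to 5, and the two bytes AC 02 decode to 300 (0x2C plus 2 shifted left by 7 bits).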

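In the geometry encoding, xorigin and yorigin are the coordinate origin of the field, and xyscale is the scale in the GIS sense.

The values of the objectid column are not stored in the row field content of the .gdbtable file.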

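Problems you may run into when parsing with Go

The main difficulty when parsing with Go is being unfamiliar with data type conversions.

Data read with the bufio package comes back as a []byte. What we need is the numeric value. A hex editor shows the file contents in hexadecimal, but a byte read in Go is already a plain number, so it can be used directly: to get it as a uint64, write uint64(bytedata[0]). To get the character for that ASCII code instead, use string(bytedata[0]).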
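The conversions mentioned above, plus the multi-byte cases that the spec needs (int32 and float64 in little-endian order), can be sketched with the standard encoding/binary and math packages. The function names are hypothetical wrappers written for illustration.

```go
package main

import (
	"encoding/binary"
	"math"
)

// byteValue returns the numeric value of a single byte; no base conversion
// is needed, hexadecimal is only how editors display it.
func byteValue(b byte) uint64 { return uint64(b) }

// byteChar returns the one-character string for a byte holding an ASCII code.
func byteChar(b byte) string { return string(rune(b)) }

// leInt32 assembles a little-endian int32 from the first 4 bytes of b.
func leInt32(b []byte) int32 {
	return int32(binary.LittleEndian.Uint32(b))
}

// leFloat64 assembles a little-endian IEEE 754 float64 from the first 8 bytes of b.
func leFloat64(b []byte) float64 {
	return math.Float64frombits(binary.LittleEndian.Uint64(b))
}
```

For instance, byteValue(0x6D) is 109, byteChar(0x41) is "A", and leInt32 applied to the bytes 6D 02 00 00 is 621.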

In the official Python script, read_uint8 returns a uint8 value, read_float32 returns a float32 value, and so on.

Unless necessary, this project does not handle z and m for now.
