nemequ / squash-corpus

Designing a new corpus for lossless general-purpose compression

Databases #8

Open nemequ opened 9 years ago

nemequ commented 9 years ago

There are two interesting subsets here (that I can think of). The obvious one is backups for things like MySQL, PostgreSQL, etc. The MusicBrainz database could be a good source here.

The other use case is a piece of data from a database which stores its contents compressed on disk. For example, LevelDB compresses chunks of data (IIRC they are each 4 KiB, but I'm not certain) before persisting them to disk.
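To make the second case concrete, here is a rough sketch of block-wise compression, where every block is compressed independently so it can be read back on its own. It is a simplification under stated assumptions: the 4 KiB block size is just the guess from above, zlib stands in for whatever codec the database really uses (LevelDB defaults to Snappy as far as I know), and real LevelDB blocks hold key/value entries rather than arbitrary byte slices. The names `compress_blocks` and `decompress_blocks` are made up for illustration.

```python
import zlib
from typing import List

BLOCK_SIZE = 4 * 1024  # assumed block size, based on the "4 KiB" guess above


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into fixed-size blocks and compress each one independently."""
    return [
        zlib.compress(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def decompress_blocks(blocks: List[bytes]) -> bytes:
    """Decompress every block and concatenate the results."""
    return b"".join(zlib.decompress(block) for block in blocks)


# 10,000 bytes of repetitive sample data: two full blocks plus one partial block.
sample = b"key=value;" * 1000
blocks = compress_blocks(sample)
restored = decompress_blocks(blocks)
```

The point for the corpus is that a codec only ever sees inputs of about one block in this scenario, so a good test file for it would be a single small block rather than a whole database.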

nemequ commented 8 years ago

Spoke to some people on IRC in the postgresql channel. For postgres there are several options for backups, but if we need to choose one, apparently the best choice would be pg_dump -F p (the plain-text SQL script format).
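For reference, a small sketch of how that dump could be scripted. It only builds the command line and does not run it; the database name `musicbrainz`, the output file name, and the helper `pg_dump_plain_command` are placeholders I made up. `-F p` selects the plain format and `-f` names the output file.

```python
import shlex
from typing import List


def pg_dump_plain_command(dbname: str, out_path: str) -> List[str]:
    """Build (but do not run) a pg_dump command for a plain-text SQL dump."""
    return ["pg_dump", "-F", "p", "-f", out_path, dbname]


# Hypothetical database and file names, for illustration only.
cmd = pg_dump_plain_command("musicbrainz", "musicbrainz.sql")
print(shlex.join(cmd))
```

On a machine with a running PostgreSQL server, the list could be passed to `subprocess.run(cmd, check=True)` to actually produce the file.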

1 GiB is apparently pretty much the smallest database considered to be of respectable size, and that is obviously way too large to include in this corpus. There is a Sample Databases page on the postgres wiki, but they're all pretty large. Perhaps load one into postgres and delete a random set of rows (like 90% of them), then use that…
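A minimal sketch of the row-thinning idea, using an in-memory SQLite database from the Python standard library as a stand-in for postgres so it runs anywhere. The table `artist`, its columns, and the helper `thin_table` are invented for the example. A fixed seed keeps the result reproducible, which matters if the corpus file is ever regenerated.

```python
import random
import sqlite3


def thin_table(conn: sqlite3.Connection, table: str,
               keep_fraction: float, seed: int = 0) -> int:
    """Delete a random subset of rows so that keep_fraction of them remain."""
    rng = random.Random(seed)  # seeded so the same rows are removed every run
    rowids = [row[0] for row in conn.execute(f"SELECT rowid FROM {table}")]
    keep_count = round(len(rowids) * keep_fraction)
    doomed = rng.sample(rowids, len(rowids) - keep_count)
    conn.executemany(f"DELETE FROM {table} WHERE rowid = ?",
                     [(rowid,) for rowid in doomed])
    conn.commit()
    return keep_count


# Hypothetical table with 1000 rows; keep 10% of them (delete 90%).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO artist (name) VALUES (?)",
                 [(f"artist {i}",) for i in range(1000)])
thin_table(conn, "artist", keep_fraction=0.1)
remaining = conn.execute("SELECT COUNT(*) FROM artist").fetchone()[0]
```

One caveat with a real schema: tables linked by foreign keys can't be thinned independently, since deleting rows at random would either fail or cascade, so the sampling would probably have to start from the parent tables.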