nemequ / squash-corpus

Designing a new corpus for lossless general-purpose compression

RPC requests/responses / data packets #9

Open nemequ opened 9 years ago

nemequ commented 9 years ago

The corpus should include small chunks of data of the kind that are often compressed: some JSON, and maybe some Protocol Buffers, MessagePack, or BSON?

nemequ commented 8 years ago

Added technion.json. It may still be a good idea to add some Protocol Buffers data (though I don't have any good ideas about where to get it), so I'm leaving this open for now.

nemequ commented 7 years ago

I ended up removing technion.json for licensing reasons, but I'd really like to re-add it. It's just the response from a query to the Google Geocoding API for the building which houses the computer science department at Technion, where Abraham Lempel and Jacob Ziv were working when they created LZ77/78.

The URL (which you need an API key for) is https://maps.googleapis.com/maps/api/geocode/json?place_id=ChIJxaX-CJG6HRURcYTahYYyZ9A&key=YOUR_API_KEY
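
For anyone wanting to reproduce the request, here is a minimal sketch using only the Python standard library. The endpoint, `place_id`, and `key` parameter come from the URL above; the function names `build_geocode_url` and `fetch_geocode` are hypothetical, and the API key is a placeholder you would need to supply yourself.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and place ID taken from the URL quoted in this comment.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"
PLACE_ID = "ChIJxaX-CJG6HRURcYTahYYyZ9A"


def build_geocode_url(api_key: str, place_id: str = PLACE_ID) -> str:
    """Build the request URL (hypothetical helper, not part of any library)."""
    query = urllib.parse.urlencode({"place_id": place_id, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def fetch_geocode(api_key: str) -> dict:
    """Perform the request and parse the JSON body.

    This makes a real network call and needs a valid API key.
    """
    with urllib.request.urlopen(build_geocode_url(api_key)) as response:
        return json.load(response)
```

Note that `fetch_geocode` returns parsed data; for a corpus file you would presumably want to save the raw response bytes (`response.read()`) instead, since re-serializing the JSON may change whitespace.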

The content is

{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "Henry and Marilyn Taub & Family Science and Technology Center -Faculty of Computer Sciences",
               "short_name" : "Henry and Marilyn Taub & Family Science and Technology Center -Faculty of Computer Sciences",
               "types" : [ "establishment", "point_of_interest" ]
            },
            {
               "long_name" : "Haifa",
               "short_name" : "Haifa",
               "types" : [ "locality", "political" ]
            },
            {
               "long_name" : "Haifa",
               "short_name" : "Haifa",
               "types" : [ "administrative_area_level_2", "political" ]
            },
            {
               "long_name" : "Haifa District",
               "short_name" : "Haifa District",
               "types" : [ "administrative_area_level_1", "political" ]
            },
            {
               "long_name" : "Israel",
               "short_name" : "IL",
               "types" : [ "country", "political" ]
            }
         ],
         "formatted_address" : "Henry and Marilyn Taub & Family Science and Technology Center -Faculty of Computer Sciences, Haifa, Israel",
         "geometry" : {
            "location" : {
               "lat" : 32.7777324,
               "lng" : 35.0216216
            },
            "location_type" : "APPROXIMATE",
            "viewport" : {
               "northeast" : {
                  "lat" : 32.77908138029149,
                  "lng" : 35.02297058029149
               },
               "southwest" : {
                  "lat" : 32.77638341970849,
                  "lng" : 35.02027261970849
               }
            }
         },
         "place_id" : "ChIJxaX-CJG6HRURcYTahYYyZ9A",
         "types" : [ "establishment", "point_of_interest", "university" ]
      }
   ],
   "status" : "OK"
}
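
To get a rough feel for how a small response like this behaves under compression, here is a sketch using the codecs that ship with Python (zlib, bz2, lzma). It is only an illustration: `SAMPLE` is an abridged excerpt of the response above, `compressed_sizes` is a hypothetical helper, and these codecs are stand-ins rather than the set the corpus is meant to be benchmarked with.

```python
import bz2
import json
import lzma
import zlib

# Abridged excerpt of the response above (assumption: the real file would be used in practice).
SAMPLE = {
    "results": [
        {
            "address_components": [
                {"long_name": "Haifa", "short_name": "Haifa",
                 "types": ["locality", "political"]},
                {"long_name": "Haifa District", "short_name": "Haifa District",
                 "types": ["administrative_area_level_1", "political"]},
                {"long_name": "Israel", "short_name": "IL",
                 "types": ["country", "political"]},
            ],
            "place_id": "ChIJxaX-CJG6HRURcYTahYYyZ9A",
        }
    ],
    "status": "OK",
}


def compressed_sizes(data: bytes) -> dict:
    """Return the raw size and the compressed size for each stdlib codec."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data)),
    }


# Compare an indented serialization with a minified one.
pretty = json.dumps(SAMPLE, indent=3).encode("utf-8")
compact = json.dumps(SAMPLE, separators=(",", ":")).encode("utf-8")

for label, payload in (("pretty", pretty), ("compact", compact)):
    print(label, compressed_sizes(payload))
```

On inputs this small, container and header overhead can make up a noticeable share of the output, which is part of why small packets are worth having in a corpus.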

Someone suggested that the easiest way to get an appropriate license applied to it would be to include it as test data for Brotli. @eustas, is this something you'd be willing/able to help with?

eustas commented 7 years ago

Hello. I'll see what I can do. This looks like a good real-life sample that is worth measuring against.

luvarqpp commented 4 years ago

Hi, for protobuf data, you could look at PBF-formatted map data from OpenStreetMap. Some country extracts can be found here, for example: https://download.gisgraphy.com/openstreetmap/pbf/

nemequ commented 4 years ago

@luvarqpp, I like the idea of using OSM data, thanks! This seems like it would be perfect for representing a stream of RPC responses.
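
As a sketch of how a `.pbf` file could be cut into individual messages: to my understanding of the OSM PBF format, the file is a sequence of frames, each consisting of a 4-byte big-endian length, a protobuf `BlobHeader` (field 1 `type`, field 2 `indexdata`, field 3 `datasize`), and then a `Blob` of `datasize` bytes. The code below hand-parses just enough of the protobuf wire format to read that header, so it needs no protobuf library. The function names are hypothetical, and treating each `Blob` as one "RPC response" is an assumption about how the data would be used.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a protobuf base-128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_blob_header(buf: bytes) -> Tuple[str, int]:
    """Extract (type, datasize) from a serialized BlobHeader."""
    pos = 0
    blob_type, datasize = "", 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire_type = key >> 3, key & 7
        if wire_type == 0:  # varint field
            value, pos = read_varint(buf, pos)
            if field == 3:
                datasize = value
        elif wire_type == 2:  # length-delimited field
            length, pos = read_varint(buf, pos)
            if field == 1:
                blob_type = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type}")
    return blob_type, datasize


def iter_blobs(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (type, serialized Blob bytes) for each frame in the stream."""
    while True:
        prefix = stream.read(4)
        if not prefix:
            return  # clean end of file
        if len(prefix) != 4:
            raise ValueError("truncated length prefix")
        (header_len,) = struct.unpack(">I", prefix)
        blob_type, datasize = parse_blob_header(stream.read(header_len))
        yield blob_type, stream.read(datasize)


if __name__ == "__main__":
    # Synthetic single-frame stream: type="OSMData", datasize=5, payload "hello".
    header = b"\x0a\x07OSMData\x18\x05"
    demo = io.BytesIO(struct.pack(">I", len(header)) + header + b"hello")
    for kind, blob in iter_blobs(demo):
        print(kind, len(blob))
```

For a real extract you would open the file in binary mode and pass it to `iter_blobs`. Keep in mind that each `Blob` is itself a protobuf message whose payload is typically already compressed (zlib, as far as I know), so a corpus entry would probably want the decompressed contents rather than the blob bytes as stored.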