Open axch opened 9 years ago
Cardinalities of categorical columns in satellites:
Country_of_Operator: 79
Operator_Owner: 346
Users: 18
Purpose: 46
Class_of_Orbit: 4
Type_of_Orbit: 8
Contractor: 282
Country_of_Contractor: 54
Launch_Site: 25
Launch_Vehicle: 141
Source_Used_for_Orbital_Data: 38
Or ask GUESS to guess IGNORE for these. Large also, I think, means large relative to dataset size.
There are 79 distinct countries and 346 owners, and 356 distinct country-owner pairs. The distribution is very skewed for countries, and a little less so for owners:
Country:
USA: 486
Russia: 117
China (PR): 117
Multinational: 55
Japan: 43
India: 31
United Kingdom: 25
Germany: 22
Canada: 22
ESA: 18

Owner:
Iridium Satellite LLC: 71
Ministry of Defense: 67
Globalstar: 47
SES (Société Européenne des Satellites (SES)): 38
Intelsat, Ltd.: 33
DoD/US Air Force: 32
Chinese Academy of Space Technology (CAST): 28
ORBCOMM Inc.: 28
People's Liberation Army (C41): 27
European Telecommunications Satellite Consortium (EUTELSAT): 25
Indian Space Research Organization (ISRO): 24
National Reconnaissance Office (NRO): 21
US Air Force: 21
Russian Defense Ministry: 16
Chinese Defense Ministry: 15

Country--Owner pair:
USA--Iridium Satellite LLC: 71
Russia--Ministry of Defense: 62
USA--Globalstar: 47
USA--Intelsat, Ltd.: 33
USA--DoD/US Air Force: 32
USA--ORBCOMM Inc.: 28
China (PR)--Chinese Academy of Space Technology (CAST): 28
China (PR)--People's Liberation Army (C41): 27
Multinational--European Telecommunications Satellite Consortium (EUTELSAT): 25
India--Indian Space Research Organization (ISRO): 24
USA--US Air Force: 21
USA--National Reconnaissance Office (NRO): 21
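For anyone who wants to reproduce counts of this kind, here is a minimal sketch using only the Python standard library. The `rows` list is made-up toy data, not the satellites table, and the helper name `top_counts` is hypothetical.

```python
from collections import Counter

# Toy (country, owner) rows; illustrative only, not the real satellites data.
rows = [
    ("USA", "Iridium Satellite LLC"),
    ("USA", "Iridium Satellite LLC"),
    ("USA", "Globalstar"),
    ("Russia", "Ministry of Defense"),
    ("China (PR)", "Ministry of Defense"),
]

def top_counts(values, n=10):
    """Return (number of distinct values, the n most common values with counts)."""
    counts = Counter(values)
    return len(counts), counts.most_common(n)

# Distinct countries, distinct owners, and distinct (country, owner) pairs.
n_countries, top_countries = top_counts(country for country, _ in rows)
n_owners, top_owners = top_counts(owner for _, owner in rows)
n_pairs, top_pairs = top_counts(rows)

print(n_countries, n_owners, n_pairs)
```

When the number of distinct pairs exceeds the number of distinct owners, at least one owner appears under more than one country.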
I think that, aside from owner being a large categorical, the issue is that there really isn't a one-to-one mapping for the most common cases. Owners go with not only their countries, but also with "International" and "Multinational", and of course countries have multiple owners.
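One way to check how far the owner-to-country relationship is from one-to-one is to collect, for each value on one side, the set of values it co-occurs with on the other. The sketch below assumes toy rows (not the real table), and the helper `fan_out` is a hypothetical name.

```python
from collections import defaultdict

# Toy (country, owner) rows; illustrative only, not the real satellites data.
rows = [
    ("USA", "Iridium Satellite LLC"),
    ("USA", "Globalstar"),
    ("Russia", "Ministry of Defense"),
    ("China (PR)", "Ministry of Defense"),
]

def fan_out(pairs):
    """Map each left value to the set of right values it co-occurs with."""
    seen = defaultdict(set)
    for left, right in pairs:
        seen[left].add(right)
    return seen

owners_per_country = fan_out(rows)
countries_per_owner = fan_out((owner, country) for country, owner in rows)

# Owners that appear under more than one country break the one-to-one mapping.
ambiguous_owners = sorted(
    owner for owner, countries in countries_per_owner.items() if len(countries) > 1
)
print(ambiguous_owners)
```

Countries with more than one owner can be found the same way from `owners_per_country`.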
We could "fix" this by lowering the threshold fraction of the dataset size at which GUESS declares a variable to be a large categorical to below 346/1179 (about 0.29), so perhaps to around 0.25? My initial guess for a good value for that threshold, based on gut feeling alone, was 0.9, and that's probably naïve. This parameter is called distinct_ratio.
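To make the arithmetic concrete, here is a rough sketch of how such a threshold would behave on a few of the cardinalities listed above. The function `is_large_categorical` and its strict-inequality rule are assumptions for illustration; the actual comparison inside GUESS may differ.

```python
def is_large_categorical(n_distinct, n_rows, distinct_ratio):
    """Hypothetical rule: 'large' when distinct values exceed distinct_ratio of the rows."""
    return n_distinct / n_rows > distinct_ratio

N_ROWS = 1179  # dataset size implied by the 346/1179 figure in this issue

# A few of the column cardinalities reported in this issue.
cardinalities = {
    "Operator_Owner": 346,
    "Contractor": 282,
    "Launch_Vehicle": 141,
    "Country_of_Operator": 79,
}

# Which columns each candidate threshold would flag as large categoricals.
flagged_by_threshold = {
    threshold: [
        column
        for column, n_distinct in cardinalities.items()
        if is_large_categorical(n_distinct, N_ROWS, threshold)
    ]
    for threshold in (0.9, 0.25)
}
print(flagged_by_threshold)
```

Under this assumed rule, 0.9 flags none of these columns, while 0.25 flags only Operator_Owner (346/1179 is about 0.29); Contractor (282/1179, about 0.24) falls just under the line, which shows how sensitive the outcome is to the exact value chosen.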
I'd like comments on that and the other values nearby, and then I'm glad to make the change.
The operator_owner field of satellites is large in this sense.
It may also help for the schema to accept the size of a categorical as a parameter (but what should happen if that size is wrong? Treat it as an upper bound?)
It would probably also help for GUESS(*) to surface the sizes of the categoricals.