probcomp / bdbcontrib

BayesDB contributions, including plotting, helper methods, and examples
http://probcomp.csail.mit.edu/bayesdb
Apache License 2.0

Our demos should not have large categoricals, because those mess us up #63

Open axch opened 9 years ago

axch commented 9 years ago

The operator_owner field of satellites is large in this sense.

It may also help for the schema to accept the size of each categorical as a parameter (but what should happen if that size is wrong? Treat it as an upper bound?)

It would probably also help for GUESS(*) to surface the sizes of the categoricals.

axch commented 9 years ago

Cardinalities of categorical columns in satellites:

- Country_of_Operator: 79
- Operator_Owner: 346
- Users: 18
- Purpose: 46
- Class_of_Orbit: 4
- Type_of_Orbit: 8
- Contractor: 282
- Country_of_Contractor: 54
- Launch_Site: 25
- Launch_Vehicle: 141
- Source_Used_for_Orbital_Data: 38
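For anyone who wants to reproduce a tally like this, here is a minimal sketch using only the standard library. It assumes the rows are available as dictionaries (for example from `csv.DictReader`); the function name `cardinalities` and the toy rows are illustrative and are not part of bdbcontrib.

```python
def cardinalities(rows, columns):
    """Return {column: number of distinct non-empty values}."""
    seen = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            # Treat missing and empty cells as "no value" rather than a category.
            if value not in (None, ""):
                seen[col].add(value)
    return {col: len(values) for col, values in seen.items()}


# Toy rows standing in for the satellites table (made-up subset).
rows = [
    {"Class_of_Orbit": "LEO", "Users": "Commercial"},
    {"Class_of_Orbit": "GEO", "Users": "Commercial"},
    {"Class_of_Orbit": "LEO", "Users": "Military"},
    {"Class_of_Orbit": "MEO", "Users": ""},
]
print(cardinalities(rows, ["Class_of_Orbit", "Users"]))
```

Whether empty cells should count as their own category is a design choice; this sketch leaves them out.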

gregory-marton commented 8 years ago

Or ask GUESS to guess IGNORE for these. Large also, I think, means large relative to dataset size.

gregory-marton commented 8 years ago

There are 79 distinct countries and 346 owners, and 356 distinct country-owner pairs. The distribution is very skewed for countries, and a little less so for owners:

Country:

- USA: 486
- Russia: 117
- China (PR): 117
- Multinational: 55
- Japan: 43
- India: 31
- United Kingdom: 25
- Germany: 22
- Canada: 22
- ESA: 18

Owner:

- Iridium Satellite LLC: 71
- Ministry of Defense: 67
- Globalstar: 47
- SES (Société Européenne des Satellites (SES)): 38
- Intelsat, Ltd.: 33
- DoD/US Air Force: 32
- Chinese Academy of Space Technology (CAST): 28
- ORBCOMM Inc.: 28
- People's Liberation Army (C41): 27
- European Telecommunications Satellite Consortium (EUTELSAT): 25
- Indian Space Research Organization (ISRO): 24
- National Reconnaissance Office (NRO): 21
- US Air Force: 21
- Russian Defense Ministry: 16
- Chinese Defense Ministry: 15

Country--Owner:

- USA--Iridium Satellite LLC: 71
- Russia--Ministry of Defense: 62
- USA--Globalstar: 47
- USA--Intelsat, Ltd.: 33
- USA--DoD/US Air Force: 32
- USA--ORBCOMM Inc.: 28
- China (PR)--Chinese Academy of Space Technology (CAST): 28
- China (PR)--People's Liberation Army (C41): 27
- Multinational--European Telecommunications Satellite Consortium (EUTELSAT): 25
- India--Indian Space Research Organization (ISRO): 24
- USA--US Air Force: 21
- USA--National Reconnaissance Office (NRO): 21

I think that, aside from owner being a large categorical, the issue is that there really isn't a one-to-one mapping for the most common cases. Owners go with not only their countries, but also with "International" and "Multinational", and of course countries have multiple owners.
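The counts already hint at this: Ministry of Defense appears 67 times overall but only 62 times paired with Russia, and there are 356 distinct pairs for 346 owners. A rough sketch of how one might list the owners that break the mapping is below; the function name and the placeholder pairs are hypothetical, not taken from the dataset or from bdbcontrib.

```python
from collections import defaultdict


def owners_with_multiple_countries(pairs):
    """Given (country, owner) pairs, return {owner: sorted countries}
    for every owner that appears with more than one country."""
    countries_by_owner = defaultdict(set)
    for country, owner in pairs:
        countries_by_owner[owner].add(country)
    return {
        owner: sorted(countries)
        for owner, countries in countries_by_owner.items()
        if len(countries) > 1
    }


# Placeholder pairs; "Owner A" etc. are made up for illustration.
pairs = [
    ("USA", "Owner A"),
    ("Multinational", "Owner A"),
    ("Russia", "Owner B"),
    ("USA", "Owner C"),
]
print(owners_with_multiple_countries(pairs))
```

Swapping the roles of country and owner in the same function gives the countries with multiple owners, which is the expected direction of the one-to-many relationship.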

We could "fix" this by lowering distinct_ratio, the threshold fraction of the dataset size at which guess declares a variable to be a large categorical. It would need to be under 346/1179 (about 0.29), so perhaps around 0.25? My initial guess for a good value, based on gut feeling alone, was 0.9, and that's probably naïve.
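To make the arithmetic concrete, here is a sketch of the kind of check being discussed. It is not the actual guess code: the function name `is_large_categorical` and the strict `>` comparison are assumptions, and only the parameter name distinct_ratio and the counts come from this thread.

```python
def is_large_categorical(n_distinct, n_rows, distinct_ratio):
    """Assumed rule: a column is 'large' when its distinct values exceed
    distinct_ratio as a fraction of the number of rows."""
    # float() keeps the division exact-ish under Python 2 as well.
    return n_distinct / float(n_rows) > distinct_ratio


N_ROWS = 1179  # dataset size used in the 346/1179 figure above

# Operator_Owner has 346 distinct values: 346/1179 is about 0.29.
print(is_large_categorical(346, N_ROWS, 0.9))   # not flagged at 0.9
print(is_large_categorical(346, N_ROWS, 0.25))  # flagged at 0.25

# Contractor has 282 distinct values: 282/1179 is about 0.24,
# so it sits just under a 0.25 threshold.
print(is_large_categorical(282, N_ROWS, 0.25))
```

Under this assumed rule, 0.25 would catch Operator_Owner but not Contractor, which may be worth keeping in mind when picking the value.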

I'd like comments on that and the other values nearby, and then I'm glad to make the change.