Use of Total Population in Dissimilarity Index

PFurst2000 commented 1 year ago

My understanding of the dissimilarity index is that it typically compares a smaller population group to a larger population group. The formula for this package suggests that one compares a smaller population to the total population. Is this an error, or is there a rationale for this design? How would one set up a query to analyze an example such as Black non-Hispanics vs White non-Hispanics? When I replaced Total Population with White non-Hispanics, I received the following error: ValueError: Group of interest population must equal or lower than the total population of the units.

My query design for this error is below:

ac = Dissim(group_data, group_pop_var='BlackNH_20', total_pop_var='WhiteNH_20')
result = ac.statistic
results.append({'CouSubDiv': group, 'Dissim_BlackNH_20': result})

knaaptime commented 1 year ago

Hey, thanks for raising this. Good point, and a couple of responses. The literature typically distinguishes between "single-group" and "multigroup" indices, and in the package the APIs for the two are different, which may contribute to the confusion here.

> My understanding of the dissimilarity index is that it typically compares a smaller population to a larger population group. The formula for this package suggests that one compares a smaller population to the total population.

yep. These aren't mutually exclusive, so in the case that you want a "single-group" index, your total population is group1 + not_group1.

In your example, you just need to create an intermediate column to hold the 'total':

from segregation.singlegroup import Dissim

# create a new total that stores sum(black, white)
# alternatively you might look at black vs non-black, so this gives you control over your reference population
group_data['total_blackwhite'] = group_data['BlackNH_20'] + group_data['WhiteNH_20']
# pass in the new total. Now your stat refers to black versus white.
ac = Dissim(group_data, group_pop_var='BlackNH_20', total_pop_var='total_blackwhite')
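To see why the summed total gives a Black-versus-White statistic, here is a minimal pure-Python sketch. It assumes the single-group index takes the textbook form D = sum(t_i * |p_i - P|) / (2 * T * P * (1 - P)); it does not call the package, and the unit counts and function names are made up for illustration. With total = black + white, that form matches the classic two-group formula 0.5 * sum(|b_i/B - w_i/W|).

```python
def single_group_dissim(group, total):
    """Textbook single-group form: sum(t_i * |p_i - P|) / (2 * T * P * (1 - P))."""
    T = sum(total)                      # total population across all units
    P = sum(group) / T                  # overall share of the group of interest
    num = sum(t * abs(g / t - P) for g, t in zip(group, total) if t > 0)
    return num / (2 * T * P * (1 - P))


def two_group_dissim(a, b):
    """Classic two-group form: 0.5 * sum(|a_i/A - b_i/B|)."""
    A, B = sum(a), sum(b)
    return 0.5 * sum(abs(x / A - y / B) for x, y in zip(a, b))


# Hypothetical counts for four areal units (not real data).
black = [10, 40, 25, 5]
white = [90, 20, 25, 45]

# Same idea as the 'total_blackwhite' column: the total is black + white only.
total_blackwhite = [b + w for b, w in zip(black, white)]

d_single = single_group_dissim(black, total_blackwhite)
d_classic = two_group_dissim(black, white)
print(d_single, d_classic)  # the two values agree up to floating-point error
```

This also suggests where the ValueError comes from: in the single-group form the group count in each unit is a share of that unit's total, so the "total" column presumably cannot be smaller than the group column in any unit.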

You could also treat this as a multigroup problem with only two groups. In that case you would do:

from segregation.multigroup import MultiDissim

ac = MultiDissim(group_data, groups=['BlackNH_20', 'WhiteNH_20'])

in that case, your total population is the sum of all input groups, so it's created for you.

> How would one set up a query to analyze an example such as Black non-Hispanics vs White non-Hispanics?

I think the code above solves your problem, but I'd add that it depends on how you want to set the problem up. Do you want to understand how blacks are segregated in a multiracial place (and how whites are segregated in a multiracial place), then compare the two numbers? Or do you want to ignore other population groups and focus on a single segregation statistic that examines how blacks and whites are distributed relative to one another?

Giving you control over that research question is ultimately why the singlegroup and multigroup indices have different inputs. The singlegroup approach assumes you care about one group versus anybody else. The multigroup approach assumes you've given an exhaustive list of the groups you care about.
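As a rough illustration of how much the choice of reference population matters, the sketch below compares "black vs white" with "black vs everyone else" using the classic two-group formula. It is pure Python and does not call the package; the counts, the third `other` group, and the helper name are hypothetical.

```python
def two_group_dissim(a, b):
    """Classic two-group dissimilarity: 0.5 * sum(|a_i/A - b_i/B|)."""
    A, B = sum(a), sum(b)
    return 0.5 * sum(abs(x / A - y / B) for x, y in zip(a, b))


# Hypothetical counts for four areal units (not real data).
black = [10, 40, 25, 5]
white = [90, 20, 25, 45]
other = [0, 40, 50, 10]   # every remaining group, lumped together

# Reference population 1: whites only (other groups are ignored).
d_black_white = two_group_dissim(black, white)

# Reference population 2: everyone who is not black.
non_black = [w + o for w, o in zip(white, other)]
d_black_rest = two_group_dissim(black, non_black)

print(d_black_white, d_black_rest)  # same data, two different answers
```

In package terms, the first number corresponds to passing a black + white column as total_pop_var, and the second to passing the full total population column.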

PFurst2000 commented 1 year ago

Thank you Eli! It is very nice that the software package is so flexible. Your explanation makes sense and I was able to successfully run both options. I am going to close this issue as it is resolved.

pysal / segregation

Use of Total Population in Dissimilarity Index #213