tekasian / intro-data-capstone-biodiversity

category_counts table is slightly incorrect #3

Open asadmehdi785 opened 6 years ago

asadmehdi785 commented 6 years ago

https://github.com/tekasian/intro-data-capstone-biodiversity/blob/297a56430354b6ef4894b9f0bcaa2b78767381c9/Biodiversity_Capstone_Project_Bryan_Leung/biodiversity.py#L165

With this code, we will get a category_counts table which looks like this:

[image: category_counts table computed with count()]

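However, this is slightly incorrect. We should use nunique() instead of count(). count() counts every non-null row in each group, so a species whose scientific_name appears in more than one row is counted more than once; nunique() counts each distinct scientific_name only once. For example, for this code: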

category_counts = species.groupby(['category', 'is_protected'])\
                         .scientific_name.nunique().reset_index()

We will get this table:

[image: category_counts table computed with nunique()]

The values are only slightly off, but the difference can matter in any later calculation built on these counts. Just wanted to point that out!
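
To illustrate the difference, here is a minimal sketch on a made-up DataFrame. The rows are hypothetical and are not taken from the project's species data; only the column names and the groupby call follow the code above.

```python
import pandas as pd

# Hypothetical rows for illustration only; not the project's real species data.
# Two scientific names are deliberately listed twice.
species = pd.DataFrame({
    "category": ["Mammal", "Mammal", "Mammal", "Bird", "Bird"],
    "scientific_name": ["Canis lupus", "Canis lupus", "Bos bison",
                        "Accipiter cooperii", "Accipiter cooperii"],
    "is_protected": [True, True, False, False, False],
})

grouped = species.groupby(["category", "is_protected"]).scientific_name

# count() tallies every non-null row, so a repeated name is counted again.
counts_with_count = grouped.count().reset_index()

# nunique() tallies distinct names within each group.
counts_with_nunique = grouped.nunique().reset_index()

print(counts_with_count)
print(counts_with_nunique)
```

In this toy data, count() reports 2 for protected mammals while nunique() reports 1, because the same scientific name is listed twice in that group.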

tekasian commented 6 years ago

Very important. In other projects I'm working on (on Kaggle), nunique is used more often so that duplicated data isn't counted twice. Thanks for pointing this out.