jonathanvx / SR24

USDA National Nutrient Database for Standard Reference
3 stars 1 forks source link

Sum of nutr_val is higher than 100gr #1

Open xavierpena opened 8 years ago

xavierpena commented 8 years ago

Hi and thanks for this fantastic compilation of data.

I made a tutorial about how to set up the data (while I was doing it for myself, I thought it would be a good idea to share it): http://localcode.wikidot.com/playing-with-a-food-database

Maybe github is not the right place to ask about how the underlying data works, but using your database I made the sum of nutr_value for a single food description (in this case: "blue cheese") focusing only in units that were in grams. As I understand it, the values are "per 100gr of food", so I expected a sum lower than 100. I got 177 instead. The error is more likely to be because of my interpretation of the data rather than an error in the database, but I wanted to ask in case you or anyone else understood what was going on here.

SELECT 
    sum(nutr_val)
FROM
    nutrient_data ndata
        JOIN
    nutrient_definition ndef ON ndata.nutr_no = ndef.nutr_no
        JOIN
    food_desc fdesc ON fdesc.ndb_no = ndata.ndb_no
        JOIN
    food_grp_desc fgrpdesc ON fgrpdesc.fdgrp_cd = fdesc.fdgrp_cd 
WHERE
    long_desc = 'Cheese, blue' and units_text = 'g';

Note: to run this query you need to first transform units (in blob in the original database) to unit_text (as varchar). As explained in my blog post, it's done like so:

ALTER TABLE nutrient_definition ADD units_text VARCHAR(10);
UPDATE nutrient_definition SET units_text=CAST(units AS CHAR(10000) CHARACTER SET latin1);
jonathanvx commented 8 years ago

I think you are missing another condition on the where clause as it appears you are summing different nutrition categories together.

Use this query and see which additional condition you may need:

SELECT
    ndata.nutr_val,ndef.nutr_no, ndef.nutrdesc
FROM
    nutrient_data ndata
        JOIN
    nutrient_definition ndef ON ndata.nutr_no = ndef.nutr_no
        JOIN
    food_desc fdesc ON fdesc.ndb_no = ndata.ndb_no
        JOIN
    food_grp_desc fgrpdesc ON fgrpdesc.fdgrp_cd = fdesc.fdgrp_cd
WHERE
    long_desc = 'Cheese, blue' and units_text = 'g';
xavierpena commented 8 years ago

Thank you for your answer.

As you can see, I know little about nutrition. I am just a developer trying to make sense of the data.

different nutrition categories

I re-examined the tables and I don't know how to find these different nutrition categories. The only plausible hypothesis I could come up with is:

I don't know if you could confirm if that makes sense or not.

jonathanvx commented 8 years ago

Yeah, I think there is some duplication. For example: SELECT ndata.nutr_val,ndef.nutr_no, ndef.nutrdesc FROM nutrient_data ndata JOIN nutrient_definition ndef ON ndata.nutr_no = ndef.nutr_no JOIN food_desc fdesc ON fdesc.ndb_no = ndata.ndb_no JOIN food_grp_desc fgrpdesc ON fgrpdesc.fdgrp_cd = fdesc.fdgrp_cd WHERE long_desc = 'Cheese, blue' and units_text = 'g' and (ndef.nutrdesc like '%lipid%' or ndef.nutrdesc like '%fat%') ;

+----------+---------+------------------------------------+ | nutr_val | nutr_no | nutrdesc | +----------+---------+------------------------------------+ | 28.74000 | 204 | Total lipid (fat) | | 18.66900 | 606 | Fatty acids, total saturated | | 7.77800 | 645 | Fatty acids, total monounsaturated | | 0.80000 | 646 | Fatty acids, total polyunsaturated | +----------+---------+------------------------------------+

You have total fat and then a breakdown of 3 types of fats.

jonathanvx commented 8 years ago

See what happens when you exclude on the word 'Total'

xavierpena commented 8 years ago

Yeah, I think there is some duplication.

OK, that makes sense.

See what happens when you exclude on the word 'Total'

The results are:

ndef.nutrdesc NOT like '%total%'
sum(nutr_val) = 120.52000

and:

ndef.nutrdesc like '%total%'
sum(nutr_val) = 56.48700

anyway, in your example you had:

1 "super-group": Total lipid (fat)

3 "sub-groups":

...all of them containing the word "total", so it seems difficult to make a breakdown of nutrients and their sub-nutrients through naming conventions.

I think the data makes sense regardless, it is normal that you need to be an expert on nutrients to fully grasp the the meaning of the data. It's just good to know how it is organized: the conclusion is that nutrients and "its sub-nutrients" are mixed in the same table.

jonathanvx commented 8 years ago

Would there be any benefit to using: ndef.nutrdesc NOT like 'Total%' or do you get similar issues?

xavierpena commented 8 years ago

With ndef.nutrdesc NOT like 'Total%', the result is sum(nutr_val) = 148.26700