matthiaskoenig / pkdb

Pharmacokinetics database
https://alpha.pk-db.com

Fix issues with group counts in frontend (better communication of group and parent group traits) #736

Open ReneGeci opened 1 year ago

ReneGeci commented 1 year ago

Hello,

I am currently exploring your database and I am confused about how to read it. Unfortunately, I have not been able to find a documented explanation.

Let's say I am interested in the timecourses from Ho2007 (PKDB00544). Especially, timecourse row 4069 which is the group "EuropeanAmericans". The "group_pk" of this group is 2099. When I now look up information about sex in "groups" for this group_pk, then I see 2 rows:

16553 Ho2007 PKDB00544 sex 69 European-Americans 37 sample mean 20980 M
16554 Ho2007 PKDB00544 sex 69 European-Americans 70 sample mean 20980 F

Now I do not understand how to interpret the "count" or "group_count" entries here. How many males and females are represented in the corresponding timecourse?

The "count" entries are 37 for males and 70 for females. But overall, in the entire group, there are only supposed to be 69 individuals...

This is confusing to me.

matthiaskoenig commented 1 year ago

Hi @ReneGeci,

thanks for your interest in the database. The problem is that certain group traits are not reported for all subgroups. In this case the paper reports that of the total 107 subjects 37 are male and 70 are female, but it does not report the sex for the subgroup of EuropeanAmericans. So we know that the 69 EuropeanAmericans were sampled from 107 subjects with a male/female ratio of 37/70. This does not say exactly how many males and females were in the subgroup, but it gives some indication of the subgroup's composition.

Unfortunately, this information is currently not rendered correctly in the frontend due to a bug (see screenshot).
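To make "some indication" concrete, here is a minimal sketch of the bounds that the parent-group numbers place on the subgroup. It uses only the figures quoted above (107 subjects, 37 male, 70 female, subgroup of 69); the function name `subgroup_bounds` is made up for illustration and is not part of PK-DB.

```python
def subgroup_bounds(trait_count: int, parent_size: int, subgroup_size: int) -> tuple[int, int]:
    """Smallest and largest number of subgroup members that can carry a trait,
    given only how many members of the parent group carry it."""
    others = parent_size - trait_count           # parent members without the trait
    lower = max(0, subgroup_size - others)       # subgroup filled with "others" first
    upper = min(trait_count, subgroup_size)      # subgroup filled with trait carriers first
    return lower, upper


# Numbers from the Ho2007 example in this thread.
males = subgroup_bounds(trait_count=37, parent_size=107, subgroup_size=69)
females = subgroup_bounds(trait_count=70, parent_size=107, subgroup_size=69)
print(males, females)
```

So the 69 EuropeanAmericans could include anywhere between 0 and 37 males, and correspondingly between 32 and 69 females. The parent-level counts narrow the range but do not pin it down.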

image

This should state 37/107 Male and 70/107 Female instead of 37/69 Male and >69 Female. I.e. the denominator is using the incorrect group count. The correct data is in the JSON https://alpha.pk-db.com/api/v1/groups/2736/?format=json

We will fix the frontend and indicate more clearly which characteristics were reported for a group and which have been inferred from a parent group.

I hope this helps. Changing the title of the issue.

Best Matthias

ReneGeci commented 1 year ago

Hey @matthiaskoenig, thanks a lot for the quick clarification! That makes a lot of sense.

Now, I am wondering of course how to best handle this when importing data.

In other words, how do I identify now if given trait information is referring to the parent group or to the subgroup for which the timecourse is provided?

In cases where the count is bigger than the subgroup size, it's obvious that the given trait information is not subgroup specific. But when it is smaller (say 3 smokers), it could still refer to either the subgroup (3 smokers out of 69 subgroup subjects) or the parent group (3 smokers out of, say, 107 subjects), right?

matthiaskoenig commented 1 year ago

You need the combination of count and group count. In addition you should have the group ids for all the data and the group hierarchy. I.e. in the group table you have a parent field which defines the parent groups. By using this information you can figure things out.
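One way to read this advice is sketched below. It is a heuristic under stated assumptions, not the PK-DB implementation: the dictionary layout, the parent pk `1` and the function name `reporting_group` are hypothetical, and only the field names `count`, `group_count` and `parent` come from this thread. For a trait whose categories cover everyone (such as sex), the counts are summed and the hierarchy is walked upwards until a group of exactly that size is found.

```python
from typing import Optional

# Hypothetical in-memory copy of the group table. The parent pk (1) and its
# name are made up; 2099, 69 and 107 are the values mentioned in this thread.
groups = {
    1: {"name": "all", "parent": None, "group_count": 107},
    2099: {"name": "European-Americans", "parent": 1, "group_count": 69},
}

# The two "sex" rows attached to group 2099.
sex_rows = [
    {"group_pk": 2099, "choice": "M", "count": 37},
    {"group_pk": 2099, "choice": "F", "count": 70},
]


def reporting_group(group_pk: int, rows: list, groups: dict) -> Optional[int]:
    """Return the pk of the nearest group (the group itself or an ancestor)
    whose group_count equals the summed counts of the trait, or None."""
    total = sum(r["count"] for r in rows if r["group_pk"] == group_pk)
    pk = group_pk
    while pk is not None:
        if groups[pk]["group_count"] == total:
            return pk
        pk = groups[pk]["parent"]  # step up one level in the hierarchy
    return None


source = reporting_group(2099, sex_rows, groups)
print(source)
```

Here the counts sum to 107, which matches the parent group rather than the subgroup of 69, so the sex information was reported at the parent level. This only works when all categories of a trait are present; a single count such as "3 smokers" cannot be attributed this way.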

The problem is mostly the reporting in studies. We try to capture all the information on the groups, but often things are only provided for the complete study subjects, not subgroups.

ReneGeci commented 1 year ago

I fully understand that the literature is messy and very difficult to integrate. That's exactly why I am so happy about a database like PK-DB, but I am really unsure how to do this.

Now, if I am, for example, interested in how many females were involved in a timecourse, then I look up the count for "sex" where "choice" is "F". If, say, the count is 9 but the group_count is only 6, then I know that this information cannot be specifically about the subgroup in the timecourse. But if the count is equal to or smaller than the group_count, how can I be sure that it is? If I go up to the parent group and it also reports 9 females, then it could be that the subgroup is made up entirely of females and those are also all the females in the parent group. Or it could still be that the information provided for the subgroup was just the number of females in the entire group, correct? So, how would I know the difference?