General Issue: Recreate WHO growth chart data using code

zdz2101 commented 8 months ago

Background Information

From discussion #13:

For WHO, the reference data are at
https://www.who.int/toolkits/child-growth-standards/standards
You want to click on 'standards' and then on 'weight-for-age', length/height for age, BMI-for-age, or weight-for-length/height. And you want to look at the 'Expanded Tables' - this gives the L M S values by day of age.

One question with the WHO charts is whether people want to use weight-for-length or BMI-for age for infants. The official CDC guidance is to use weight-for-length, but this will result in some missing data for kids who have a length outside the range in the WHO data. There's a least 1 paper I know of that concludes that BMI-for-age is better, but I think the 2 metrics are very similar.

Recreate the data using tribble() for replicability as data source may change overtime and its easier to track than floating csv's Use snapshots for testthat

Definition of Done

See how #15 was done

[x] Render each necessary WHO dataset as specified above into an .rda object using tibble::tribble()
[ ] Use testthat::expect_snapshot() so we can track changes in the feature

ngiangre commented 8 months ago

Just adding a helpful package for completing this issue: datapasta

Pierre-Wallet commented 8 months ago

Hi @rossfarrugia @zdz2101, @CDC-DNPAO,

I have a few questions 😀.

Regarding the metadata: I noticed that the structure of WHO metadata is different from CDC. In WHO, there is one dataset per sex. I would then keep this format and recreate the internal metadata as they are from the WHO website. Is it ok to proceed this way? I will propose a naming convention similar to Zelos.

Regarding the technical aspect: I noticed Zelos created the programs that build the dataset in inst folder. It means that the end user will have access to it when installing the package. Is it something that we want or should we store these programs in a data-raw folder which is not accessible to the end user but only to the developer?

Thanks for your help!

CDC-DNPAO commented 8 months ago

Piere - you're right about the difference between the WHO and CDC reference files. I've always hated the separate WHO files for each sex because it involved more downloading, but whatever you think is the best way to store these files is absolutely fine with me.

It's hard for me to think that most people will actually want to see these reference data files, but there might be some very unusual person who wants to. So if it's easy to make them available, it would be a good idea. Honestly, they are available in my packages, but this is mostly because I didn't know what I was doing when building a package. I feel as if I've learned R by reading StackOverflow answers!

David

On Wed, Mar 6, 2024 at 6:06 AM Pierre-Wallet @.***> wrote:

Hi @rossfarrugia https://github.com/rossfarrugia @zdz2101 https://github.com/zdz2101, @CDC-DNPAO https://github.com/CDC-DNPAO,

I have a few questions 😀.

Regarding the metadata: I noticed that the structure of WHO metadata is different from CDC. In WHO, there is one dataset per sex. I would then keep this format and recreate the internal metadata as they are from the WHO website. Is it ok to proceed this way? I will propose a naming convention similar to Zelos.

Regarding the technical aspect: I noticed Zelos created the programs that build the dataset in inst folder. It means that the end user will have access to it when installing the package. Is it something that we want or should we store these programs in a data-raw folder which is not accessible to the end user but only to the developer?

Thanks for your help!

— Reply to this email directly, view it on GitHub https://github.com/pharmaverse/admiralpeds/issues/17#issuecomment-1980621268, or unsubscribe https://github.com/notifications/unsubscribe-auth/AY2XGPVOZJDG3SFYEQLHWETYW32ETAVCNFSM6AAAAABEAE7FWGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBQGYZDCMRWHA . You are receiving this because you were mentioned.Message ID: @.***>

CDC-DNPAO commented 8 months ago

Piere - I've thought about this a little more, and there actually may be some (5%?) users who would like to see the reference data, so it would be a good idea to make the data available (unless it's a lot more work for you)

There's an example of this is the Sitar package: just load it and type the command 'who06' or 'cdc2000'.

I do have a question: were you planning to leave all the files separate, or were you going to combine them in long or wide format? 'who06' shows the reference data in wide format for each week (0.0192 years) of age. LMS are the reference parameters and the values are given for height, weight, BMI, head circumference, etc. 'cdc2000 gives the data at age 24.0, 24.5, 25.5, etc months of age.

Another thing to note is that for CDC reference data, an age value of 24.5 represents all data for kids between 24.0 and 24.999 months of age. cdcanthro linearly interpolates the LMS values for people who actually know the child's age in days. This interpolation doesn't make a big difference in the estimated z-scores and percentiles

david

On Wed, Mar 6, 2024 at 12:26 PM David Freedman @.***> wrote:

Piere - you're right about the difference between the WHO and CDC reference files. I've always hated the separate WHO files for each sex because it involved more downloading, but whatever you think is the best way to store these files is absolutely fine with me.

It's hard for me to think that most people will actually want to see these reference data files, but there might be some very unusual person who wants to. So if it's easy to make them available, it would be a good idea. Honestly, they are available in my packages, but this is mostly because I didn't know what I was doing when building a package. I feel as if I've learned R by reading StackOverflow answers!

David

On Wed, Mar 6, 2024 at 6:06 AM Pierre-Wallet @.***> wrote:

Hi @rossfarrugia https://github.com/rossfarrugia @zdz2101 https://github.com/zdz2101, @CDC-DNPAO https://github.com/CDC-DNPAO,

I have a few questions 😀.

Regarding the metadata: I noticed that the structure of WHO metadata is different from CDC. In WHO, there is one dataset per sex. I would then keep this format and recreate the internal metadata as they are from the WHO website. Is it ok to proceed this way? I will propose a naming convention similar to Zelos.

Regarding the technical aspect: I noticed Zelos created the programs that build the dataset in inst folder. It means that the end user will have access to it when installing the package. Is it something that we want or should we store these programs in a data-raw folder which is not accessible to the end user but only to the developer?

Thanks for your help!

— Reply to this email directly, view it on GitHub https://github.com/pharmaverse/admiralpeds/issues/17#issuecomment-1980621268, or unsubscribe https://github.com/notifications/unsubscribe-auth/AY2XGPVOZJDG3SFYEQLHWETYW32ETAVCNFSM6AAAAABEAE7FWGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBQGYZDCMRWHA . You are receiving this because you were mentioned.Message ID: @.***>

Pierre-Wallet commented 8 months ago

Hi David,

my point was to discuss the access to the program that builds the reference data, not the data itself. Of course the reference data will be available.

I am in favor of keeping all files separated to reflect exactly what is contained in the .xls files on the WHO site. It is easier for any end-user to compare the initial source to what we store in the package. Plus these reference dataset are the unique source of truth to start all the upcoming derivations. Combining them increases the risk of making a handling mistake and the overall maintenance. This is only my opinion and based on my current knowledge of this project.

Pierre

CDC-DNPAO commented 8 months ago

OK - whatever you think is best is fine with me

david

On Wed, Mar 6, 2024 at 5:36 PM Pierre-Wallet @.***> wrote:

Hi David,

my point was to discuss the access to the program that builds the reference data, not the data itself. Of course the reference data will be available.

I am in favor of keeping all files separated to reflect exactly what is contained in the .xls files on the WHO site. It is easier for any end-user to compare the initial source to what we store in the package. Plus these reference dataset are the unique source of truth to start all the upcoming derivations. Combining them increases the risk of making a handling mistake and the overall maintenance. This is only my opinion and based on my current knowledge of this project.

— Reply to this email directly, view it on GitHub https://github.com/pharmaverse/admiralpeds/issues/17#issuecomment-1981974163, or unsubscribe https://github.com/notifications/unsubscribe-auth/AY2XGPUAQ5F5GV3KZWBME7LYW6K7ZAVCNFSM6AAAAABEAE7FWGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBRHE3TIMJWGM . You are receiving this because you were mentioned.Message ID: @.***>

rossfarrugia commented 8 months ago

I am in favor of keeping all files separated to reflect exactly what is contained in the .xls files on the WHO site.

Makes sense to me Pierre! In the rare chance they ever change then it would be easier to maintain as you say.

Pierre-Wallet commented 8 months ago

Just adding a helpful package for completing this issue: datapasta

Thanks @ngiangre for this useful package! I knew it but it was the 1st time I used it, so thanks for this opportunity.

One thing that did not work well is the comma handling. Indeed the decimal separator is the comma in the source files I read. And {datapasta} did not handle it well. I had to do a "find and replace" to replace the commas with points within the whole file and then it worked well. Do you know by any chance if there is a more direct operation? Thanks, Pierre.

ngiangre commented 8 months ago

Glad that pkg was somewhat helpful! I have not used it myself - I just saw it circulating on fosstodon :)

If you find a better approach let us know! What might be hard is knowing when those files update. Maybe use the {fs} or {httr2} pkgs to see if the files updated (e.g. time stamp)? Might be helpful so that we can subsequently update the data in our package. If anything, maybe include a test for this - we would then have to update our pkg accordingly if those files were ever updated. Does this make sense?

Cheers, Nick Giangreco, PhD nickg.bio

On Thu, Mar 7, 2024 at 7:25 AM Pierre-Wallet @.***> wrote:

Just adding a helpful package for completing this issue: datapasta https://milesmcbain.github.io/datapasta/

Thanks @ngiangre https://github.com/ngiangre for this useful package! I knew it but it was the 1st time I used it, so thanks for this opportunity.

One thing that did not work well is the comma handling. Indeed the decimal separator is the comma in the source files I read. And {datapasta} did not handle it well. I had to do a "find and replace" to replace the commas with points in all the file and then it worked well. Do you know by any chance if there is a more direct operation? Thanks, Pierre.

— Reply to this email directly, view it on GitHub https://github.com/pharmaverse/admiralpeds/issues/17#issuecomment-1983402400, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3TDSY23CQ5BOHMBBFBK5LYXBMDFAVCNFSM6AAAAABEAE7FWGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOBTGQYDENBQGA . You are receiving this because you were mentioned.Message ID: @.***>

Pierre-Wallet commented 8 months ago

@zdz2101 , @rossfarrugia

having an issue with my PR that I do not really understand. The roxygens comments seem to be aligned with .Rd files. Any help is parreciated. Thanks in advance.

Below is the issue from the pipeline:

D man/WHO_bmi_for_age_boys.Rd D man/WHO_bmi_for_age_girls.Rd ?? man/who_bmi_for_age_boys.Rd ?? man/who_bmi_for_age_girls.Rd 🙈 Manuals are not up-to-date with roxygen comments! 🔀 The following differences were noted:

💻 Please rerun the following command on your workstation and push your changes

roxygen2::roxygenize('.', roclets = c('rd', 'collate', 'namespace'))

ℹ roxygen2 version that was used in this workflow: ‘7.3.1’ 🙏 Please ensure that the 'RoxygenNote' field in the DESCRIPTION file matches this version Error: Process completed with exit code 1.

zdz2101 commented 8 months ago

@Pierre-Wallet just run devtools::document() , commit and push the renaming of the .Rd files, the two you have pushed rn is capital WHO and the roxygenation turns it into lowercase

Pierre-Wallet commented 8 months ago

@zdz2101 , already done before , it does not change anything.

I updated the name of all the metadatasets and I do not understand why there is a specific issue with these 2 only.

zdz2101 commented 8 months ago

weird, I pushed a commit for you, looks like its going through now, looks ready to review soon, just tag me once you think you're done

Pierre-Wallet commented 8 months ago

I do not really get what is happening because I did not do anything. And now everything is in green! THanks for your help.

possible that there is a latency between the push and the pipeline run?

zdz2101 commented 8 months ago

I do not really get what is happening because I did not do anything. And now everything is in green! THanks for your help.

possible that there is a latency between the push and the pipeline run?

Pipelines generally will cancel out (if they were running) and immediately get reran after every push. The two rd files on github were still capitalized, so you might've had two uncommitted changes, I just pushed the renaming for you and we're all good now

rossfarrugia commented 8 months ago

As discussed today @Pierre-Wallet will also add the head circumference metadata as part of this PR.

pharmaverse / admiralpeds