steno-aarhus / kimo

(ABANDONED) UK Biobank Project

create data in UKB RAP #1

Closed arborzhang closed 11 months ago

arborzhang commented 1 year ago

I opened my project (kimo) on the UKB RAP and followed the instructions at https://steno-aarhus.github.io/ukbAid/using-rap.html and https://github.com/steno-aarhus/kimo/blob/main/data-raw/create-data.R, but I got the error below:

readr::read_csv(here::here("data-raw/rap-variables.csv")) %>% dplyr::pull(rap_variable_name) %>% ukbAid::create_csv_from_database()

Error in system(table_exporter_command, intern = TRUE) : cannot popen 'dx run app-table-exporter --brief --wait -y -idataset_or_cohort_or_dashboard=record-GXZ2k40JbxZx7xYGF66y45Yq -ifield_names='Verbal interview duration | Instance 0' -ifield_names='Verbal interview duration | Instance 1' -ifield_names='Verbal interview duration | Instance 2' -ifield_names='Verbal interview duration | Instance 3' -ifield_names='Biometrics duration | Instance 0' -ifield_names='Biometrics duration | Instance 1' -ifield_names='Biometrics duration | Instance 2' -ifield_names='Biometrics duration | Instance 3' -ifield_names='Sample collection duration | Instance 0' -ifield_names='Sample collection duration | Instance 1' -ifield_names='Sample collection duration | Instance 2' -ifield_names='Sample collection duration | Instance 3' -ifield_names='Conclusion duration | Instance 0' -ifield_names='Conclusion duration | Instance 1' -ifield_names='Conclusion duration | Instance 2' -ifield_names='Conclusion duration | Instance 3' -ifield_names='Heel ultrasound m

lwjohnst86 commented 1 year ago

Yea, I was debugging the issue last night, and I think it's an issue Daniel and I encountered earlier: if a variable name has special characters (comma, slash, etc.), it affects how the command works on the command line (not the R Console). I will have to find some way to escape them or otherwise make it work, but that might take some time. In the meantime, you can work on the protocol and on clarifying the variables you are interested in. I still think you have waaayyy too many variables that I guarantee you won't end up using, but we can see how it goes :stuck_out_tongue:
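
One possible way to escape such names is to quote each value before it is pasted into the shell command. This is only a minimal sketch using base R's `shQuote()`. The helper name `build_field_args()` is hypothetical, and this is not necessarily how ukbAid builds its `dx run` command:

```r
# Hypothetical helper (not part of ukbAid): quote every field name so the
# shell sees it as one argument, even with spaces, slashes, or parentheses.
build_field_args <- function(field_names) {
  # shQuote(type = "sh") wraps the value in single quotes and escapes any
  # single quote that appears inside the value.
  quoted <- shQuote(field_names, type = "sh")
  paste0("-ifield_names=", quoted, collapse = " ")
}

fields <- c(
  "Pulse rate (during blood-pressure measurement) | Instance 0",
  "Non-cancer illness year/age first occurred | Instance 0"
)
cat(build_field_args(fields), "\n")
```

Quoting only protects the value from the shell. It does not change what the Table Exporter app itself accepts as a field name.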

arborzhang commented 1 year ago

I deleted some variables and tried the process on the UKB RAP again, but still got errors.

I am afraid there are several points I might have misunderstood, and I would appreciate it if you could clarify them. Please see my questions below.

# Keep only the necessary variables for RAP -------------------------------
# the necessary variables are kept in the `data-raw/project-variables.csv`
  1. What is meant by "after running this function", and what changes should I see after running it? In other words, what is the difference between project-variables and rap-variables, and what is the purpose of magrittr?
library(magrittr)
# After the variables have been properly selected in the `data-raw/project-variables.csv`
# file, run this function so that only the selected variables are kept in the
# `data-raw/rap-variables.csv` file. This file has the exact variable names used
# by RAP that we need in order to create the project-specific dataset. After
# running this function, review the changes in Git and add and commit the changed
# files into the history. 

# Uncomment if you messed up and need to start over.
#ukbAid::project_variables %>%
#     readr::write_csv(here::here("data-raw/project-variables.csv"))

# Update if necessary.
ukbAid::rap_variables %>%
     readr::write_csv(here::here("data-raw/rap-variables.csv"))

ukbAid::subset_rap_variables(instances = 0:9)
  2. One really strange thing is that after I ran the above function, all the variables from the original list seem to have come back, even though I had deleted them.
  3. In my project variable list (data-raw/project-variables.csv) there are around 400 variables, and I am pretty sure the variables shown in the error message are not among them. But after running this function, rap-variables.csv became 1.1 MB and contains a lot of variables that were not selected. That might be why the error message mentions variables like 'Verbal interview duration'.
lwjohnst86 commented 1 year ago

I updated your comments.

  1. I'm not sure how to answer the question about why we are using magrittr. It is a package that provides the pipe %>%. The rap-variables.csv file gives the names of the variables needed for extracting from the RAP. They are slightly different from the ones in project-variables.csv; for instance, they have _i1 or _i2 at the end, which indicates the collection visit. Basically, you edit project-variables.csv and use ukbAid::subset_rap_variables(instances = 0:9) to update the RAP variables from the project variables list.
  2. The important function to run is ukbAid::subset_rap_variables(instances = 0:9). This is what takes the variables you selected (and drops the ones you deleted) in project-variables.csv and updates rap-variables.csv so that RAP knows which variables to select from its own database.
  3. Maybe this question is answered by the above? If you select one variable (one row) in project-variables.csv, then rap-variables.csv will most likely have multiple rows for that one variable, one per instance, since that variable actually has up to 9 other variables for each timepoint (e.g. p21353_i0, p21353_i1, p21353_i2, etc).
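
To illustrate the one-row-to-many-rows expansion described above, here is a small sketch in base R. The function `expand_instances()` is hypothetical and only mimics the naming pattern. It is not the ukbAid implementation, and some variables (for example p31, Sex) have no instance suffix at all:

```r
# Hypothetical illustration: one project variable id becomes one RAP field
# per instance (collection visit), following the "<id>_i<instance>" pattern.
expand_instances <- function(id, instances = 0:3) {
  paste0(id, "_i", instances)
}

expand_instances("p21353", instances = 0:2)
# [1] "p21353_i0" "p21353_i1" "p21353_i2"
```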
arborzhang commented 1 year ago

It totally makes sense. Thank you so much, this gives me a better understanding of the different steps 👍

But I still got the error below. I also tried taking out age at death, and then the error became "unrecognized arguments: age".

dx: error: unrecognized arguments: age at death
Error in system(table_exporter_command, intern = TRUE) : error in running command

arborzhang commented 1 year ago

I took out more variables, but still got the error below. It seems related to "(" in the variable names, but maybe also to the destination folder setup?

readr::read_csv(here::here("data-raw/rap-variables.csv")) %>%
+   dplyr::pull(rap_variable_name) %>%
+   ukbAid::create_csv_from_database()
Rows: 1028 Columns: 3
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): field_id, rap_variable_name, id

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
ℹ Started extracting the variables and converting to CSV.
! This function runs for quite a while, at least 5 minutes or more. Please be patient to let it finish.
sh: 1: Syntax error: "(" unexpected
dxpy.exceptions.ResourceNotFound: The specified folder could not be found in project-G9zB9B8JqFgQ6Pjx5kPGzzvQ, code 404. Request Time=1697813300.5873065, Request ID=1697813300794-198333
The destination folder does not exist
✔ Finished saving to CSV. Check "/mnt/project/users/jiezhang" or the project folder on the RAP to see that it was created.
[1] NA NA

Warning message: In system(table_exporter_command, intern = TRUE) : running command 'dx run app-table-exporter --brief --wait -y -idataset_or_cohort_or_dashboard=record-GXZ2k40JbxZx7xYGF66y45Yq -ifield_names='Sex' -ifield_names='Date of birth' -ifield_names='Year of birth' -ifield_names='Waist circumference | Instance 0' -ifield_names='Hip circumference | Instance 0' -ifield_names='Standing height | Instance 0' -ifield_names='Month of birth' -ifield_names='UK Biobank assessment centre | Instance 0' -ifield_names='Non-cancer illness year/age first occurred | Instance 0' -ifield_names='Pulse rate (during blood-pressure measurement) | Instance 0' -ifield_names='Birth weight known | Instance 0' -ifield_names='Job code at visit - entered | Instance 0' -ifield_names='Number of self-reported non-cancer illnesses | Instance 0' -ifield_names='Number of treatments/medications taken | Instance 0' -ifield_names='Townsend deprivation index at recruitment' -ifield_names='Reason lost to follow-up' -ifield_names='Date lost to follow-up' -ifield_names='Date of consenti [... truncated]

lwjohnst86 commented 1 year ago

Yea, I also know that ' or " can cause problems too.... I think it will require some coding on my end to fix the problem they have on their end :angry:

lwjohnst86 commented 11 months ago

Damn, I have no idea what is going on here... I'll keep digging, hopefully I'll have a solution by Monday :grimacing:

arborzhang commented 11 months ago

Thank you so much, I really appreciate your kind help. I also tried only two variables (age and weight), but still got an error. Best, Jie

lwjohnst86 commented 11 months ago

I think I fixed it, but not sure. Could you test it out with your fuller list of variables?

arborzhang commented 11 months ago

I only tried with two variables: age and weight. This is the error I got:

ukbAid::subset_rap_variables(instances = 0:9)
ℹ Updated the "data-raw/rap-variables.csv" based on the selected project variables.

# A tibble: 5 × 3
  field_id  rap_variable_name   id
1 p31       Sex                 p31
2 p21002_i0 Weight | Instance 0 p21002
3 p21002_i1 Weight | Instance 1 p21002
4 p21002_i2 Weight | Instance 2 p21002
5 p21002_i3 Weight | Instance 3 p21002

readr::read_csv(here::here("data-raw/rap-variables.csv")) %>%
+   dplyr::pull(rap_variable_name) %>%
+   ukbAid::create_csv_from_database()
Rows: 5 Columns: 3
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): field_id, rap_variable_name, id

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
ℹ Started extracting the variables and converting to CSV.
! This function runs for quite a while, at least 5 minutes or more. Please be patient to let it finish.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dxpy/scripts/dx.py", line 2858, in run_one
    dxexecution.wait_on_done()
  File "/usr/local/lib/python3.8/dist-packages/dxpy/bindings/dxjob.py", line 283, in wait_on_done
    raise DXJobFailureError(err_msg)
dxpy.exceptions.DXJobFailureError: Job has failed because of AppError: Invalid characters found in field names at position(s) 2, 3, 4, 5 of the input.
dxpy.utils.resolver.ResolutionError: Unable to resolve "data-jiezhang-kimo.csv" to a data object or folder name in '/'
✔ Finished saving to CSV. Check "/mnt/project/users/jiezhang" or the project folder on the RAP to see that it was created.
[1] "job-Gb3yFV0JqFgbzfbkx3PgZVKB" NA
Warning message:
In system(table_exporter_command, intern = TRUE) :
  running command 'dx run app-table-exporter --brief --wait -y -idataset_or_cohort_or_dashboard=record-GXZ2k40JbxZx7xYGF66y45Yq -ifield_names="Sex" -ifield_names="Weight | Instance 0" -ifield_names="Weight | Instance 1" -ifield_names="Weight | Instance 2" -ifield_names="Weight | Instance 3" -ioutput=data-jiezhang-kimo' had status 1
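
The "Invalid characters found in field names at position(s) 2, 3, 4, 5" message lines up with the four entries that contain spaces and "|". Below is a rough check that reproduces those positions, assuming (this is inferred from the error, not confirmed) that only letters, digits, and underscores are accepted. The function `find_invalid_positions()` is a hypothetical helper:

```r
# The five rows from the tibble above, rebuilt by hand for illustration.
rap_vars <- data.frame(
  field_id = c("p31", "p21002_i0", "p21002_i1", "p21002_i2", "p21002_i3"),
  rap_variable_name = c(
    "Sex", "Weight | Instance 0", "Weight | Instance 1",
    "Weight | Instance 2", "Weight | Instance 3"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper. ASSUMPTION: valid names match [A-Za-z0-9_]+ only.
find_invalid_positions <- function(x) {
  which(!grepl("^[A-Za-z0-9_]+$", x))
}

find_invalid_positions(rap_vars$rap_variable_name)  # 2 3 4 5
find_invalid_positions(rap_vars$field_id)           # integer(0)
```

Under that assumption, the values in the field_id column would pass the check while the values in rap_variable_name would not.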

lwjohnst86 commented 11 months ago

We did some updates today and tried to fix some issues. We got it to run properly, but we are now dealing with an issue where the UK Biobank variables are different from the ones in RAP (e.g. the Townsend index, which has two variables, one of which, p189, is restricted and we can't access it, so it gives an error). We'll try to find a programmatic way to deal with this, but in the meantime, you have to manually look through the UK Biobank documentation and find out whether each variable is restricted or not.
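
Until there is a programmatic fix, one stopgap could be to keep a hand-written list of restricted ids and drop them before exporting. This is a minimal sketch with made-up example rows. The `restricted_ids` vector is hypothetical and would have to be filled in manually from the UK Biobank documentation:

```r
# Hypothetical, manually maintained list of restricted ids.
# p189 is the one mentioned above; any others would need to be looked up.
restricted_ids <- c("p189")

# Made-up example rows in the same three-column layout as rap-variables.csv.
rap_vars <- data.frame(
  field_id = c("p31", "p189", "p21002_i0"),
  rap_variable_name = c(
    "Sex", "Townsend deprivation index at recruitment", "Weight | Instance 0"
  ),
  id = c("p31", "p189", "p21002"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose id is not in the restricted list.
allowed <- rap_vars[!rap_vars$id %in% restricted_ids, ]
allowed$field_id
```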

arborzhang commented 11 months ago

Thanks for the update. I tried again and could not even download the kimo project at the first step (after opening the UKB RAP and installing the ukbAid package). Are there any changes to the process I should be aware of?

── Downloading your GitHub project ───────────────────────────────────────────────
ℹ Lastly, we need to download your project. Please answer this question.
ℹ Defaulting to 'https' Git protocol
Error in `gh::gh()`:
! GitHub API error (404): Not Found
✖ URL not found:
ℹ Read more at
Run `rlang::last_trace()` to see where the error occurred.
lwjohnst86 commented 11 months ago

I think this has been fixed :star_struck: