thbuerg / NeuralCVD

🫀 Code for "Neural network-based integration of polygenic and clinical information: Development and validation of a prediction model for 10 year risk of major adverse cardiac events in the UK Biobank cohort" 🫀
https://www.thelancet.com/journals/landig/article/PIIS2589-7500(21)00249-1/fulltext
GNU General Public License v3.0

Request for More Information on Hardcoded References in Preprocessing #1

Open ShaunFChen opened 1 year ago

ShaunFChen commented 1 year ago

Hello,

I have been exploring your NeuralCVD repository for our study and appreciate the considerable effort put into this tool. We believe it has the potential to make a significant contribution to our research. However, I have run into some difficulties during the preprocessing step.

The tool appears to have hardcoded references to files under:

path = "/data/analysis/ag-reils/steinfej/code/umbrella/pre/ukbb"
data_path = "/data/analysis/ag-reils/ag-reils-shared/cardioRS/data"

specifically in the subfolder named mapping. It also reads:

codes_gp_records = pd.read_feather(f"{data_path}/1_decoded/codes_gp_diagnoses_210119.feather").drop("level", axis=1)
codes_hospital_records = pd.read_feather(f"{data_path}/1_decoded/codes_hes_diagnoses_210120.feather")

These files are not part of the output of "0_decode_ukbb.ipynb".
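To make the gap concrete, the check below lists which of the referenced files are absent under a configurable base directory. It is a minimal sketch and not part of NeuralCVD: the environment variable `NEURALCVD_DATA_PATH` and the helper `find_missing` are hypothetical names, and only the two relative file paths are taken from the `read_feather` calls quoted above.

```python
import os
from pathlib import Path

# Relative paths taken from the read_feather calls quoted above.
REQUIRED_FILES = [
    "1_decoded/codes_gp_diagnoses_210119.feather",
    "1_decoded/codes_hes_diagnoses_210120.feather",
]


def find_missing(data_path, required=REQUIRED_FILES):
    """Return the relative paths in `required` that do not exist under `data_path`."""
    base = Path(data_path)
    return [rel for rel in required if not (base / rel).is_file()]


if __name__ == "__main__":
    # NEURALCVD_DATA_PATH is a hypothetical variable name standing in for the hardcoded path.
    data_path = os.environ.get("NEURALCVD_DATA_PATH", ".")
    for rel in find_missing(data_path):
        print(f"missing: {rel}")
```

Reading the base directory from the environment is only one option. A small config file would serve equally well, as long as the absolute path is kept out of the notebooks.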

While I understand that the UK Biobank codings are used in your tool, and I'm able to obtain those, there are other datasets which are not clear to me: atc, phecodes, snomed_cor_list, and athena_vocabulary_covid. I am having difficulty confirming the consistency of these data and their format with what the tool requires. In order to correctly run the tool and ensure the validity of our results, it's crucial that we have the same version and format of these specific datasets. Unfortunately, the current resources do not provide sufficient details to accurately reproduce this setup.
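If the files themselves cannot be shared, published checksums would at least let a user confirm that a copy obtained elsewhere is byte-identical to the one the tool was developed against. The following is a minimal sketch using only the Python standard library. It assumes SHA-256 digests of the original files could be made available, and `sha256_of` is a hypothetical helper, not part of NeuralCVD.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A matching digest confirms identical bytes only. It does not help when the same data has been re-exported in a different format, so column names and types would still need to be documented separately.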

I would therefore be grateful if you could share these referenced datasets directly, if that is possible and permitted.

However, if direct access is not feasible due to any constraints, could you please provide further information on how to obtain or generate these datasets? This ideally includes the specific versions of these datasets, the expected formats, and any preprocessing steps required for compatibility with NeuralCVD.

Your assistance will greatly aid us in overcoming this roadblock, and will facilitate the effective use of this tool in our research.

Thank you for your time and for your invaluable contributions to the field.

Best regards, Shaun

DhanushB2000 commented 2 months ago

Thank you for developing such useful code for exploring the UK Biobank data. I have also been working through this code to familiarise myself with UKBB data. It would be great if you could share the files used by the code, such as those mentioned in the previous comment. Your help would be much appreciated.

Regards, Dhanush