imfeldn / swiss_pheno

Creative Commons Attribution 4.0 International
1 stars 1 forks source link

Code review - the use of RData #1

Closed khufkens closed 1 year ago

khufkens commented 1 year ago

Hi @imfeldn,

Just some observations while doing code review for reproducibility. During all the procedures you heavily rely on RData files for storing results. I would advice you not to do so. Mainly the variables are declared within the RData itself, and which will invariably lead to bugs which are incredibly hard to trace.

An example:

load(file)

will restore a variable called "meteo_data" in your workspace, but there is no way to figure this out from either the filename or any assignment of the variable.

In short you can't do:

meteo_data <- load(file)

load therefore obscures things and might actually override existing variables unbeknownst to you, or create additional ones if you accidentally save more than what was needed. This lack of explicit declarations for variables often leads to bugs which are 1. not visible 2. consequentially very hard to trace, 3. can be a cause of grave errors because of it (as you might load data which is not what you think it is).

For speed I've split out the generation of the driver data from the model calibration see here: https://github.com/khufkens/swiss_pheno/blob/main/analysis/019_generate_model_drivers.R

This will save the prepared model drivers, from which the various subsets will be generated. This would allow you to only upload these data (what is required) and work from there on - this should speed things up and keep the data preparation separate from the true analysis (model calibration and summary stats). One can argue for putting this in data-raw or data.

khufkens commented 1 year ago

I will do a pull request of the changes if you like. Let me know what your take is on my notes above.

imfeldn commented 1 year ago

Dear Koen,

sorry, I haven't used github for a while. I only saw your reply now, since it went to my private e-mail account. That's changed now.

Yes, I'm aware of the issues with load(). We can work using rds instead. It's also fine for me to flatten the data as you do in 019 and adjust the selection accordingly in 020.

So, you can do a pull request and add your changes. The adjustments didn't fully work for me yet though, but, I guess you're aware of this and it was more exemplary.

I'm not sure about the following lines in the two scripts 020...alternating and 020...late_period though, if I test it on a flattened stn_list object. It's setting all FALSE selection in the training selection to TRUE again? Also, if we only work on the long series, it might not be needed, because in the preselection script (012_stnsel_pheno_longseries.R) it is already tested how long a series is?

select training years

selection <- selection |> dplyr::group_by(site) |> mutate( train = ifelse( length(which(train)) > 1, TRUE, FALSE ) )

select test years

selection <- selection |> dplyr::group_by(site) |> mutate( train = ifelse( length(which(test)) > 1, TRUE, FALSE ) )

Cheers,

Noemi

Am 2023-08-15 10:33, schrieb Koen Hufkens:

Hi @imfeldn [1],

Just some observations while doing code review for reproducibility. During all the procedures you heavily rely on RData files for storing results. I would advice you not to do so. Mainly the variables are declared within the RData itself, and which will invariably lead to bugs which are incredibly hard to trace.

An example:

load(file)

will restore a variable called "meteo_data" in your workspace, but there is no way to figure this out from either the filename or any assignment of the variable.

In short you can't do:

meteo_data <- load(file)

load therefore obscures things and might actually override existing variables unbeknownst to you, or create additional ones if you accidentally save more than what was needed. This lack of explicit declarations for variables often leads to bugs which are 1. not visible 2. consequentially very hard to trace, 3. can be a cause of grave errors because of it (as you might load data which is not what you think it is).

For speed I've split out the generation of the driver data from the model calibration see here: https://github.com/khufkens/swiss_pheno/blob/main/analysis/019_generate_model_drivers.R

This will save the prepared model drivers, from which the various subsets will be generated. This would allow you to only upload these data (what is required) and work from there on - this should speed things up and keep the data preparation separate from the true analysis (model calibration and summary stats). One can argue for putting this in data-raw or data.

-- Reply to this email directly, view it on GitHub [2], or unsubscribe [3]. You are receiving this because you were mentioned.Message ID: @.***>

Links:

[1] https://github.com/imfeldn [2] https://github.com/imfeldn/swiss_pheno/issues/1 [3] https://github.com/notifications/unsubscribe-auth/AH67PVCMLNIIHFEMEO3A72LXVMX55ANCNFSM6AAAAAA3QYV3F4

khufkens commented 1 year ago

You got a bug in my code there! Good eye.

I've refactored some of the workflow with shuffling things to data-raw to generate complete driver files and storing the source files in data-raw.

This would leave relatively small files in the 02_pheno... folder. These are read and subset by the script (in which you found the bug). I've also generated another script for the different training scenarios we talked about (taking either early or later dates for calibration). I didn't want to go to the full extent of creating a full function out of it - but it should be separate rather than relying on commenting out certain parameters - to be fully reproducible (transparent).

Don't worry about not responding to the minute. Point of github in many ways is that you can work asynchronous on things.

khufkens commented 1 year ago

Bug mentioned above fixed by https://github.com/khufkens/swiss_pheno/commit/9cd4e3db8f2b132060510877c423535b1b5fa4c8

@imfeldn I'll keep the routine to check for ultra-short years in there anyway, sticking close to your original code. If you feel confident that this has been checked before in generating the pheno_meta files then we can remove this later on still.

imfeldn commented 1 year ago

Yes, it's okay to keep the routine for very short series.

khufkens commented 1 year ago

I think most of it is addressed now in pull request #2

Do an independent check however (via cloning my project), as I found bugs (late afternoon is not the best time to be fresh enough to code).

imfeldn commented 1 year ago

Yes, I found bugs too. I'll check the second pull request then and get back to you.

khufkens commented 1 year ago

Yes, do clone my project (under my github account) first and spot check if it works, if not either put a comment in this issue or the pull request itself. Then I can fix it and it will be included in the pull request. I just don't have access to all the original files to test it all.

imfeldn commented 1 year ago

I cloned your project, checked it, and made small changes to most of the scripts. There were some details not working, eg. variables missing, the naming wasn't changed everywhere, etc. Since it's already changed in my code, I don't know how to proceed best? I can send it to you, then you can do a new pull request and it remains your refactoring? I can add it to your fork...


Von: Koen Hufkens @.> Gesendet: Dienstag, 15. August 2023 16:16 An: imfeldn/swiss_pheno @.> Cc: Imfeld, Noemi (GIUB) @.>; Mention @.> Betreff: Re: [imfeldn/swiss_pheno] Code review - the use of RData (Issue #1)

Sie erhalten nicht oft eine E-Mail von @.*** Erfahren Sie, warum dies wichtig isthttps://aka.ms/LearnAboutSenderIdentification

Yes, do clone my project (under my github account) first and spot check if it works, if not either put a comment in this issue or the pull request itself. Then I can fix it and it will be included in the pull request. I just don't have access to all the original files to test it all.

— Reply to this email directly, view it on GitHubhttps://github.com/imfeldn/swiss_pheno/issues/1#issuecomment-1679009134, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH67PVBHWOP4MM6HNBJ2MNDXVOAEPANCNFSM6AAAAAA3QYV3F4. You are receiving this because you were mentioned.Message ID: @.***>

khufkens commented 1 year ago

Probably easiest to do the pull request and make the changes again on your side. Just copy over the files you changed on my repo. I can sync my fork with your repository afterwards.

On Tue, Aug 15, 2023, 18:14 nimfeld @.***> wrote:

I cloned your project, checked it, and made small changes to most of the scripts. There were some details not working, eg. variables missing, the naming wasn't changed everywhere, etc. Since it's already changed in my code, I don't know how to proceed best? I can send it to you, then you can do a new pull request and it remains your refactoring? I can add it to your fork...


Von: Koen Hufkens @.> Gesendet: Dienstag, 15. August 2023 16:16 An: imfeldn/swiss_pheno @.> Cc: Imfeld, Noemi (GIUB) @.>; Mention @.> Betreff: Re: [imfeldn/swiss_pheno] Code review - the use of RData (Issue

1)

Sie erhalten nicht oft eine E-Mail von @.*** Erfahren Sie, warum dies wichtig isthttps://aka.ms/LearnAboutSenderIdentification

Yes, do clone my project (under my github account) first and spot check if it works, if not either put a comment in this issue or the pull request itself. Then I can fix it and it will be included in the pull request. I just don't have access to all the original files to test it all.

— Reply to this email directly, view it on GitHub< https://github.com/imfeldn/swiss_pheno/issues/1#issuecomment-1679009134>, or unsubscribe< https://github.com/notifications/unsubscribe-auth/AH67PVBHWOP4MM6HNBJ2MNDXVOAEPANCNFSM6AAAAAA3QYV3F4>.

You are receiving this because you were mentioned.Message ID: @.***>

— Reply to this email directly, view it on GitHub https://github.com/imfeldn/swiss_pheno/issues/1#issuecomment-1679231530, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKKUEWXXLFYYDGYECOKV6DXVON6LANCNFSM6AAAAAA3QYV3F4 . You are receiving this because you authored the thread.Message ID: @.***>

khufkens commented 1 year ago

closed as per #2