Makefile - Githubissues

msleckman commented 2 years ago

Added the _targets.R file.

Couple Questions related to the _targets.R file:

What script should I include in source(code.R)? In my ModelPlotting_ms.R i run source('data_extraction_ms.R'), a script that actually has one of the target function below (1rst function). See Comment
I added targets to define global variables. Unsure if this is syntactically correct. See Comment
Assuming, the first arg of tar_targets is the target file (i.e. what is returned by the given function), how should outputs files like figure1.png be written? In the plot function example, the first arg is figure_1_png. My function doesn't not return this variable name , how do I adjust? (same goes for model_diagnostic_text_txt)
Unsure if theformat arg should be set to = 'file' for all the functions I've defined as targets. Can you confirm?

Other:

I've updated the 2 R scripts - data_extraction_ms.R and ModelPlotting_ms.R - by simplifying/homogenizing the global variable names. May be worth double checking these 2 scripts still run properly. this way once this PR is merged, the main will be up to date.

github-learning-lab[bot] commented 2 years ago

So, what does this do for you? Well, if you had this _targets.R file in your current working directory (which you do), and you had defined the download_data(), process_data(), plot_data(), and generate_diagnostics() functions in code.R (which you haven't), it would look like this when you ran tar_make() from the targets package:

targets run image

Now - adding on to this PR by pushing up additional commits - add and modify code, folders, and this makefile so you can build your 3_visualize/out/figure_1.png with targets::tar_make() (BTW, if you haven't run into this syntax, {package_name}::{function_name} allows you to run a function without library({package_name})). Build on your existing folders, functions, and code from the previous sections by continuing with the phases 1_fetch, 2_process, and 3_visualize. Comment in the pull request with a screenshot of the build message, like we have in the message above :point_up:. Assign your course contact to review your PR, and we may ask for a few changes before merging.

There is a file in your repo called .gitignore. Add a line with _targets so that git doesn't track the contents of the targets directory. In addition, make sure that the contents of 1_fetch/out/* will be ignored by git because you don't want to store changes to your downloaded .csv files (you may have already added a more generic */out/* to your .gitignore that handles all out/ folders in this repo).

jesse-ross commented 2 years ago

Hi Margaux,

This is a good start. You have your targets set up in a reasonable-looking way. There are some problems, though, which will keep the pipeline from running to completion.

Here are some pointers to help you along, based on your initial questions.

In your _targets.R, you will need to source any files which have functions you use later in the _targets.R. Otherwise targets won't be able to run them. So you would need to source the file in your 2_process/src/ directory in addition to the one you are sourcing from 1_fetch/src/. **However, note that these files should never execute any code when sourced!** They should just be containers for the functions you're using, and you'll want to remove any code which isn't in the function bodies. In the targets paradigm, all of the code execution needs to happen throughtar_target()` calls.
The way that you are defining your global variables is a little bit off. For instance you have tar_target(file_name, file_name <- 'model_RMSEs.csv'). You should just say tar_target(file_name, "model_RMSes.csv"). Think of tar_target() as a form of assignment. A target has a value, so it's almost like a variable.
The first argument to tar_target() is the name of the target you are defining. Think of it like a variable name. Once a target has been run, the value of the target is accessible to later targets (just like with the targets like file_name which you're defining). This gets a little bit confusing in the case of format = "file". In your example case, your figure_1_png target is calling the plot_model_data() function. It is the job of plot_model_data() to write out the png file, and its return value should be the filename it used. This should always be the case: the return value of a function called in a format = "file" target should always be the filename of the data.
The functions you defined all create files, and so format = "file" is appropriate for them. Just make sure they all return the filename they wrote to. In the case of the four simpler targets you wrote to establish global variables, you correctly used the default format instead of format = "file".
When renaming variables, just make sure you get them changed in every place! When this pipeline runs to completion and creates the necessary files, you'll know you've got it right.

Once you've taken another pass at this, please see if you can get it running with targets::tar_make(). If it runs, please add a comment with a screenshot of the output, like in the bot's comment above. If you run into a dead end and can't get it running to completion, please submit the screenshot anyways so that I can help you debug! Thanks and have fun 😃

msleckman commented 2 years ago

Hi jesse! Super clear and I am making these adjustements now.

I' ve removed the conditional file.exists() check in the ModelPlotting_ms.R script as it is code that is outside the function. Maybe at a later point I can re-include.
What happens in the case where a function may return both a dataframe & a file. That is what is happening with my prep_model_data() function, which returns eval_data and model_summary_results.csv in 2_process/out . and generally, is it possible to set up a target for a function that returns both a file and a dataframe?

jesse-ross commented 2 years ago

Great! Regarding your question in the second bullet point, since the function is writing the eval_data variable to the file.path(output_data_location, 'model_summary_results.csv') location, you can just return the value of file.path(output_data_location, 'model_summary_results.csv') instead of returning eval_data.

You would then need to adjust plot_model_data so that it can take the filename of the CSV and read it back in.

(It is a bit inelegant to be re-reading a CSV instead of just passing the data directly as you were doing before, but for now we're just getting practice with targets of format = "file". Later on we'll work with targets which pass data in a more direct way.)

msleckman commented 2 years ago

That is a bit inelegant, but ok. How about just writing another function that writes model_summary_results.csv and keeping the prep_model_data as is, returning the df eval_data. Is it alright if I add another function? (In the previous issue it's written to split it all into 4 functions).

msleckman commented 2 years ago

Made some changes to the script and to the _targets.R file so that the makefile can runs.

Successful run below: @jesse-ross

msleckman commented 2 years ago

pipeline runs. Merging now.

github-learning-lab[bot] commented 2 years ago

msleckman / ds-pipelines-targets-1

Makefile #7

When you are done poking around, check out the next issue.