hw07 ready for grading - GitHub Issues

Dear colleagues, dear STAT545/547 team,

I tried something new, and it didn't go down exactly as planned, but I still hope you will enjoy reading my homework :)

Homework 7 repository HERE

Homework 7 summary RMD file here
Homework 7 produced tables files here
Homework 7 plots files here
Homework 7 stats files here

Don't hesitate to let me know if you have any questions!

Thank you for your time! Regards, My Linh Thibodeau

Dear @CassKon, dear @mlawre01,

Welcome to my repository! Here are a few helpful points for you to know about my homework 7:

The genomic data used is a set of publicly available reference mutational signatures and cancer genomic mutation samples (ftp://ftp.sanger.ac.uk/pub/cancer/AlexandrovEtAl/). Please note that in order to run my code, you would need to first download the original dataset to your local computer. If you want to skip that step, it's totally fine, but you will have to rely on my output files to assess my homework.
Given the size of genomic data files in general, pushing some of the raw and formatted data files to GitHub was not feasible, so some files are stored locally (again, if you want to test my code, you need to download the raw data first).
I have not included the plots directly in the Rmd files, for two reasons: given the number of plots, they were quite distracting when reading the code itself, and because my images are in pdf format (not png), the Rmd seemed to automatically display a link instead of the image itself.

There is a very ugly "stitched pdf file" of the "summary Rmd file" (see README for explanation), but it might be a bit challenging to follow my homework with it ...

I highly recommend you stick with the original and cleaner files (hyperlinks in message above) or that you download the original data to run my code, because the pdf does not do justice to the code at all!!

If you have any questions, please don't hesitate to let me know. Thank you so much for your time and I assure you that your feedback is greatly appreciated and I welcome all suggestions.

Warm regards,

My Linh

Hi My Linh,

Great job on the assignment - wow! I think you could have used a few more figures though … haha. Just kidding, you definitely met that quota!
For the summary stats where you talk about using a loop, I ran into the same problem. My solution was to write a function and then use ddply to execute the function on parts of the data frame. I’m sure that there are other options out there though!
I found the Rmd hard to follow, as it didn’t have a lot of the actual code in it and there was a lot of back and forth between links (downloading the code did help here!). Really great comments throughout the file though!
Really great extension of the homework exercise to another dataset, and it sounds like you will be able to use some of this code again, which is awesome!
I am not sure that I have any ideas to contribute towards your data cleanup issue :(
I enjoyed how detailed the introduction and reflection were for your assignment. They helped give context to your assignment.
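
For the loop-versus-function point above, here is a minimal sketch of the split-apply-combine pattern, written in base R so that it runs without extra packages. The data frame, its column names (`cancer_type`, `count`) and the helper `summarise_counts` are all hypothetical. With plyr installed, `plyr::ddply(muts, "cancer_type", summarise_counts)` should express the same idea in one call.

```r
# Toy data: invented values, only to illustrate the pattern.
muts <- data.frame(
  cancer_type = c("ALL", "ALL", "aml", "aml", "aml"),
  count = c(10, 30, 6, 15, 42),
  stringsAsFactors = FALSE
)

# Hypothetical summary function: takes one group's rows, returns a one-row data frame.
summarise_counts <- function(df) {
  data.frame(
    n_samples = nrow(df),
    mean_count = mean(df$count),
    max_count = max(df$count)
  )
}

# Split by group, apply the function to each piece, then bind the results.
pieces <- split(muts, muts$cancer_type)
summary_list <- lapply(names(pieces), function(g) {
  data.frame(cancer_type = g, summarise_counts(pieces[[g]]), stringsAsFactors = FALSE)
})
summary_stats <- do.call(rbind, summary_list)
print(summary_stats)
```

Keeping the per-group logic in one function means a new statistic is added in a single place instead of inside a loop body.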

Cheers,

Cassandra

Hey,

Overall, very impressive assignment; great job tackling and applying what you’ve learned to a new dataset! In general, it seems like you put a ton of work into your assignment and learned a lot which is great.

Some notes:

You made really great figures and it would have been nice if your summary contained the figures
Because of this, I found myself jumping around a lot, from plots, back to the summary and to the R scripts to put it all together
You included a lot of helpful commentary in your introduction, summary file and comments, and I think breaking your assignment down into smaller chunks is a great idea, especially for your future self
not to mention, the commentary made it easier to follow along
I liked that you included links to references and resources throughout your summary file, this will definitely help jog your memory in the future if you do re-use any of your code!
Very comprehensive progress report/reflections; I wish I had more to add about your code, but I think you’ve delved into it deeper than I have
It would have been nice for ease of reviewing if your R scripts were numbered or something just so it would be more obvious without flipping back and forth from the summary file
I guess my biggest comment would be the organization/flow could be better; I think because your pipeline was more complicated and you have more inputs/output it is more difficult but even just having a summary file with all the plots would do the trick
I am in a lab that generates and analyzes sequencing data (although I haven’t had a chance to do this yet!) and I can definitely see how automating a pipeline would be really helpful for converting FASTQ files to other useful data types!

Updated comment (I was really tired when I wrote this):

I had a similar problem when I was making my final report file with .pdf images - it didn't seem to matter if I tried to add them locally or if I pushed them up into GitHub and added the link - they wouldn't render
I ended up saving my plots as .png and this resolved the issue...it's such a simple fix but it took me a while to figure it out

Again, great job on homework assignment 7 & good luck with the rest of the course!

Three or more scripts, an Rmd, and a Makefile: Yes
Starts by downloading data, ends with Rmd: Yes
The output of each is the modified input of the previous step: No (see comments)
Includes some analysis and at least one figure: Yes
Makefile includes all scripts and Rmd with correct dependencies: No (see comments)
Makefile runs: No (see comments)
Bonus: Non-gapminder dataset.
Reflection: Yes

Comments:

Step 4 (mut_sig_tables.R) uses output of step 2 (read_clean_genome_text_files.R), while Step 6 (mut_sig_plot.R) uses output of Steps 2-4.
mut_sig_tables.R does not use ALL_sig.tsv or aml_sig.tsv (instead uses ALL_mut and aml_mut) - looks like a typo.
mut_sig_plot.R lists 4 dependencies, but reads 10 files. Even if it runs fine from a clean slate, if one of the 6 unlisted files is somehow deleted, the pipeline will then crash when you try to run mut_sig_plot.R.
You write 16 separate tsv files in mut_sig_tables.R. Do you really need to access all of those? You don't have to save every intermediate step, only the significant ones.
Given that pdfs don't embed in the report well, why not save as png files?
Your clean statement works fine on my machine. Might be a directory issue.
Speaking of directory issues, your directories in read_clean_genome_text_files.R are much too specific – it only functions if STAT545-HW-thibodeau-mylinh is directly on the user's Desktop, not likely to be the case for people who clone your repo and possibly not even the case on your own machine in a year or two. Likewise with the cancer specific mutational signature files – which your Makefile should include rules to download, rather than assuming that the user already has them.
I'm confused by what read_clean_genome_text_files.R actually does in terms of cleaning – it looks like you just read the files in and re-save them. Are you basically just moving them to the STAT545 directory?

Your mark will be distributed later. If you would like more feedback, please feel free to message me on Slack.

New info: apparently Windows machines use del instead of rm, so that might be your problem with clean.
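
One way to sidestep the rm/del difference, sketched under assumptions, is to let R do the deleting, since `unlink()` behaves the same on Windows, macOS and Linux. The helper name `clean_outputs` and the file patterns are hypothetical. A Makefile clean rule could call a script like this through Rscript instead of calling rm directly.

```r
# Hypothetical helper: delete generated files matching the given patterns in one directory.
# The directory has no default on purpose, so nothing is deleted by accident.
clean_outputs <- function(dir, patterns = c("\\.tsv$", "\\.pdf$")) {
  targets <- unlist(lapply(patterns, function(p) {
    list.files(dir, pattern = p, full.names = TRUE)
  }))
  # unlink() returns 0 on success and 1 on failure.
  status <- unlink(targets)
  invisible(list(removed = targets, status = status))
}

# Demonstration in a throwaway directory so nothing real is deleted.
demo_dir <- file.path(tempdir(), "clean_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.tsv", "b.pdf", "keep.R")))
result <- clean_outputs(demo_dir)
remaining <- list.files(demo_dir)
print(remaining)
```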

Dear @ksedivyhaley,

Thank you so much for your thorough review. I greatly appreciate it and I am grateful for the specific feedback you are providing, because I really want to improve my ability to manage my research data! I am trying to optimize my course learning in order to directly translate knowledge to immediate applications in my field. I am sorry about the lack of clarity in my homework; the Canadian Cancer Research Conference kept me quite busy until November 7/2017 and I understand it must have been a bit dry for you to go through my homework, sorry again for that.

If you wouldn't mind, I would love to have a few additional details on the points you raised, because those are actually problems I struggle with and I don't know how to tackle them.

Comment: Step 4 (mut_sig_tables.R) uses output of step 2 (read_clean_genome_text_files.R), while Step 6 (mut_sig_plot.R) uses output of Steps 2-4.

Perhaps I didn't understand the assignment instructions correctly, so better late than never: Was I supposed to make each step completely and solely dependent on the previous step?
My thought process was basically like doing meal prep in a kitchen: I cut all the veggies separately on one side (steps 2 + 3), prepare the sauce (step 4) and spices (step 5), and then I can make the pasta dish come together (step 6). But again, I think I might not have understood correctly (also, I am not very good in the kitchen).

The breakdown of my plan was the following, and if you could point me to the specific problematic piece, I would really appreciate it :)

Step 3 (mut_sig_reorder.R) uses output from step 1 (curl to get mut_sig_raw.txt) and step 2 (read_clean_genome_text_files.R)
Step 4 (mut_sig_tables.R) uses output from step 2 (read_clean_genome_text_files.R) to transform the format of tables and "stack" the data
Step 5 (mut_sig_stat.R) uses output from step 4 (Rscript mut_sig_tables.R) to get some weighted values. Here, I was trying to get a representative value of how much a signature contributes as a proportion of mutations rather than as an absolute number, because I wanted to decipher whether a signature was high because the overall mutation burden (mutation count) was high, or because it was the main mutational process driving the mutation count.
Step 6 (mut_sig_plot.R) uses output from steps 2, 3 and 4.
Step 7 (rmarkdown::render("summary_file.Rmd")) requires all of the above.
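
The weighting idea in step 5 can be sketched as follows: divide each signature's mutation count by the sample's total, so that a sample with a high overall burden and a sample dominated by one process become distinguishable. The matrix below is invented toy data and the signature and sample names are placeholders. It is not taken from the Alexandrov dataset and is not the actual content of mut_sig_stat.R.

```r
# Rows = signatures, columns = samples; counts are made up for illustration.
# matrix() fills column by column, so each pair below is one sample.
counts <- matrix(
  c(900, 100,   # s1: high burden, mostly sig_A
     90,  10,   # s2: low burden, same mix as s1
     20,  80),  # s3: low burden, mostly sig_B
  nrow = 2,
  dimnames = list(c("sig_A", "sig_B"), c("s1", "s2", "s3"))
)

# Overall mutation burden per sample (absolute numbers).
burden <- colSums(counts)

# Contribution of each signature as a proportion of its sample's total.
proportions <- prop.table(counts, margin = 2)

print(burden)
print(round(proportions, 2))
```

In this toy example s1 and s2 differ tenfold in absolute counts but share the same signature proportions, which is the distinction the absolute numbers alone would hide.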

Comment: mut_sig_tables.R does not use ALL_sig.tsv or aml_sig.tsv (instead uses ALL_mut and aml_mut) - looks like a typo.

Again, so sorry for that cryptic one, but ALL refers to Acute Lymphoid Leukemia (ALL) while all refers to "all of the elements" and aml refers to acute myeloid leukemia. Therefore, all (in the sense of the totality) of the files are used by mut_sig_tables.R
- ALL_mut.tsv -> read to variable ALL_mut -> gather step to stack -> write to ALL_mut_gather.tsv
- aml_mut.tsv -> read to variable aml_mut -> gather step to stack -> write to aml_mut_gather.tsv
- and so on with each cancer type
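
The read, stack and write sequence listed above can be sketched like this. The thread refers to a gather step (as in `tidyr::gather`); the code below uses a base-R stand-in so that it runs without extra packages. The table contents, the sample column names and the output column names (`sample`, `count`) are invented for illustration, and the output goes to a temporary file rather than to the real ALL_mut_gather.tsv.

```r
# Toy wide table: one row per mutation type, one column per sample (made-up counts).
wide <- data.frame(
  Mutation.Type = c("A[C>A]A", "A[C>A]C"),
  sample_1 = c(3, 1),
  sample_2 = c(0, 5),
  stringsAsFactors = FALSE
)

# Every column except the identifier is stacked into key/value pairs.
sample_cols <- setdiff(names(wide), "Mutation.Type")
long <- data.frame(
  Mutation.Type = rep(wide$Mutation.Type, times = length(sample_cols)),
  sample = rep(sample_cols, each = nrow(wide)),
  count = unlist(wide[sample_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Write the stacked table as tsv (temporary path used here for the demo).
out_file <- tempfile(fileext = ".tsv")
write.table(long, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
print(long)
```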

I remember one of the steps had a problem with ALL_mut.tsv because unlike the others, it only had one dataset, but I thought it ended up still working in the end.

Comment: mut_sig_plot.R lists 4 dependencies, but reads 10 files. Even if it runs fine from a clean slate, if one of the 6 unlisted files is somehow deleted, the pipeline will then crash when you try to run mut_sig_plot.R.

Thank you, I will make sure to list all the dependencies, it seems like good practice. Quick question: if, for example, the previous steps did not produce the files, would listing the dependencies give me more details about where the problem is in the pipeline? (as opposed to just crashing without a clear answer)
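
On the dependency question: when a listed prerequisite is missing and no rule can build it, make stops and names the missing file, which is more informative than a crash inside the script. A complementary check can live in the R script itself. The sketch below is one assumption about how that could look; `require_inputs` is a hypothetical helper and the file names are used only as examples.

```r
# Hypothetical helper: stop early with a message listing every missing input.
require_inputs <- function(paths) {
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    stop("Missing input file(s): ", paste(not_found, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Demo in a temporary directory: one input exists, the other does not.
demo_dir <- file.path(tempdir(), "inputs_demo")
dir.create(demo_dir, showWarnings = FALSE)
present <- file.path(demo_dir, "ALL_mut.tsv")
absent <- file.path(demo_dir, "aml_mut.tsv")
file.create(present)

message_text <- tryCatch(
  {
    require_inputs(c(present, absent))
    "all inputs found"
  },
  error = function(e) conditionMessage(e)
)
print(message_text)
```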

Comment: You write 16 separate tsv files in mut_sig_tables.R. Do you really need to access all of those? You don't have to save every intermediate step, but the significant ones.

I had to save the files because I tried to do something too ambitious and there were a ton of problems and I couldn't figure out where they were coming from. The intermediate files always help me to do so.

Comment: Given that pdfs don't embed in the report well, why not save as png files?

That one is actually funny because the reason is quite simple: I hadn't thought about it ;) Very good point, thank you!

Comment: Your clean statement works fine on my machine. Might be a directory issue.

I am using a Mac, which apparently has some issues that require semi-urgent attention. However, given my limited coding skills, it is hard for me to know whether a problem comes from the code (the most frequent reason) or from my computer. Surprisingly, this time it might be the computer's fault and not my code.

Comment: Speaking of directory issues, your directories in read_clean_genome_text_files.R are much too specific – it only functions if STAT545-HW-thibodeau-mylinh is directly on the user's Desktop, not likely to be the case for people who clone your repo and possibly not even the case on your own machine in a year or two. Likewise with the cancer specific mutational signature files – which your Makefile should include rules to download, rather than assuming that the user already has them.

One thing of note is that the files are too large to be kept on GitHub. I only learned about gitignore after the homework, so at the time of submitting I didn't know that you could put files in a git repository and tell git to "ignore" the large ones.
But even if I put them inside the repository and tell gitignore to ignore them, should I tell users who clone my repository that they need to download the data themselves and place it in the same location I indicated? What is the best approach in such cases?
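
One common pattern for data that is too large to commit, sketched under assumptions, is to keep the data directory listed in .gitignore and have the pipeline fetch each file only when it is absent, using a path relative to the repository root rather than to the Desktop. The helper `fetch_if_missing`, the `data` directory and the file name `example.txt` are hypothetical; only the FTP directory comes from the thread. The `downloader` argument exists so the demo can run without network access.

```r
# Hypothetical helper: download a file only if it is not already on disk.
fetch_if_missing <- function(url, dest, downloader = utils::download.file) {
  if (file.exists(dest)) {
    return(invisible("cached"))
  }
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  downloader(url, dest)
  invisible("downloaded")
}

# Stand-in downloader for the demo: writes a placeholder instead of touching the network.
fake_downloader <- function(url, destfile) {
  writeLines(paste("placeholder for", url), destfile)
}

base_url <- "ftp://ftp.sanger.ac.uk/pub/cancer/AlexandrovEtAl/"
# In a real pipeline this would be a relative path such as file.path("data", ...).
dest <- file.path(tempdir(), "fetch_demo", "data", "example.txt")
unlink(dest)

first <- fetch_if_missing(paste0(base_url, "example.txt"), dest, fake_downloader)
second <- fetch_if_missing(paste0(base_url, "example.txt"), dest, fake_downloader)
print(c(first, second))
```

With this shape, someone who clones the repository does not have to place files by hand: the first run downloads, later runs reuse the local copy.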

Comment: I'm confused by what read_clean_genome_text_files.R actually does in terms of cleaning – it looks like you just read the files in and re-save them. Are you basically just moving them to the STAT545 directory?

That is a very good question, and if I had taken the time to annotate my code better, I would have a clearer explanation. I only vaguely remember that R had trouble manipulating the text files because the header contained spaces, so I used read_clean_genome_text_files.R to convert the txt files into tsv files, which replaced the problematic white spaces with dots ("Mutation Type" becomes "Mutation.Type"). If you have experienced or know of the same issue, please do tell me the correct way to do this, I beg you ;)
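
On the header question: the dots most likely come from `read.table()` and its relatives, whose `check.names = TRUE` default passes column names through `make.names()`. Below is a minimal sketch, using an invented two-column file, of reading with `check.names = FALSE` so that "Mutation Type" survives. Columns with spaces are then accessed with backticks or `[[ ]]`. Whether this matches what read_clean_genome_text_files.R ran into is an assumption.

```r
# Write a tiny tab-separated file whose header contains spaces (invented content).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Mutation Type\tSample 1", "A[C>A]A\t3", "A[C>A]C\t1"), tmp)

# Default behaviour: names are made syntactically valid, so spaces become dots.
default_read <- read.delim(tmp)

# check.names = FALSE keeps the header exactly as written in the file.
kept_read <- read.delim(tmp, check.names = FALSE)

print(names(default_read))
print(names(kept_read))

# A column whose name contains a space is accessed with [[ ]] or backticks.
print(kept_read[["Mutation Type"]])
```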

Well, something I can promise you is that I will annotate my code better, which I think I succeeded in doing in homework 8. Looking back at how cryptic homework 7 is to read encouraged me to make a tutorial for homework 8.

Thank you again so much for your help,

Warm regards, My Linh Thibodeau

mylinhthibodeau / STAT545-HW-thibodeau-mylinh

hw07 ready for grading #7
