Running the included R scripts to generate figures

mradz19 commented 4 years ago

Sorry I'm still new to this field.

Can anyone give me some advice on how to run the provided r scripts to generate the different figures?l I have completed the full master script so have run the initial run_DESeq_stats.R but I'd like to generate some figures as well. I've been trying to work it out but I'm not sure what to use as input to the script (control_table, experimental_table etc).

Also I would like to look at which bacteria are the most active within the experimental samples relative to the control, as well as what pathways may be up/downregulated within particular genus of interest. Is that possible to do?

For example, based on the RefSeq org results Streptococcus spp seem to be significantly more active in the experimental samples, can I also look at what functions/pathways are active in the Streptococcus spp alone rather than the whole dataset?

transcript commented 4 years ago

Hi Michael,

If you're looking to make combined stacked bar graphs of the results, you can use the make_combined_graphs.R script, which takes a working directory containing the relevant summary files as its input, using the -D option:

Rscript make_combined_graphs.R -D working_directory/ -O save.filename -N number_of_orgs_to_display -G graph_title

The working directory should contain the summary files produced at the end of Stage 5, after the analysis_counter.py script ran on them. These files will look similar to the sample files here: https://github.com/transcript/samsa2/tree/master/sample_files_paired-end/6_RefSeq_org_results/reduced

You can also generate a heatmap or PCA plot, using the make_DESeq_heatmap.R and make_DESeq_PCA.R scripts. These ones also take the working directory with the same files as input, but it's provided using the -I flag (which I probably ought to update in the combined_graphs script):

make_DESeq_heatmap.R -I working_directory/ -N number_of_rows_to_include -O output.name

Finally, you could generate plots of Subsystems results, using the Subsystems_pie_charts.R script. Again, it uses -I to get the input directory, but also uses -L to set the Subsystems level (1, 2, 3, or 4):

Rscript Subsystesm_pie_charts.R -I working_directory/ -L 2 -O output.name

The input directory should contain hierarchy files similar to what's seen here: https://github.com/transcript/samsa2/tree/master/sample_files_paired-end/7_Subsystems_results

I'll close this, but you can reopen this issue if you have any problems with any of these scripts. You can also import all of these tables from Step 5 into R as data.frames, and play around with a lot of other graphing tools in R for visualizing them.

transcript commented 4 years ago

Also adding, as I didn't bother to read your entire question before starting on an answer (whoops, my bad), you can pull out just Streptococcus species' results, if you wish. The method for doing this is to run the DIAMOND_results_filter.py script on the step 4 results and providing the input database, filtering by a chosen keyword, in this case Streptococcus.

python DIAMOND_results_filter.py -I step4_results.m8 -SO Streptococcus -D RefSeq_database.fa -O filtered_results.out

This will produce a step 4 results file that contains ONLY Streptococcus results. You can then put this file into DIAMOND_analysis_counter.py and get all the functions within the file, or get all the different species of Streptococcus.

At the moment, I don't have pathway information being evaluated - although if you have a suggestion for moving from RefSeq results to KEGG (or similar) reference pathways, I'm interested in potentially incorporating it.

Shicheng-Guo commented 4 years ago

Dear Sam,

1) I don't have case and control. I only want to use the stacked bar plot to show the diversity.

2) suppose I have case and control, in which step I can set the file name start as experiment** and control***

3) Can you talk a little more about working_directory? which directory is the best choice? In my analysis, I set the working directory as the OUT_DIR in the master_script.sh, but I like if will be helpful if you can show it clearly in the usage documents.

Thanks.

Shicheng

transcript commented 4 years ago

Hi Shicheng,

I've put up a new R script, "combined_graphs_unified.R", that will created a stacked bar plot for all TSV samples within a folder, without needing control and experimental samples. This should help.
Yes, this is correct.
The working directory for these figure-generating scripts should be the directory where the .tsv output files you wish to visualize are located. Looking at the sample_files_paired_end that download with this GitHub repo, for example, you would want to point at any of the step 6 or 7 folders.

If you wanted to visualize the RefSeq organism results, you'd point at the folder 6_RefSeq_org_results, for example.

The working_directory is just wherever your .tsv files are directed or end up.

Shicheng-Guo commented 4 years ago

Dear Sam,

Thank you so much.

2, I notice there is a blank between the case and control in the stacked bar plot with make_combined_graphs.R, I suggest removing this blank bar.

3, I read the make_combined_graphs.R and found if I change recursive = T, then I can set any parent directory of the result folder:

exp_files <- list.files(pattern = "tsv", full.names = T, recursive = T)

Thanks for the contribution to this field. the SAMSA2 is very useful.

Shicheng

transcript commented 4 years ago

Hi Shicheng,

The space is for aesthetic purposes and better delineate the difference between the two groups, but for anyone looking to remove it, it's easy to remove by taking out "newrow" from the rbind commands on lines 153/154 here: https://github.com/transcript/samsa2/blob/master/R_scripts/combined_graphs_control_vs_exp.R#L153

For your third comment, that's interesting - I've not played with changing the files search parameter for initial selection. I'll try that out!

malyjaguar commented 6 months ago

Hi Sam (and everyone), I have also a question. I am having a set of samples: control and experimental, both taken in different timepoints. So there are two changing variables, actually. Despite renaming them correctly, it somehow seems that the script is "reading" the names as if "day0-1, day0-2, day0-3" etc was more important than the control vs. experimental status. As a result, the two samples are always mixed and only the timescale is kept in the combined graphs. I believe there is an easy fix to this? I am not an R person yet, sorry, so reading through the syntax right now seems a task too gigantic to the little me. Thankies!!

transcript / samsa2

Running the included R scripts to generate figures #39