loosolab / TOBIAS_snakemake

Snakemake pipeline for running TOBIAS analysis
MIT License

CreateNetwork clarifications --- totally confused #16

Open c2b2pss opened 3 months ago

c2b2pss commented 3 months ago

Hi,

The instructions for running CreateNetwork are truly murky, because there is no folder called "annotated_tfbs" in my output, nor is there an example file. After a lot of head-scratching this is what I have come up with; the "mapping_togene" file was the easiest part.

I ran the snakemake pipeline and I am attaching screenshots of my folders from the output folder. One is the whole folder, and the other is the subfolder "TFBS". And the subfolders of each TF....

  1. The CreateNetwork instructions talk about a .bed file. I presume this is actually a folder of .bed files??? This is what I gather from the example command and from looking in the folder it references:

$ TOBIAS CreateNetwork --TFBS test_data/annotated_tfbs/* --origin test_data/motif2gene_mapping.txt

So the annotated_tfbs/* implies the folder of bed files as in the test data. Is that right? But there is no such folder in my snakemake outputs.

  1. So in the snakemake output this folder is not present; however, there is the TFBS folder (picture: "TOBIAS TFBS folder") that has an individual folder for each TF.

  2. Within each of these folders is a separate bed folder. Within this bed folder of each TF, is a "_bound.bed" file for each of my conditions.

  3. So this is the file I need to use. Can you please confirm that?

  4. But then I have to copy each of these "_bound.bed" files to a separate folder and per condition.

  5. Then I can run "CreateNetwork".

[Attached screenshots: "TOBIAS Output", "TOBIAS TFBS Bed folder", "TOBIAS TFBS folder"]

Thanks!

c2b2pss commented 3 months ago

UPDATE: What I outline above does not work, since it wants only one input file. I cannot input multiple files or a collection of bed files.....can I?

After some more juggling, I ran the CreateNetwork with a single "_bound.bed" file for one TF. This gives me a 1:1 relationship between the TF and other TFs. I got the following files

  1. adjacency.txt
  2. ANDR_HUMAN.H11MO.1.A_path_edges.txt
  3. ANDR_HUMAN.H11MO.1.A_paths.txt
  4. edges.txt

A. If I look at file 2, it is just a 1:1 mapping between ANDR (androgen receptor) and another TF. Since this is a "bound" file to begin with, am I looking at TFs that ANDR regulates, or at TFs whose gene has an ANDR binding site? Since this is a 1:1 file there seems to be no reason to make a figure out of it.

B. If I look at file 4, edges.txt, it has all the peak information etc., and the last column is the TFs listed in file 2. The columns have no meaningful labels though, it just says Sites 1, Sites 2 etc.

C. How do I create a deeper network? Can I integrate more than 1 TF bed file?

Thanks!

D. What program do you suggest I visualize all this with?

mohobein commented 3 months ago

Hey,

you are right: to create proper networks, you will want to use all _bound.bed files from your BINDetect analysis. The pipeline annotates each TFBS with the gene IDs of its associated genes (i.e. the genes regulated by the TF bound there), and these annotations are used to create the networks. Plain BINDetect output does not include this annotation, which is why the test data has a separate directory with prepared bed files. However, as I said, the pipeline already handles the annotation and modifies your _bound.bed files with it, so you can use them for CreateNetwork without additional steps. To pass them all as input at once (given that you have the default BINDetect output structure), use --TFBS TFBS/*/beds/*_<condition>_bound.bed to get the network for one of your conditions, replacing <condition> with the name of the condition you want to analyze. The * is a wildcard that matches any sequence of characters, so every file whose path fits the rest of the pattern is included.
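To make the wildcard behaviour concrete, here is a small sketch that builds a throwaway directory mimicking the TFBS/<TF>/beds/ layout and shows which files the pattern picks up. The TF names (TF_A, TF_B) and condition names (condA, condB) are made up for illustration, and TOBIAS itself is not called.

```shell
# Throwaway directory that mimics the TFBS/<TF>/beds/ layout.
# TF_A, TF_B, condA and condB are hypothetical names, not real pipeline output.
workdir=$(mktemp -d)
for tf in TF_A TF_B; do
  mkdir -p "$workdir/TFBS/$tf/beds"
  for cond in condA condB; do
    : > "$workdir/TFBS/$tf/beds/${tf}_${cond}_bound.bed"
  done
done

cd "$workdir"

# Unquoted pattern: the shell replaces it with one argument per matching file.
set -- TFBS/*/beds/*_condA_bound.bed
n_matched=$#
first_match=$1
printf '%s\n' "$@"
echo "matched $n_matched files for condA"
```

Only the condA files of both TFs are listed; the condB files are left out because they do not fit the fixed parts of the pattern.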

One thing you have to keep in mind is that your --origin file has to match your organism, and the TF names in it have to match the naming scheme used in your analysis. If you are working with human data and the TF names are identical to those in your output, you can use the motif2gene_mapping.txt from the test data. However, if you want to use a specific annotation version or have a different organism, you will need to create one yourself. This issue describes how to create the file from scratch. You can take the genes.gtf file for your organism and keep all lines where gene_name matches one of the TF motif names from your JASPAR file. Each remaining line contains both the gene_name and the gene_id, which you can then use to fill the two columns of your --origin file. But as you were already able to run CreateNetwork with just one file, this seems to work for you already.
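As a rough sketch of that GTF filtering step, the following uses awk on a tiny made-up GTF. The file names (genes.gtf, tf_names.txt, origin.txt), the gene IDs and the TF names are all hypothetical. The column order (TF name, then gene ID) is an assumption as well, so compare the result against the motif2gene_mapping.txt from the test data before using it.

```shell
workdir=$(mktemp -d)

# Tiny made-up GTF (tab-separated); a real one comes from your annotation source.
printf 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_name "TFA";\n'    >  "$workdir/genes.gtf"
printf 'chr1\tsrc\tgene\t2000\t2900\t.\t-\t.\tgene_id "GENE0002"; gene_name "OTHER";\n' >> "$workdir/genes.gtf"
printf 'chr2\tsrc\tgene\t50\t700\t.\t+\t.\tgene_id "GENE0003"; gene_name "TFB";\n'      >> "$workdir/genes.gtf"

# Hypothetical list of TF names, one per line, spelled as in your motif file.
printf 'TFA\nTFB\n' > "$workdir/tf_names.txt"

# First file: remember the wanted names. Second file: for every "gene" line,
# pull gene_id and gene_name out of the attribute column and print the pair.
awk -F'\t' '
  NR == FNR { wanted[$1] = 1; next }
  $3 == "gene" {
    id = ""; name = ""
    if (match($9, /gene_id "[^"]+"/))   id   = substr($9, RSTART + 9,  RLENGTH - 10)
    if (match($9, /gene_name "[^"]+"/)) name = substr($9, RSTART + 11, RLENGTH - 12)
    if (name in wanted) print name "\t" id
  }
' "$workdir/tf_names.txt" "$workdir/genes.gtf" > "$workdir/origin.txt"

cat "$workdir/origin.txt"
```

This only works when the motif names are spelled exactly like the gene_name values in the GTF. If they differ, you need an extra lookup step to translate one into the other.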

The ANDR_HUMAN.H11MO.1.A_path_edges.txt output file you are talking about shows all genes that had ANDR bound at their promoter, not merely genes with a binding site present (though if you wanted to check for that too, you could use the _all.bed files instead of the _bound.bed files). Therefore, ANDR should influence the expression of each of these genes in some way.

File 4 is a global overview containing the information for all TFBS with an annotated target gene, plus their target gene. So instead of just showing which TF influences the expression of which gene, it also lists the specific TFBS and the other information contained in the input bed file. If you have more input files, it will merge the information of every TF into this file, while the TF-specific files only contain the paths where the specific TF is involved.

The output files can be used as input for any software that creates network visualizations. There are a few; one I can recommend is Cytoscape, but you can use a tool of your choice. You can either use the global edges.txt file containing all factors (which also contains more information), or the individual networks for each TF (<TF>_path_edges.txt). Within Cytoscape, you can load the network using "File" -> "Import" -> "Network from file" and then set the appropriate Source and Target nodes. You can then use Cytoscape to visualize the network with different layouts.

I hope this clears up your questions!

c2b2pss commented 3 months ago

Hi,

Fantastic! Thanks for your clear explanation and your patience. I really appreciate it.

I will have to try the multiple bed files and report back to you.

Thanks!

c2b2pss commented 3 months ago

UPDATE: So if I point to the folder with the multiple beds and try to input multiple beds at once like this:

(TOBIAS_ENV) premsubramaniam@pop-os:~/TOBIAS_snakemake/createnetwork_output$ TOBIAS CreateNetwork --TFBS '/home/user/TOBIAS_snakemake/LNCAP_ALL/Copied_Beds/PR_beds/*_PR_bound.bed' --origin '/home/user/TOBIAS_snakemake/HOCOMOCOv11_full_annotation_HUMAN_mono.tsv'

I get the error:

ERROR: File "/home/user/TOBIAS_snakemake/LNCAP_ALL/Copied_Beds/PR_beds/*_PR_bound.bed" does not exists

Stumped here....It does not want to take multiple files??

The default structure --TFBS TFBS/*/beds/*_<condition>_bound.bed also gives the same error.

Regards!

c2b2pss commented 3 months ago

UPDATE 2: So after a lot of commands I conclude it needs an explicit reference to a SINGLE input .bed file. Anything else does not work.

mohobein commented 3 months ago

I can assure you, you can and usually should use multiple input bed files. The problem with your attempt was the use of quotation marks ' in conjunction with the wildcard *. If you remove the quotation marks from your --TFBS argument, it should work, provided everything else is correct. Quotation marks tell the shell to pass the string on literally, exactly as written, so the * is never expanded into the list of matching file names and CreateNetwork looks for a single file whose name contains a literal *. Since you are using the wildcard to select multiple files with slightly different names, that is not what you want.
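You can see the difference without running TOBIAS at all. The sketch below creates two empty bed files with made-up names in a temporary directory and counts how many arguments a command receives with and without quotes. count_args is a helper defined only for this demonstration. Double quotes are used here so that $workdir is still filled in; like the single quotes in your command, they stop the * from being expanded.

```shell
workdir=$(mktemp -d)
: > "$workdir/TF_A_PR_bound.bed"   # hypothetical file names
: > "$workdir/TF_B_PR_bound.bed"

# Demonstration helper: prints how many arguments it was given.
count_args() { echo "$#"; }

# Wildcard outside the quotes: the shell expands it to every matching file.
unquoted=$(count_args "$workdir"/*_PR_bound.bed)

# Wildcard inside the quotes: passed on as one literal string, '*' included.
quoted=$(count_args "$workdir/*_PR_bound.bed")

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"

# The literal path with a '*' in it is not an existing file,
# which is what the "does not exists" error is complaining about.
if [ -e "$workdir/*_PR_bound.bed" ]; then literal_exists=yes; else literal_exists=no; fi
echo "literal path exists: $literal_exists"
```

The unquoted call receives both files as separate arguments, while the quoted call receives a single argument that does not name any file.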

c2b2pss commented 3 months ago

You're really good! :-)

Yup, after removing the '....' it reads the whole folder of beds. Of course, for all the HOCOMOCO TFs the output exceeds 200GB... phew!

c2b2pss commented 3 months ago

I ran CreateNetwork on a batch of "_bound.bed" files and got

adjacency.txt, edges.txt, and a paths file and a path_edges file for each TF for which I provided a bed file. The edges.txt output file is attached.

What are the column names? Is this the file to use for looking at the whole network? Which column is the source and which is the target? The last column contains the TF names matched to my input mapping file. However, in column 4 ("Sites 3") the original HOCOMOCO names are still retained. If you could please clarify which file to use to build the network and how the connections go, it would be very helpful! (Attachment: edges.txt)