Closed · adelosan521 closed this issue 1 year ago
Hi Niko,
Yes, have a look here: https://github.com/loosolab/TOBIAS/wiki. On the right side, each "tool overview" contains a test command to run. The test data can be downloaded as explained on this page: https://github.com/loosolab/TOBIAS/wiki/test-data . If they run without error, TOBIAS was correctly installed.
Are there test output files so that I can check whether I am reconstructing the analysis exactly?
Niko
Unfortunately not - they serve as a test of the installation. But if the tools run through, there is very little chance that something is wrong, since all necessary packages will have been loaded and shown to be functioning.
Do you think it might be possible to replicate generating the figures in the Nature Communications paper? I really want to figure out if the pipeline is actually working as expected.
Also, is it possible to use Genrich instead of MACS2 to call peaks?
> Do you think it might be possible to replicate generating the figures in the Nature Communications paper? I really want to figure out if the pipeline is actually working as expected.
Can you explain to me what you mean with "working as expected"? The figures in the paper can be replicated by using the exact same data (as described in the methods), but whether you can replicate that with other data, I cannot say. If the tools run on the test data, that shows that there is nothing wrong with the installation, which means the tool is "working as expected" on given input data.
> Also, is it possible to use Genrich instead of MACS2 to call peaks?
Yes, you can use any type of peak-caller, as long as the peaks are given in .bed-format.
Best Mette
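For illustration, a minimal check that a peak file follows BED conventions (tab-separated, at least chrom/start/end, 0-based half-open coordinates) could look like the sketch below. The file contents are made up and TOBIAS itself is not involved:

```python
# Minimal check that peak lines follow BED conventions:
# tab-separated, at least 3 columns, integer coordinates with start < end.
def validate_bed_lines(lines):
    peaks = []
    for ln in lines:
        ln = ln.rstrip("\n")
        if not ln or ln.startswith(("#", "track", "browser")):
            continue  # skip comments and optional header lines
        fields = ln.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        assert start < end, f"bad interval: {ln}"
        peaks.append((chrom, start, end))
    return peaks

# Hypothetical peak lines, e.g. from Genrich or MACS2 output
example = ["chr4\t100\t350\tpeak_1\t87", "chr4\t500\t720\tpeak_2\t42"]
print(validate_bed_lines(example))  # → [('chr4', 100, 350), ('chr4', 500, 720)]
```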
Hi,
I ran the test data through a pipeline and compared it to Supplementary Figure 2 and D of the TOBIAS paper. The results look quite different. I didn't use UROPA. Is it possible that the very different results are due to lack of UROPA?
Hi Mette,
Sorry to bother you. I added UROPA to the pipeline and ran the test data through a pipeline and compared it to Supplementary Figure 2 and D of the TOBIAS paper. The results look quite different. Would it be possible to discuss why this might be the case?
Hi Niko,
Is this the figure you mean?
The test data is only a small subset of reads from a Bcell-line (and only from chr4), so the results will not be comparable to the ones in the paper unfortunately. There is some more information about the source of the test data here if you would like. So you probably cannot replicate the results of the paper using the test data - but if you have any data yourself, you might try to compare it to that? For some TFs, you will need data from the full genome in order to see the aggregate footprints as seen in the figure above. So that is the reason for the difference.
Regarding UROPA - no, that does not have any influence on the footprint shapes. UROPA does not alter the peak .bed-file, but only adds extra information (annotation) to each peak. So the differences are not due to using/not using UROPA. I hope that helps you out.
Hi,
I also tried to download the embryo data that was featured in the TOBIAS paper and ran it through a pipeline. My results are also quite different from those reported in the paper. I am pasting my results following ATACorrect, FootprintScores, BINDetect and PlotAggregate for CTCF in 2C data. Am I doing something incorrectly?
To my knowledge, the only departure I had from the pipeline in your paper was that I used Genrich instead of MACS2 to call peaks.
Hi @adelosan521 ,
Ah I see - you are using `_footprints.bw` to visualize the aggregate signals. In the figure from the paper, we are using the `_uncorrected.bw` and `_corrected.bw` output from TOBIAS ATACorrect.
The difference is that ATACorrect corrects the Tn5 signals, whereas ScoreBigwig calculates the footprint scores from the Tn5 signals. So if you need to see the Tn5 cutting signal, you need to use the `_corrected.bw`.
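Conceptually, the correction step can be illustrated with a toy sketch. This is not the actual ATACorrect algorithm (which estimates Tn5 sequence bias from the reads themselves); here the "expected" track is just a flat, hypothetical background:

```python
# Toy illustration (NOT the real ATACorrect algorithm): corrected signal
# as observed Tn5 cuts minus an expected, bias-driven background.
def correct_track(observed, expected):
    # Scale expected to the same total as observed, so the
    # corrected track is centered on zero.
    scale = sum(observed) / sum(expected)
    return [o - e * scale for o, e in zip(observed, expected)]

observed = [4, 5, 1, 0, 1, 5, 4]   # a footprint: protected center, high flanks
expected = [2, 2, 2, 2, 2, 2, 2]   # flat, made-up background bias
corrected = correct_track(observed, expected)
print(corrected)  # dips below zero at the protected center
```

The dip below zero at the center is what the aggregate plots of `_corrected.bw` visualize; `_footprints.bw` from ScoreBigwig instead turns such dips into positive footprint scores.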
Hi, thanks for your enormous help. I did run the _corrected.bw to visualize the aggregate signals and am still getting different results for both DUX4 and CTCF across the different developmental stages. Maybe there is something wrong with my code?
Code is:
TOBIAS ATACorrect --bam ERR1775454_Aligned.out_sorted.bam --genome GRCh38.primary_assembly.genome.fa --peaks ERR1775454_ATAC1.narrowPeak
TOBIAS FootprintScores --signal ERR1775454_Aligned.out_sorted_corrected.bw --regions ERR1775454_ATAC1.narrowPeak --output ERR1775454_footprints.bw --cores 8
TOBIAS BINDetect --motifs JASPAR2022_CORE_non-redundant_pfms_jaspar.txt --signals ERR1775454_footprints.bw --genome GRCh38.primary_assembly.genome.fa --peaks ERR1775454_ATAC1.narrowPeak --outdir project --cond_names neuron --cores 8
TOBIAS PlotAggregate --TFBS /project/DRGX_MA1481.1/beds/DRGX_MA1481.1_neuron_bound.bed --signals ERR1775454_Aligned.out_sorted_corrected.bw --output DRGX_footprint_comparison_all.pdf --share_y both --plot_boundaries --signal-on-x
I am really grateful for your help. I'm trying to get TOBIAS to work exactly as it worked in your paper!
New figures using _corrected.bw:
Those already look better, that's great! Some things which might explain the differences:
- We pass a .bed-file of blacklisted regions via `--blacklist` to TOBIAS ATACorrect.
- We mapped reads with STAR using the parameters `--alignEndsType EndToEnd --outFilterMismatchNoverLmax 0.1 --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66 --outFilterMatchNmin 20 --alignIntronMax 1 --alignSJDBoverhangMin 999 --alignEndsProtrude 10 ConcordantPair --alignMatesGapMax 2000 --outMultimapperOrder Random --outFilterMultimapNmax 999 --outSAMmultNmax 1` as explained in methods, so differences in mapping parameters might also lead to different Tn5 signals.

Thanks again for spending time to help me.
I wanted to ask how did you generate the BED file with the blacklisted regions in your paper?
I used STAR. The STAR command I used was:
STAR --alignEndsType EndToEnd --outFilterMismatchNoverLmax 0.1 --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66 --outFilterMatchNmin 20 --alignIntronMax 1 --alignSJDBoverhangMin 999 --alignEndsProtrude 10 ConcordantPair --alignMatesGapMax 2000 --outMultimapperOrder Random --outFilterMultimapNmax 999 --outSAMmultNmax 1 --runThreadN 8 --genomeDir /project/embryo/ --readFilesIn base_1_val_1.fq.gz base_2_val2.fq.gz --outFileNamePrefix /project/embryo/${base} --twopassMode Basic --outSAMstrandField intronMotif --readFilesCommand zcat --outSAMtype BAM Unsorted
Later on I sorted the BAM files using samtools.
> I wanted to ask how did you generate the BED file with the blacklisted regions in your paper?
We used the blacklist files given by the paper referenced above, found here: https://github.com/Boyle-Lab/Blacklist/. I hope that solves the issue with the differing aggregate plots.
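Conceptually, excluding blacklisted regions amounts to an interval-overlap filter. TOBIAS applies the blacklist internally via `--blacklist`; the sketch below (with made-up coordinates) only illustrates the overlap logic:

```python
# Toy sketch of blacklist filtering: drop peaks overlapping any
# blacklisted region. Intervals are (chrom, start, end), half-open.
def overlaps(a, b):
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def filter_blacklist(peaks, blacklist):
    return [p for p in peaks if not any(overlaps(p, b) for b in blacklist)]

# Hypothetical peaks and one blacklisted region (e.g. from Boyle-Lab/Blacklist)
peaks = [("chr4", 100, 300), ("chr4", 1000, 1200), ("chr4", 5000, 5300)]
blacklist = [("chr4", 1100, 1500)]
print(filter_blacklist(peaks, blacklist))  # the middle peak is removed
```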
Hi,
Adding the blacklist, I think I might have been able to recreate some of the figures, although there are still some differences in my figures. Some are very similar, such as ESC CTCF shown below. However, my goal is to recreate your analyses exactly. Maybe it would be good to review exactly which files I am inputting into each of the commands?
Top image is from the TOBIAS paper and bottom figures are from my analyses.
Hi @adelosan521 ,
In the paper we are using the "_all.bed"-files and not specifically the "_bound.bed"-ones. This is also visible in the number of sites from the original analysis (CTCF = 85,000) vs. bound sites from your analysis (5,000). So maybe you can try to use the "_all.bed" file as input?
Besides that, there might be small differences due to a different peak-caller, different versions of mapping software, etc. So you might have to adjust your expectations of an exact match between the two analyses. But I think it already looks quite good :-)
I used all.bed and the results are below.
I still noticed that the number of sites is significantly lower than the ones reported in the figure. Do you know why this might be the case?
With regards to the number of sites, this might be due to a lower number of peaks from Genrich vs. MACS2. We also used another motif for DUX4 than the one from JASPAR (this is described in the methods of the paper), so this might also create small differences. But regardless, I think your results are close enough to count as similar to the paper.
Ok, I will try switching to MACS2 and seeing if I can get more similar results.
I also did the same 8C to ICM comparison in Figure 3, and the findings have some similarities, but there are still what seem to be significant differences. BINDetect seems to output an HTML file for the volcano plot. How did you generate the figure in your paper (so I can do a more direct comparison)?
This is a comparison of what you had (top) and I had (bottom) for some of the KLF transcription factors in the BINDetect PDF output.
Yours:
Mine:
Hi,
I think I have been able to generate very similar figures. The only thing different now seems to be the number of binding sites. I have switched to MACS2. I was wondering if there are ways to make the number of binding sites similar?
Hi @adelosan521 ,
The number of peaks depends on the parameters to MACS2. We used the TOBIAS snakemake pipeline to do automatic peak-calling (pipeline is here: https://github.molgen.mpg.de/loosolab/TOBIAS_snakemake). Here, we also merge all peaks from 2C, 4C, 8C, ICM, hESC to one merged peaks-file, so that might also explain the difference in peaks.
There is no right or wrong thing - it just depends on what settings you find best for peak-calling.
Hi,
I'm not able to use the Snakemake pipeline, but I was wondering what MACS2 parameters the snakemake pipeline uses? I used the parameters in the paper and merged the peak files using bedtools merge. Are there alternative settings I should use?
"Accessible regions were identified by peak calling for each sample separately using MACS2 (parameters: --nomodel --shift -100 --extsize 200 --broad). Peaks from each sample were merged to a set of union peaks across all conditions using bedtools merge."
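The union-peak step quoted from the methods (bedtools merge) boils down to sorting intervals and merging overlaps. A pure-Python sketch with toy coordinates and made-up condition names:

```python
# Sketch of the "bedtools merge" step: sort intervals, then merge
# overlapping ones into a union peak set across conditions.
def merge_peaks(peaks):
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous interval: extend it
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

cond_2c  = [("chr4", 100, 300), ("chr4", 900, 1100)]   # hypothetical 2C peaks
cond_icm = [("chr4", 250, 400), ("chr4", 2000, 2200)]  # hypothetical ICM peaks
print(merge_peaks(cond_2c + cond_icm))  # → 3 union peaks
```

Note that real bedtools merge requires position-sorted input; the sort here plays that role.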
Hi @adelosan521 ,
The MACS2 parameters are correct. I realized we included another condition (naive hESC) in the run, which ended up not being presented in the paper. So that might explain the higher number of peaks. Here is the full list of peaks if you need to compare: https://gist.github.com/msbentsen/e59c355b07f6c73cc34b9569403ef497
But all in all, I think you proved that it works. There will always be small differences due to choices in preprocessing, different versions etc., but as long as the results are comparable, I would say that counts as reproducible.
Hi,
Thanks again for being so attentive. I noticed in your list of peaks that there are indeed a lot of naive_hESC peaks.
Is the ATAC seq data for naive hESCs deposited online? I think I saw hESC datasets but I couldn't (unless I am mistaken) find any that were specifically annotated as "naive hESC".
If this is the case we might be able to close the gap in terms of number of peaks.
Hi, happy new year! The data for naive hESC is found under these accessions: GSM2721904 and GSM2721905
They can also be found by searching "naive" in the global overview of the experiment at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE101571
Hope this helps you out.
Hi, Did you merge the files from all the naive data (both Jaenisch and Austin) together?
> Hi, Did you merge the files from all the naive data (both Jaenisch and Austin) together?
Yes, the files were a merge of both - which is probably why it ended up not being included. So your results might actually end up being better by excluding these samples.
What do you mean by it not being included?
Also, I have tried to replicate Figure 3b and have been getting different results. I am pasting the figures below using two different peak files (all peaks, and just 8C/ICM). Is there a way to reduce the number of annotated data points? It's difficult to read. I do think there are some similarities (OTX enrichment in 8C, POU5F1B in ICM).
> What do you mean by it not being included?
Sorry, I was not clear. I mean that the naive hESC samples were included in the calculation, but ended up not being included in Figure 2. So if you want to improve on the results, you could exclude the naive hESC peaks.
Regarding the volcano plots, you seem to have a lot more motifs included in the calculation. We used a subset of motifs from JASPAR and HOCOMOCO as described in Methods -> "Processing of TF motifs" of the paper. Starting by limiting the motifs to human motifs will help a lot.
But when I look at the general motifs, I find that the trend in your figure is similar to the paper - so I will state again: I think you have proven that the tool works and that the results are reproducible 🙂
Thanks. I had a question. How did you create the reduced list of TFs to feed into TOBIAS? This is what is written in the methods.
"For each TOBIAS run, we created sets of expressed TFs as estimated from RNA-seq in the respective conditions. This amounted to 590 motifs for the dataset on human preimplantation stages, 464 motifs for the dataset on mouse preimplantation, and 459 for the DuxOE dataset."
> "For each TOBIAS run, we created sets of expressed TFs as estimated from RNA-seq in the respective conditions. This amounted to 590 motifs for the dataset on human preimplantation stages, 464 motifs for the dataset on mouse preimplantation, and 459 for the DuxOE dataset."
We did exactly as it says there: we took the RNA-seq results of the same experiment and used them to extract the list of expressed TFs (excluding TFs that are not expressed). But for limiting the number of TFs in the plot, it should be fine for you to just limit it to human motifs. That should clear the plot up significantly.
Is there a reason why you are trying to recreate the exact results of the paper? It's not that I don't want to help, but I think we are reaching a limit of how detailed such a comparison needs to be.
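The motif-subsetting step described above can be sketched as a simple filter over the JASPAR text format (assuming `>ID NAME` headers followed by four matrix rows; the matrix counts for the second motif below are made up):

```python
# Sketch: keep only JASPAR motif records whose TF name is in an allowed
# set (e.g. expressed TFs, or human TFs). A record is a ">ID NAME"
# header followed by its A/C/G/T count rows.
def filter_motifs(lines, keep_names):
    out, keep = [], False
    for ln in lines:
        if ln.startswith(">"):
            parts = ln.split()
            name = parts[1] if len(parts) > 1 else ""
            keep = name.upper() in keep_names  # decide per record
        if keep:
            out.append(ln)
    return out

motifs = [">MA0139.1 CTCF", "A [ 87 167 ]", "C [ 291 145 ]",
          "G [ 76 414 ]", "T [ 459 187 ]",
          ">MA0600.1 RFX2", "A [ 1 2 ]", "C [ 3 4 ]",
          "G [ 5 6 ]", "T [ 7 8 ]"]
print(filter_motifs(motifs, {"CTCF"}))  # keeps only the CTCF record
```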
I am about to carry out an analysis of a large dataset so I want to make sure I have the correct pipeline. I agree that it has been sufficiently reproduced.
One question I have. Is it possible to confirm that the "dip" in the signal in the aggregate TF footprint figures is indeed the right motif? For example in the image below, is it possible to confirm that where the dip is occurring is the CTCF motif?
> One question I have. Is it possible to confirm that the "dip" in the signal in the aggregate TF footprint figures is indeed the right motif? For example in the image below, is it possible to confirm that where the dip is occurring is the CTCF motif?
The aggregate is centered on all positions of the CTCF motif (41,201 occurrences), so we can confirm that there is a CTCF motif at every position. Is this what you mean? Whether the motif is "correct" might be a discussion of the motif source and how it was created - such confirmation would probably require experimental validation.
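The aggregate signal itself is just the per-base average of the corrected signal in a fixed window around every motif midpoint. A toy sketch with a made-up signal track and motif positions:

```python
# Toy sketch of an aggregate footprint plot: average the signal in a
# fixed window centered on every motif occurrence.
def aggregate(signal, centers, flank):
    window = 2 * flank + 1
    profile = [0.0] * window
    for c in centers:
        for i in range(window):
            profile[i] += signal[c - flank + i]
    return [v / len(centers) for v in profile]

signal = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]  # dips at bound motif sites
centers = [3, 8]                                # hypothetical motif midpoints
print(aggregate(signal, centers, flank=2))      # → [1.0, 1.0, 0.0, 1.0, 1.0]
```

The dip appears at position 0 of the profile precisely because every window is centered on a motif occurrence, which is why the aggregate can only tell you that the signal drops *at the motif*, not that the motif itself is biologically correct.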
Hi,
I am trying to validate whether I have TOBIAS working. Are there test output files / figures for the test data that can be used to validate whether my code is working?
Niko