kundajelab / atac_dnase_pipelines

ATAC-seq and DNase-seq processing pipeline
BSD 3-Clause "New" or "Revised" License
161 stars 81 forks source link

Error in ATAqC (ValueError: all the input array dimensions except for the concatenation axis must match exactly) #50

Closed mdurante1 closed 7 years ago

mdurante1 commented 7 years ago

Thank you for making this pipeline available. I am able to get the pipeline running except when I include the ATAqC step I obtain an error in numpy. I have included the full log file below.

https://gist.github.com/mdurante1/3f51ef55b2fb0758949a1c68cacb6d28

Do you have any insight on how to fix this error?

            Traceback (most recent call last):
              File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1598, in <module>
                main()
              File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1460, in main
                ROADMAP_META, OUTPUT_PREFIX)
              File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 905, in compare_to_roadmap
                sample_mean0_col)
              File "/nethome/mdurante/miniconda3/envs/bds_atac/lib/python2.7/site-packages/scipy/stats/stats.py", line 3310, in spearmanr
                rs = np.corrcoef(ar, br, rowvar=axisout)
              File "/nethome/mdurante/miniconda3/envs/bds_atac/lib/python2.7/site-packages/numpy/lib/function_base.py", line 2145, in corrcoef
                c = cov(x, y, rowvar)
              File "/nethome/mdurante/miniconda3/envs/bds_atac/lib/python2.7/site-packages/numpy/lib/function_base.py", line 2024, in cov
                X = np.vstack((X, y))
              File "/nethome/mdurante/miniconda3/envs/bds_atac/lib/python2.7/site-packages/numpy/core/shape_base.py", line 230, in vstack
                return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
            ValueError: all the input array dimensions except for the concatenation axis must match exactly
    StdOut (100000000 lines)  :
            Default case invoked for:
               opcode  = 153, "LoopLimit"
leepc12 commented 7 years ago

It looks like your local version of metaseq python package conflicts with numpy in the conda environment. Can you try the following to debug?

$ python
>>> import metaseq
mdurante1 commented 7 years ago

When running the command I obtained:

~/.local/lib/python2.7/site-packages/subprocess32.py

I proceeded to remove the subprocess32 files and removed my local installation of metaseq but I am still receiving the same error after ataqc runs for ~8 hours. Also the TSS plot (*.trim.PE2SE_vplot.png) is being generated so it seems that metaseq is running partly. Any further suggestions?

Here is the log file from the latest run: https://gist.github.com/mdurante1/0f479edd2beefd54db25da5499a6c04d

vervacity commented 7 years ago

Hi,

I see from your comment that you might have an older version of ATAqC (we've renamed vplot.png to tss-enrich.png) - could you try uninstall the whole pipeline and then reinstall, making sure to clone recursively to get the package? And also get the latest version of the genome annotations, downloading as described in the instructions.

Thanks! Daniel

mdurante1 commented 7 years ago

I re-installed the pipeline last week so this is a relatively new install. I just uninstalled all dependencies, removed the whole pipeline, and cloned it recursively using "git clone https://github.com/kundajelab/atac_dnase_pipelines --recursive" I reinstalled the dependencies and then checked the "run_ataqc.py" file in the ~/atac_dnase_pipelines/ataqc/ directory and it seems as if the old ATAqC version is being loaded:

def make_vplot(bam_file, tss, prefix, chromsizes, read_len, bins=400, bp_edge=2000,
               processes=8, greenleaf_norm=True):
    '''
    Take bootstraps, generate V-plots, and get a mean and
    standard deviation on the plot. Produces 2 plots. One is the
    aggregation plot alone, while the other also shows the signal
    at each TSS ordered by strength.
    '''
    logging.info('Generating vplot...')
    vplot_file = '{0}_vplot.png'.format(prefix)
    vplot_large_file = '{0}_large_vplot.png'.format(prefix)

Should the new version of ATAqC be loaded automatically or do I have to clone it separately?

vervacity commented 7 years ago

Sorry, we were a commit behind in the sync with ATAqC. It's updated now, please try again. Thanks!

mdurante1 commented 7 years ago

Hi Daniel,

After reinstalling the updated pipeline and redownloading the genomes, I ran the samples again and I am obtaining the same error I obtained previously. I checked the qc directory and I now see the properly named TSS plot (.trim.PE2SE_large_tss-enrich.png) and a gc plot (.trim.PE2SE_gcPlot.pdf), which was not there before. If I add -no_ataqc to the script and resubmit the job it completes the analysis so it seems to be a specific ATAqC problem. Any other thoughts on how to fix this error?

Thanks! Michael

vervacity commented 7 years ago

Hi Michael,

Looking more carefully at this, we've actually resolved your issue - this is a new downstream issue, related to comparing your samples to publicly available ATAC. Can you look in your qc folder and look for a test.log file if you still have it? That might have some more information for me to look at to see what's going on.

Daniel

mdurante1 commented 7 years ago

Here is the test.log file from the latest run: https://gist.github.com/mdurante1/8dc28ce6195b7b54fdfee7054524082c

Hope this helps! Michael

vervacity commented 7 years ago

Could you do two things for me: 1) In your error log, it makes a note that your BAM index file (.bai) is older than the BAM file. Could you do a clean re-run (if it doesn't take too long to do the re-run)? 2) Could you run which conda in your terminal where you run BDS and share the output path?

Daniel

mdurante1 commented 7 years ago

Hi Daniel,

  1. I can do a clean re-run although this will take 3-4 days. I will post the output when the run is complete.
  2. when I do which conda the output is: ~/miniconda3/envs/bds_atac/bin/conda

Michael

vervacity commented 7 years ago

Sounds good. I've also received feedback from someone else who ran into this issue:

"I removed miniconda and reinstalled miniconda version 4.0.5. I then removed/re-cloned the bds_atac pipeline. I then used uninstall_dependencies.sh and then install_dependencies.sh"

So basically try a full uninstall, set up miniconda first, then clone the pipeline, then the dependencies. Hope this helps!

vervacity commented 7 years ago

Also to do a quick test, do consider subsampling your FASTQ files and just running one of them to make sure that it can make it through ATAqC. This will decrease the time to debug before having to do a full re-run of everything :)

mdurante1 commented 7 years ago

Hi Daniel,

I did a clean run through and I am now getting a different error:

Traceback (most recent call last):
  File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1598, in <module>
    main()
  File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1413, in main
    [flagstat, mapped_count] = get_samtools_flagstat(ALIGNED_BAM)
  File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 589, in get_samtools_flagstat
    return flagstat, mapped_reads
UnboundLocalError: local variable 'mapped_reads' referenced before assignment

Here is the log file from the latest run: https://gist.github.com/mdurante1/ada64d8f842812e984857a7ac90c6069

I have also included the test.log file: https://gist.github.com/mdurante1/30bf1378c550948a653122fd5c184d3c

Thanks, Michael

vervacity commented 7 years ago

It's a pysam issue - there are different outputs in versions 0.9+ (https://github.com/pysam-developers/pysam/issues/245 in case you're curious).

We've pushed a hot fix, go ahead and pull the pipeline again (recursively to get the ataqc submodule). no need to reinstall anything.

mdurante1 commented 7 years ago

Hi Daniel,

I did another clean run through with all the suggested changes and I am now getting a different error:

                Traceback (most recent call last):
                  File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1598, in <module>
                    main()
                  File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1413, in main
                    [flagstat, mapped_count] = get_samtools_flagstat(ALIGNED_BAM)
                  File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 582, in get_samtools_flagstat
                    results = pysam.flagstat(bam_file, split_lines=TRUE)
                NameError: global name 'TRUE' is not defined

Here's the log file of this run: https://gist.github.com/mdurante1/caee77db53813e0c990f35abf5d3cace

test.log: https://gist.github.com/mdurante1/bab061d3e306b6f0e9412448c22e851e

Thanks, Michael

vervacity commented 7 years ago

Hi Michael,

Thanks - so I made a big goof and used R syntax accidentally. It's fixed and in the latest ataqc module, so go ahead and do the same as before (just pull the pipeline recursively to get the submodule, no reinstallations needed) and that bug should go away.

Daniel

mdurante1 commented 7 years ago

Hi Daniel,

Your fixed worked for that bug but now I am receiving the original error again:

        Traceback (most recent call last):
          File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1598, in <module>
            main()
          File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1460, in main
            ROADMAP_META, OUTPUT_PREFIX)
          File "/nethome/mdurante/atac_dnase_pipelines/ataqc/run_ataqc.py", line 905, in compare_to_roadmap
            sample_mean0_col)
          File "/nethome/mdurante/miniconda3/envs/bds_atac/lib/python2.7/site-packages/scipy/stats/stats.py", line 3310, in spearmanr
            rs = np.corrcoef(ar, br, rowvar=axisout)
          File "/nethome/mdurante/miniconda3/envs/bds_atac/lib/python2.7/site-packages/numpy/lib/function_base.py", line 2145, in corrcoef
            c = cov(x, y, rowvar)
          File "/nethome/mdurante/miniconda3/envs/bds_atac/lib/python2.7/site-packages/numpy/lib/function_base.py", line 2024, in cov
            X = np.vstack((X, y))
          File "/nethome/mdurante/miniconda3/envs/bds_atac/lib/python2.7/site-packages/numpy/core/shape_base.py", line 230, in vstack
            return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
        ValueError: all the input array dimensions except for the concatenation axis must match exactly

Here are the corresponding log files: https://gist.github.com/mdurante1/20997f4b9a57f89f8c10464b4910f5c5 https://gist.github.com/mdurante1/a52c7db10f97e1e0748da4de0ce9be3e

rfarouni commented 7 years ago

@vervacity Hi Daniel,

I am also having the same problem. I get this error

        # SYS command. line 145

         TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."
    StdErr (100000000 lines)  :
        Picked up _JAVA_OPTIONS: -Xms256M -Xmx15G -XX:ParallelGCThreads=1
        ERROR   2017-05-18 23:04:55 ProcessExecutor Warning messages:
        ERROR   2017-05-18 23:04:55 ProcessExecutor 1: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
        ERROR   2017-05-18 23:04:55 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
        ERROR   2017-05-18 23:04:55 ProcessExecutor 2: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
        ERROR   2017-05-18 23:04:55 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
        ERROR   2017-05-18 23:04:55 ProcessExecutor 3: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
        ERROR   2017-05-18 23:04:55 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
        ERROR   2017-05-18 23:04:55 ProcessExecutor 4: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
        ERROR   2017-05-18 23:04:55 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
        ERROR   2017-05-18 23:04:55 ProcessExecutor 5: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
        ERROR   2017-05-18 23:04:55 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
        ERROR   2017-05-18 23:04:55 ProcessExecutor 6: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
        ERROR   2017-05-18 23:04:55 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
        ERROR   2017-05-18 23:04:55 ProcessExecutor 7: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
        ERROR   2017-05-18 23:04:55 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
        Picked up _JAVA_OPTIONS: -Xms256M -Xmx15G -XX:ParallelGCThreads=1
        [bam_sort_core] merging from 26 files...
        [bam_sort_core] merging from 26 files...
        Picked up _JAVA_OPTIONS: -Xms256M -Xmx15G -XX:ParallelGCThreads=1
        Traceback (most recent call last):
          File "/hd2/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1598, in <module>
            main()
          File "/hd2/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1413, in main
            [flagstat, mapped_count] = get_samtools_flagstat(ALIGNED_BAM)
          File "/hd2/atac_dnase_pipelines/ataqc/run_ataqc.py", line 582, in get_samtools_flagstat
            results = pysam.flagstat(bam_file, split_lines=TRUE)
        NameError: global name 'TRUE' is not defined

Fatal error: /hd2/atac_dnase_pipelines/atac.bds, line 1525, pos 2. Task/s failed.
atac.bds, line 74 : main()
atac.bds, line 77 : void main() { // atac pipeline starts here
atac.bds, line 97 :     ataqc()
atac.bds, line 1514 :   void ataqc() {
atac.bds, line 1525 :       wait

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.

Here is the test. log file in the qc folder.

INFO:root:Reading bowtie alignment log...
INFO:root:18507448 reads; of these:
INFO:root:18507448 (100.00%) were paired; of these:
INFO:root:189484 (1.02%) aligned concordantly 0 times
INFO:root:8274240 (44.71%) aligned concordantly exactly 1 time
INFO:root:10043724 (54.27%) aligned concordantly >1 times
INFO:root:----
INFO:root:189484 pairs aligned concordantly 0 times; of these:
INFO:root:26354 (13.91%) aligned discordantly 1 time
INFO:root:----
INFO:root:163130 pairs aligned 0 times concordantly or discordantly; of these:
INFO:root:326260 mates make up the pairs; of these:
INFO:root:233407 (71.54%) aligned 0 times
INFO:root:21280 (6.52%) aligned exactly 1 time
INFO:root:71573 (21.94%) aligned >1 times
INFO:root:99.37% overall alignment rate
INFO:root:Getting mitochondrial chromosome fraction...
INFO:root:Getting GC bias...
INFO:root:java -Xmx4G -jar /home/rick/anaconda3/envs/bds_atac/bin/../share/picard*/picard.jar CollectGcBiasMetrics R=/hd2/bds_genome_data/mm9/mm9.fa I=/hd2/atacseq_taf/output/ATAC_5lKO/out/align/rep2/ATAC_5_L002_R1_001.trim.PE2SE.nodup.bam O=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5_L002_R1_001.trim.PE2SE_gc.txt VERBOSITY=ERROR QUIET=TRUE ASSUME_SORTED=FALSE CHART=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5_L002_R1_001.trim.PE2SE_gcPlot.pdf S=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5_L002_R1_001.trim.PE2SE_gcSummary.txt
INFO:root:Running preseq...
INFO:root:preseq lc_extrap -P -B -o /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5_L002_R1_001.trim.PE2SE.preseq.dat /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5_L002_R1_001.trim.PE2SE.sorted.bam -v 2> /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5_L002_R1_001.trim.PE2SE.preseq.log
INFO:root:samtools mapq 30...
INFO:root:Running Picard MarkDuplicates...
INFO:root:samtools flagstat...
vervacity commented 7 years ago

@rfarouni you probably grabbed the commit right before we corrected it - go ahead and pull the latest version (making sure to do recursive to get the latest ataqc module) and it should work.

@mdurante1 turns out that we had an annotation switch in mm9. this has now been corrected and should run properly. please pull the latest commit and download the latest genome data annotations. (also as an aside, I would personally recommend making the jump to mm10, if it fits with your other analyses!)

rfarouni commented 7 years ago

@vervacity Thanks. That fixed it, but another error came up several steps down the pipeline. Here is the error

--------------------Stderr--------------------
Picked up _JAVA_OPTIONS: -Xms256M -Xmx20G -XX:ParallelGCThreads=1
ERROR   2017-05-20 17:51:40 ProcessExecutor Warning messages:
ERROR   2017-05-20 17:51:40 ProcessExecutor 1: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
ERROR   2017-05-20 17:51:40 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
ERROR   2017-05-20 17:51:40 ProcessExecutor 2: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
ERROR   2017-05-20 17:51:40 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
ERROR   2017-05-20 17:51:40 ProcessExecutor 3: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
ERROR   2017-05-20 17:51:40 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
ERROR   2017-05-20 17:51:40 ProcessExecutor 4: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
ERROR   2017-05-20 17:51:40 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
ERROR   2017-05-20 17:51:40 ProcessExecutor 5: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
ERROR   2017-05-20 17:51:40 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
ERROR   2017-05-20 17:51:40 ProcessExecutor 6: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
ERROR   2017-05-20 17:51:40 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
ERROR   2017-05-20 17:51:40 ProcessExecutor 7: In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH,  :
ERROR   2017-05-20 17:51:40 ProcessExecutor   zero-length arrow is of indeterminate angle and so skipped
Picked up _JAVA_OPTIONS: -Xms256M -Xmx20G -XX:ParallelGCThreads=1
[bam_sort_core] merging from 26 files...
[bam_sort_core] merging from 26 files...
Picked up _JAVA_OPTIONS: -Xms256M -Xmx20G -XX:ParallelGCThreads=1
Picked up _JAVA_OPTIONS: -Xms256M -Xmx20G -XX:ParallelGCThreads=1
processing chromosomes
Traceback (most recent call last):
  File "/hd2/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1598, in 
    main()
  File "/hd2/atac_dnase_pipelines/ataqc/run_ataqc.py", line 1460, in main
    ROADMAP_META, OUTPUT_PREFIX)
  File "/hd2/atac_dnase_pipelines/ataqc/run_ataqc.py", line 905, in compare_to_roadmap
    sample_mean0_col)
  File "/home/rick/anaconda3/envs/bds_atac/lib/python2.7/site-packages/scipy/stats/stats.py", line 3310, in spearmanr
    rs = np.corrcoef(ar, br, rowvar=axisout)
  File "/home/rick/anaconda3/envs/bds_atac/lib/python2.7/site-packages/numpy/lib/function_base.py", line 2145, in corrcoef
    c = cov(x, y, rowvar)
  File "/home/rick/anaconda3/envs/bds_atac/lib/python2.7/site-packages/numpy/lib/function_base.py", line 2024, in cov
    X = np.vstack((X, y))
  File "/home/rick/anaconda3/envs/bds_atac/lib/python2.7/site-packages/numpy/core/shape_base.py", line 230, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

and here is the test.log file

INFO:root:Reading bowtie alignment log...
INFO:root:18507448 reads; of these:
INFO:root:18507448 (100.00%) were paired; of these:
INFO:root:189484 (1.02%) aligned concordantly 0 times
INFO:root:8274240 (44.71%) aligned concordantly exactly 1 time
INFO:root:10043724 (54.27%) aligned concordantly >1 times
INFO:root:----
INFO:root:189484 pairs aligned concordantly 0 times; of these:
INFO:root:26354 (13.91%) aligned discordantly 1 time
INFO:root:----
INFO:root:163130 pairs aligned 0 times concordantly or discordantly; of these:
INFO:root:326260 mates make up the pairs; of these:
INFO:root:233407 (71.54%) aligned 0 times
INFO:root:21280 (6.52%) aligned exactly 1 time
INFO:root:71573 (21.94%) aligned >1 times
INFO:root:99.37% overall alignment rate
INFO:root:Getting mitochondrial chromosome fraction...
INFO:root:Getting GC bias...
INFO:root:java -Xmx4G -jar /home/rick/anaconda3/envs/bds_atac/bin/../share/picard*/picard.jar CollectGcBiasMetrics R=/hd2/bds_genome_data/mm9/mm9.fa I=/hd2/atacseq_taf/output/ATAC_5lKO/out/align/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE.nodup.bam O=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE_gc.txt VERBOSITY=ERROR QUIET=TRUE ASSUME_SORTED=FALSE CHART=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE_gcPlot.pdf S=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE_gcSummary.txt
INFO:root:Running preseq...
INFO:root:preseq lc_extrap -P -B -o /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE.preseq.dat /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE.sorted.bam -v 2> /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE.preseq.log
INFO:root:samtools mapq 30...
INFO:root:Running Picard MarkDuplicates...
INFO:root:samtools flagstat...
INFO:root:37014896 + 0 in total (QC-passed reads + QC-failed reads)
INFO:root:0 + 0 secondary
INFO:root:0 + 0 supplementary
INFO:root:0 + 0 duplicates
INFO:root:36781489 + 0 mapped (99.37%:-nan%)
INFO:root:37014896 + 0 paired in sequencing
INFO:root:18507448 + 0 read1
INFO:root:18507448 + 0 read2
INFO:root:36635928 + 0 properly paired (98.98%:-nan%)
INFO:root:36758530 + 0 with itself and mate mapped
INFO:root:22959 + 0 singletons (0.06%:-nan%)
INFO:root:33546 + 0 with mate mapped to a different chr
INFO:root:17237 + 0 with mate mapped to a different chr (mapQ>=5)
INFO:root:final read counts...
INFO:root:insert size distribution...
INFO:root:java -Xmx4G -jar /home/rick/anaconda3/envs/bds_atac/bin/../share/picard*/picard.jar CollectInsertSizeMetrics INPUT=/hd2/atacseq_taf/output/ATAC_5lKO/out/align/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE.nodup.bam OUTPUT=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE.inserts.hist_data.log H=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE.inserts.hist_graph.pdf VERBOSITY=ERROR QUIET=TRUE W=1000 STOP_AFTER=5000000
INFO:root:Generating tss plot...
INFO:root:signal to noise...
INFO:root:bigWigAverageOverBed /hd2/atacseq_taf/output/ATAC_5lKO/out/signal/macs2/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE.nodup.tn5.pf.pval.signal.bigwig /hd2/bds_genome_data/mm9/ataqc/mm9_univ_dhs_ucsc.from_mm10.bed.gz /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep2/ATAC_5lKO_TAAGGCGA_L002_R1_001.trim.PE2SE.signal
vervacity commented 7 years ago

@rfarouni ok I see. You need to update your species config and files - go ahead and re-download the genome data (https://github.com/kundajelab/atac_dnase_pipelines#genome-data) and make sure in your species file ([SPECIES FILE] as noted in the installation instructions) that you have a line has something like this under [mm9]: reg2map_bed = /path/to/pipeline_genome_data/mm9/ataqc/mm9_dhs_universal_ucsc_v1.bed.gz

rfarouni commented 7 years ago

@vervacity I re-downloaded the genome data, but there is no such file in my genome_data folder. The file mm9/ataqc/mm9_dhs_universal_ucsc_v1.bed.gz is not there. I have this though dnase = /hd2/bds_genome_data/mm9/ataqc/mm9_univ_dhs_ucsc.from_mm10.bed.gz

vervacity commented 7 years ago

@rfarouni so I'm seeing that it's listed in the latest genome data download script (https://github.com/kundajelab/atac_dnase_pipelines/blob/master/install_genome_data.sh#L64) and is also available in our genome data folder (http://mitra.stanford.edu/kundaje/genome_data/mm9/ataqc/) - could you grab the latest genome download script and try again?

rfarouni commented 7 years ago

I downloaded the file and added the filepath in the species file. It fixed it and the final report gets produced, but I still get one error

# SYS command. line 30

 if [[ -f $(which conda) && $(conda env list | grep bds_atac | wc -l) != "0" ]]; then source activate bds_atac; sleep 5; fi;  export PATH=/hd2/atac_dnase_pipelines/.:/hd2/atac_dnase_pipelines/modules:/hd2/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

# SYS command. line 32

 zcat /hd2/atacseq_taf/output/ATAC_5lKO/out/peak/macs2/pooled_rep/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5_pooled.pf.filt.gappedPeak.gz | sort -k1,1 -k2,2n > /hd2/atacseq_taf/output/ATAC_5lKO/out/peak/macs2/pooled_rep/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5_pooled.pf.filt.gappedPeak.bb.tmp

# SYS command. line 33

 cat /hd2/bds_genome_data/mm9/mm9/mm9.chrom.sizes | grep -P 'chr[\dXY]+[ \t]' > /hd2/atacseq_taf/output/ATAC_5lKO/out/peak/macs2/pooled_rep/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5_pooled.pf.filt.gappedPeak.bb.chrsz.tmp

# SYS command. line 34

 bedToBigBed -type=bed12+3 -as=/hd2/atac_dnase_pipelines/etc/gappedPeak.as /hd2/atacseq_taf/output/ATAC_5lKO/out/peak/macs2/pooled_rep/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5_pooled.pf.filt.gappedPeak.bb.tmp /hd2/atacseq_taf/output/ATAC_5lKO/out/peak/macs2/pooled_rep/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5_pooled.pf.filt.gappedPeak.bb.chrsz.tmp /hd2/atacseq_taf/output/ATAC_5lKO/out/peak/macs2/pooled_rep/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5_pooled.pf.filt.gappedPeak.bb

# SYS command. line 35

 rm -f /hd2/atacseq_taf/output/ATAC_5lKO/out/peak/macs2/pooled_rep/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5_pooled.pf.filt.gappedPeak.bb.tmp /hd2/atacseq_taf/output/ATAC_5lKO/out/peak/macs2/pooled_rep/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5_pooled.pf.filt.gappedPeak.bb.chrsz.tmp

# SYS command. line 39

 TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."

End coordinate 61342464 bigger than chr19 size of 61342430 line 180743 of /hd2/atacseq_taf/output/ATAC_5lKO/out/peak/macs2/pooled_rep/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5_pooled.pf.filt.gappedPeak.bb.tmp

the test log

INFO:root:Reading bowtie alignment log...
INFO:root:18014559 reads; of these:
INFO:root:18014559 (100.00%) were paired; of these:
INFO:root:226896 (1.26%) aligned concordantly 0 times
INFO:root:8040724 (44.63%) aligned concordantly exactly 1 time
INFO:root:9746939 (54.11%) aligned concordantly >1 times
INFO:root:----
INFO:root:226896 pairs aligned concordantly 0 times; of these:
INFO:root:49707 (21.91%) aligned discordantly 1 time
INFO:root:----
INFO:root:177189 pairs aligned 0 times concordantly or discordantly; of these:
INFO:root:354378 mates make up the pairs; of these:
INFO:root:232838 (65.70%) aligned 0 times
INFO:root:21941 (6.19%) aligned exactly 1 time
INFO:root:99599 (28.11%) aligned >1 times
INFO:root:99.35% overall alignment rate
INFO:root:Getting mitochondrial chromosome fraction...
INFO:root:Getting GC bias...
INFO:root:java -Xmx4G -jar /home/rick/anaconda3/envs/bds_atac/bin/../share/picard*/picard.jar CollectGcBiasMetrics R=/hd2/bds_genome_data/mm9/mm9/mm9.fa I=/hd2/atacseq_taf/output/ATAC_5lKO/out/align/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.bam O=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE_gc.txt VERBOSITY=ERROR QUIET=TRUE ASSUME_SORTED=FALSE CHART=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE_gcPlot.pdf S=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE_gcSummary.txt
INFO:root:Running preseq...
INFO:root:preseq lc_extrap -P -B -o /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE.preseq.dat /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE.sorted.bam -v 2> /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE.preseq.log
INFO:root:samtools mapq 30...
INFO:root:Running Picard MarkDuplicates...
INFO:root:samtools flagstat...
INFO:root:36029118 + 0 in total (QC-passed reads + QC-failed reads)
INFO:root:0 + 0 secondary
INFO:root:0 + 0 supplementary
INFO:root:0 + 0 duplicates
INFO:root:35796280 + 0 mapped (99.35%:-nan%)
INFO:root:36029118 + 0 paired in sequencing
INFO:root:18014559 + 0 read1
INFO:root:18014559 + 0 read2
INFO:root:35575326 + 0 properly paired (98.74%:-nan%)
INFO:root:35772440 + 0 with itself and mate mapped
INFO:root:23840 + 0 singletons (0.07%:-nan%)
INFO:root:40314 + 0 with mate mapped to a different chr
INFO:root:16449 + 0 with mate mapped to a different chr (mapQ>=5)
INFO:root:final read counts...
INFO:root:insert size distribution...
INFO:root:java -Xmx4G -jar /home/rick/anaconda3/envs/bds_atac/bin/../share/picard*/picard.jar CollectInsertSizeMetrics INPUT=/hd2/atacseq_taf/output/ATAC_5lKO/out/align/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.bam OUTPUT=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE.inserts.hist_data.log H=/hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE.inserts.hist_graph.pdf VERBOSITY=ERROR QUIET=TRUE W=1000 STOP_AFTER=5000000
INFO:root:Generating tss plot...
INFO:root:signal to noise...
INFO:root:bigWigAverageOverBed /hd2/atacseq_taf/output/ATAC_5lKO/out/signal/macs2/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE.nodup.tn5.pf.pval.signal.bigwig /hd2/bds_genome_data/mm9/mm9/ataqc/mm9_dhs_universal_ucsc_v1.bed.gz /hd2/atacseq_taf/output/ATAC_5lKO/out/qc/rep1/ATAC_5lKO_L001_R1_001.trim.PE2SE.signal
INFO:root:ENCFF001PCS   SpearmanrResult(correlation=0.033773658301836137, pvalue=0.0)
INFO:root:ENCFF001PDB   SpearmanrResult(correlation=0.045394639569752246, pvalue=0.0)
INFO:root:ENCFF001PDP   SpearmanrResult(correlation=0.03117305766539559, pvalue=0.0)
INFO:root:ENCFF001PEA   SpearmanrResult(correlation=0.064350783611477114, pvalue=0.0)
INFO:root:ENCFF001QEN   SpearmanrResult(correlation=0.11751503760067165, pvalue=0.0)
INFO:root:ENCFF001QBV   SpearmanrResult(correlation=0.058886984313049347, pvalue=0.0)
INFO:root:ENCFF001PZU   SpearmanrResult(correlation=0.048606934562090573, pvalue=0.0)
INFO:root:ENCFF001PFG   SpearmanrResult(correlation=0.058729132628544053, pvalue=0.0)
INFO:root:ENCFF001PGU   SpearmanrResult(correlation=0.11441127014544508, pvalue=0.0)
INFO:root:ENCFF001PHE   SpearmanrResult(correlation=0.084972152825066191, pvalue=0.0)
INFO:root:ENCFF001PHL   SpearmanrResult(correlation=0.10805913535450433, pvalue=0.0)
INFO:root:ENCFF001PHS   SpearmanrResult(correlation=0.092773761325087106, pvalue=0.0)
INFO:root:ENCFF001PIB   SpearmanrResult(correlation=0.099420586283102103, pvalue=0.0)
INFO:root:ENCFF001PNK   SpearmanrResult(correlation=0.17334718873462551, pvalue=0.0)
INFO:root:ENCFF001PVW   SpearmanrResult(correlation=0.089321528179288359, pvalue=0.0)
INFO:root:ENCFF001PIK   SpearmanrResult(correlation=0.32975026450162248, pvalue=0.0)
INFO:root:ENCFF001PIY   SpearmanrResult(correlation=0.32913102263702887, pvalue=0.0)
INFO:root:ENCFF001PKP   SpearmanrResult(correlation=0.092786146689878099, pvalue=0.0)
INFO:root:ENCFF001PKY   SpearmanrResult(correlation=0.086798762024097684, pvalue=0.0)
INFO:root:ENCFF001PLI   SpearmanrResult(correlation=0.19039392295092475, pvalue=0.0)
INFO:root:ENCFF001LTZ   SpearmanrResult(correlation=0.15161944771329813, pvalue=0.0)
INFO:root:ENCFF001LUI   SpearmanrResult(correlation=0.072168420005018871, pvalue=0.0)
INFO:root:ENCFF001PLW   SpearmanrResult(correlation=0.069049288296783431, pvalue=0.0)
INFO:root:ENCFF001PMI   SpearmanrResult(correlation=0.14039189856903245, pvalue=0.0)
INFO:root:ENCFF001PMR   SpearmanrResult(correlation=0.17890894073017335, pvalue=0.0)
INFO:root:ENCFF001QCS   SpearmanrResult(correlation=0.067259461370709289, pvalue=0.0)
INFO:root:ENCFF001PNU   SpearmanrResult(correlation=0.11656677312462088, pvalue=0.0)
INFO:root:ENCFF001PON   SpearmanrResult(correlation=0.090301961635109354, pvalue=0.0)
INFO:root:ENCFF001PPI   SpearmanrResult(correlation=0.14340252460503231, pvalue=0.0)
INFO:root:ENCFF001PTV   SpearmanrResult(correlation=0.088709205487457948, pvalue=0.0)
INFO:root:ENCFF001PUI   SpearmanrResult(correlation=0.041341040870329318, pvalue=0.0)
INFO:root:ENCFF001PWJ   SpearmanrResult(correlation=0.053202448119509249, pvalue=0.0)
INFO:root:ENCFF001PUV   SpearmanrResult(correlation=0.20143252463051384, pvalue=0.0)
INFO:root:ENCFF001QBE   SpearmanrResult(correlation=0.085697950810573756, pvalue=0.0)
INFO:root:ENCFF001PWW   SpearmanrResult(correlation=0.12324226976911377, pvalue=0.0)
INFO:root:ENCFF001PXH   SpearmanrResult(correlation=0.071244657821574187, pvalue=0.0)
INFO:root:ENCFF001OQF   SpearmanrResult(correlation=0.18337167266240909, pvalue=0.0)
INFO:root:ENCFF001PYR   SpearmanrResult(correlation=0.14473639332147181, pvalue=0.0)
INFO:root:ENCFF001PZE   SpearmanrResult(correlation=0.07972817386565216, pvalue=0.0)
INFO:root:ENCFF001PES   SpearmanrResult(correlation=0.069160517857457471, pvalue=0.0)
INFO:root:ENCFF001PFY   SpearmanrResult(correlation=0.075503324225279431, pvalue=0.0)
INFO:root:ENCFF001QAG   SpearmanrResult(correlation=0.1032940188731358, pvalue=0.0)
INFO:root:ENCFF001PKH   SpearmanrResult(correlation=0.30859281272295885, pvalue=0.0)
INFO:root:ENCFF001QHL   SpearmanrResult(correlation=0.28408478077383448, pvalue=0.0)
vervacity commented 7 years ago

@rfarouni can you pass along that gapped Peak file that had the error so that we can replicate it? thanks!

mdurante1 commented 7 years ago

Hi Daniel,

When I downloaded the latest mm9 genome annotations I still received the same concatenation error. I tried running the pipeline with mm10 and did not receive this error. My future analyses will be conducted with mm10 so I will close this issue. Thanks for all your help I really appreciate it!

Best regards, Michael

sckinta commented 6 years ago

Hi Daniel, I am wondering whether you have fixed mm9 genome annotation problem after previous issue post