kaizhang / Taiji

This project has been moved to:
https://github.com/Taiji-pipeline/Taiji

Find_TF_sites empty #6

Closed npklein closed 7 years ago

npklein commented 7 years ago

Hi @kaizhang, sorry I got another issue and haven't been able to solve it.

The Find_TF_sites step returns nothing for me:

./taiji-Linux-x86_64-static cat Find_TF_sites
[]

So I looked back at the previous steps, and it seems Find_TF_sites_prepare did not get the data either:

./taiji-Linux-x86_64-static cat Find_TF_sites_prepare
_context: output//TFBS//openChromatin.bed
_data: []

But the ATAC_callpeaks step does seem to have worked (I can upload the full file if needed):

./taiji-Linux-x86_64-static cat ATAC_callpeaks | head -n50

- atacseqPairedEnd: false
  atacseqCommon:
    _commonCellType: ''
    _commonEid: time0_DHSeq
    _commonReplicates:
    - replicateFiles:
      - tag: Single
        contents:
          fileFormat: NarrowPeakFile
          fileInfo:
            FRiP: '3.508873427463761e-2'
          fileLocation: output//ATAC_Seq//time0_DHSeq_rep0_MACS.narrowPeak
          fileTags:
          - macs2
      replicateInfo: {}
      replicateNumber: 0
    - replicateFiles: []
      replicateInfo: {}
      replicateNumber: 1
    _commonGroupName: time0

And the .narrowPeak file does get written. I checked whether I had the same problem as before, where the chromosome names had chr in front of them, but this is not the case:

head -n20 output//ATAC_Seq//time0_DHSeq_rep0_MACS.narrowPeak
1 713908 714316 NA_peak_1 481 . 7.36695 52.43988 48.16940 170
1 762804 762976 NA_peak_2 58 . 2.85657 8.47912 5.87589 86
1 825869 825989 NA_peak_3 73 . 3.08209 10.02857 7.35086 43
1 840035 840338 NA_peak_4 141 . 3.98416 17.09233 14.13533 125
1 894460 894793 NA_peak_5 292 . 4.61581 32.71703 29.23805 241
1 894905 895117 NA_peak_6 68 . 2.59799 9.46076 6.81399 123
1 902086 902588 NA_peak_7 144 . 3.59994 17.38742 14.42110 193
1 948459 949247 NA_peak_8 448 . 6.58979 48.90823 44.80482 346
1 954643 955337 NA_peak_9 296 . 4.71723 33.11506 29.62207 460
1 968342 968757 NA_peak_10 83 . 3.23244 11.11313 8.38767 277
1 976015 976301 NA_peak_11 117 . 3.25132 14.62723 11.76132 200
1 994564 994722 NA_peak_12 81 . 3.00771 10.84707 8.13541 42
1 994778 995248 NA_peak_13 229 . 4.51701 26.26077 22.99343 190
1 999410 999535 NA_peak_14 71 . 2.80949 9.82831 7.16280 75
1 999646 999751 NA_peak_15 57 . 2.61376 8.33352 5.74213 33
1 1004554 1004811 NA_peak_16 228 . 4.44142 26.11546 22.85241 173
1 1004887 1005112 NA_peak_17 47 . 2.48832 7.27244 4.73782 140
1 1051278 1052240 NA_peak_18 516 . 5.95168 56.08030 51.62059 336
1 1057654 1057853 NA_peak_19 88 . 3.15172 11.56752 8.82611 120
1 1072729 1072909 NA_peak_20 89 . 3.30761 11.67020 8.92128 126
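The chromosome-naming check described above can also be done programmatically rather than by eye. The following is a minimal sketch, not part of Taiji: the function names are hypothetical, and it assumes only that the chromosome name is the first whitespace-separated column of each peak line.

```python
from collections import Counter
from typing import Iterable


def chromosome_names(lines: Iterable[str]) -> Counter:
    """Count the chromosome names found in column 1 of narrowPeak/BED lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        counts[line.split()[0]] += 1
    return counts


def uses_chr_prefix(lines: Iterable[str]) -> bool:
    """Return True if any chromosome name starts with 'chr'."""
    return any(name.startswith("chr") for name in chromosome_names(lines))


# Two of the peak lines shown above.
sample = [
    "1 713908 714316 NA_peak_1 481 . 7.36695 52.43988 48.16940 170",
    "1 762804 762976 NA_peak_2 58 . 2.85657 8.47912 5.87589 86",
]
print(uses_chr_prefix(sample))  # False: the names are bare, e.g. "1"
```

To check a real file, pass an open file handle instead of the list, e.g. `uses_chr_prefix(open(path))`.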

Any ideas where I went wrong?

kaizhang commented 7 years ago

@npklein This happens because the software did not find any motifs in your motif file, as signaled by _data: []. Where did you get the motifs? The motif file should be in MEME format. You can look at the examples here: https://github.com/kaizhang/Taiji/tree/master/docs/data/motifs
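A quick sanity check of a motif file before running the pipeline might look like the sketch below. It is an illustration only: it assumes the MEME minimal text format (a `MEME version` header and one `MOTIF <id>` line per motif), the helper names and the sample motif are made up, and Taiji's own parser may be stricter than this.

```python
from typing import List


def looks_like_meme(text: str) -> bool:
    """Return True if the first non-blank line starts with 'MEME version'."""
    for line in text.splitlines():
        if line.strip():
            return line.startswith("MEME version")
    return False


def motif_ids(text: str) -> List[str]:
    """Collect the identifier from every 'MOTIF <id> ...' line."""
    ids = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "MOTIF":
            ids.append(fields[1])
    return ids


# Hypothetical minimal MEME file with a single two-column motif.
sample = """MEME version 4

ALPHABET= ACGT

MOTIF example_motif
letter-probability matrix: alength= 4 w= 2 nsites= 10 E= 0
0.25 0.25 0.25 0.25
0.10 0.40 0.40 0.10
"""

print(looks_like_meme(sample))  # True
print(motif_ids(sample))        # ['example_motif']
```

An empty list from `motif_ids` on your own file would be consistent with the empty `_data: []` seen above, although it does not prove that this is the cause.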

npklein commented 7 years ago

@kaizhang The motif file did indeed have some erroneous lines. Using the correct file tho, I get

[WARN][07-15 20:15] Find_TF_sites: Failed!
[ERROR][07-15 20:15] "Find_TF_sites" failed. The error was: Bio.Seq.Query.openGenome: Incorrect format
CallStack (from HasCallStack):
  error, called at src/Bio/Seq/IO.hs:44:14 in bioinformatics-toolkit-0.3.2-6jGTx2VGtZyEsm7mUTIiFH:Bio.Seq.IO.

which I guess comes from https://github.com/kaizhang/bioinformatics-toolkit/blob/master/bioinformatics-toolkit/src/Bio/Seq/IO.hs, where you check whether magic = "<HASKELLBIOINFORMATICS_7d2c5gxhg934>" is at the top of an input file (I can't find which file you are reading here; I'm guessing the genome or genome index file, from the variable names).

I'm not sure how this header would get in either of these files tho, I use the 1000G fasta genome reference and index it with samtools faidx.
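To see whether a given file carries that header, one could compare its first bytes against the magic string. This is a sketch under an assumption: it takes the magic string quoted above and assumes it appears as the very first bytes of the file, which is a reading of "at the top" rather than something confirmed here. The helper names and the sample .fai line are made up for the demonstration.

```python
import os
import tempfile

# Magic string quoted in the comment above; assumed to be the first bytes of the file.
MAGIC = b"<HASKELLBIOINFORMATICS_7d2c5gxhg934>"


def has_magic_header(path: str, magic: bytes = MAGIC) -> bool:
    """Return True if the file at `path` begins with `magic`."""
    with open(path, "rb") as handle:
        return handle.read(len(magic)) == magic


def write_temp(content: bytes) -> str:
    """Write `content` to a temporary file and return its path (demo helper)."""
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(content)
        return handle.name


# A made-up samtools-style .fai line versus a file that starts with the magic string.
fai_like = write_temp(b"1\t249250621\t52\t60\t61\n")
magic_like = write_temp(MAGIC + b"...")

print(has_magic_header(fai_like))    # False
print(has_magic_header(magic_like))  # True

os.remove(fai_like)
os.remove(magic_like)
```

A plain-text index such as the one written by samtools faidx starts with a sequence name, so it would fail this check.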

kaizhang commented 7 years ago

@npklein

################################################################################
# You don't have to physically provide the following files. But you do need to
# specify the locations where these files will be *GENERATED AUTOMATICALLY WHEN
# FILES/DIRECTORIES DOES NOT EXIST*. If the specified directories or files
# already exist, the program will do nothing.
# If this is the first time you run the program, make sure delete existing
# files/directories first so indices can be generated properly.
# You only need to generate the indices once, *THEY CAN BE REUSED*.
################################################################################

# This is the *FILE* containing GENOME SEQUENCE INDEX.
seqIndex: "/home/kai/genome/GRCh38/GRCh38.index"

You should not generate this file yourself. Delete the file you have generated and re-run "Initialization". The program will generate the correct index for you.
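Since the config comments say the program does nothing when the target already exists, it can help to check what `seqIndex` points at before re-running. The sketch below is hypothetical tooling, not part of Taiji: it pulls the `seqIndex` value out of the config text with a simple regular expression (a real YAML parser would be more robust) and reports whether that path currently exists.

```python
import os
import re
from typing import Optional


def seq_index_path(config_text: str) -> Optional[str]:
    """Return the value of the first uncommented 'seqIndex:' entry, if any."""
    pattern = re.compile(r'^\s*seqIndex:\s*["\']?([^"\'#]+?)["\']?\s*(?:#.*)?$')
    for line in config_text.splitlines():
        match = pattern.match(line)
        if match:
            return match.group(1)
    return None


def describe(path: str) -> str:
    """Say whether a pre-existing file would block automatic index generation."""
    if os.path.exists(path):
        return f"{path} exists: delete it first if the program did not create it"
    return f"{path} is absent: it should be generated on the next run"


config = '''
# This is the *FILE* containing GENOME SEQUENCE INDEX.
seqIndex: "/home/kai/genome/GRCh38/GRCh38.index"
'''

path = seq_index_path(config)
print(path)
if path is not None:
    print(describe(path))
```

The second printed line depends on the machine the sketch runs on, so no output is shown for it.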

npklein commented 7 years ago

D'oh! Was using a config without the comments and didn't think about that. I got my ranks now, thanks for all the help!