kaizhang / Taiji

This project has been moved to:
https://github.com/Taiji-pipeline/Taiji
9 stars 3 forks source link

Link_TF_gene: Failed! #2

Closed smanne07 closed 7 years ago

smanne07 commented 7 years ago

Hi Kai, Congratulations on putting together a wonderful integration pipeline,i was testing this on one of our ms datasets and i encountered the following error.

[WARN][04-29 22:18] Link_TF_gene: Failed! [ERROR][04-29 22:18] "Link_TF_gene" failed. The error was: src/Bio/GO/GREAT.hs:38:5-82: Irrefutable pattern failed for pattern (rs, Just (r, a)) .

Can you please help troubleshooting this?

Best regards Sasi

kaizhang commented 7 years ago

Could you show me the first 100 lines of your genome annotation GTF file?

smanne07 commented 7 years ago

Hi, Find attached. gtf_top100.txt

kaizhang commented 7 years ago

Thanks! I found that the chromosome names in the GTF file are 1,2,3, ... So please make sure the genome fasta files also use the same naming convention. For example, the first line should be ">1" instead of ">chr1".

smanne07 commented 7 years ago

Thank you,i will change it and run again to see how it works.

kaizhang commented 7 years ago

@smanne07 Remember to delete all old genome indices, as well as the cache file -- "sciflow.db", because you need a fresh run.

If you prefer the "chr1" naming convention, you can change the names in your GTF file. In this case, you don't have to re-generate genome indices.

smanne07 commented 7 years ago

Hi Kai,

Sure will do that. Thanks

npklein commented 7 years ago

@kaizhang I got the same error and used ensembl gtf and 1000G ref genome and they should have same header

(e.g. genome.fa starts with >1 dna:chromosome chromosome:GRCh37:1:1:249250621:1, and the gtf file starts with 1 pseudogene gene 11869 14412 . )

Is it possible to get more debug information, e.g. what commands/lines of code where used to run GREAT.hs?

Thanks!

kaizhang commented 7 years ago

@npklein Could you also show me the first 100 lines of your GTF file, as well as the error message? Since I found this error is quite common, I will definitely add some codes to identify incompatible GTF file in future. If you are interested, the code that throws this error is here (copied from https://github.com/kaizhang/bioinformatics-toolkit/blob/master/bioinformatics-toolkit/src/Bio/GO/GREAT.hs):

-- | Given a gene list and the rule, compute the rulatory domain for each gene
getRegulatoryDomains :: AssocRule -> [Gene a] -> [(BED3, a)]
getRegulatoryDomains _ [] = error "No gene available for domain assignment!"
getRegulatoryDomains (BasalPlusExtension up dw ext) genes = (extendTail r ext, a) : rs
  where
    (rs, Just (r,a)) = foldl' f ([], Nothing) $ sortBy (compareBed `on` fst) basal
    f (acc, Nothing) (b,x) = (acc, Just (extendHead b ext, x))
    f (acc, Just (b', x')) (b,x)
        | chrom b' /= chrom b = ( (extendTail b' ext, x') : acc
                                , Just (extendHead b ext, x) )
        | chromEnd b' >= chromStart b = ((b',x') : acc, Just (b,x))
        | otherwise = let ext' = min ext $ (chromStart b - chromEnd b') `div` 2
                      in ((extendTail b' ext', x') : acc, Just (extendHead b ext', x))
    extendHead (BED3 chr s e) l | s - l >= 0 = BED3 chr (s-l) e
                                | otherwise = BED3 chr 0 e
    extendTail (BED3 chr s e) l = BED3 chr s (e+l)
    basal = flip map genes $ \((chr, tss, str), x) ->
        if str then (BED3 chr (tss - up) (tss + dw), x)
               else (BED3 chr (tss - dw) (tss + up), x)
getRegulatoryDomains _ _ = undefined
{-# INLINE getRegulatoryDomains #-}

During this step (Link_TF_gene), the program needs to identify active promoters. This is done by first extracting the regions around transcripts' start sites (referred to as G1), and then overlapping these regions with open chromatin (referred to as G2). If the program identify 0 region, it will raise an error. When the chr names in G1 and G2 don't match, it will certainly result into 0 overlap. Another possibility is that the GTF do not contain "transcript" annotations. In this case, G1 has 0 regions, so the result is 0 as well. I do not use "gene" annotations to identify promoters because of the existence of alternative promoters.

npklein commented 7 years ago

@kaizhang Thanks for the explanation! So if there are genes which do not have open chromatin around their start site this would also throw an error?

The first 100 lines: https://gist.github.com/npklein/114939a364164b1ce6c465481576f5aa The headers in genome.fa: https://gist.github.com/npklein/f8fac098fb3d3566e7d1cb6475184aed

edit: I see now that my input.yml is missing Hi-C data, could that cause it?

kaizhang commented 7 years ago

@npklein It throws an error only if ALL genes do not have open chromatin around their start sites, which is impossible unless you get no ATAC-seq peaks from your experiment. Because every peak will be assigned to certain genes (promoters or considered as enhancers). Hi-C is optional, and the GTF file looks fine.

I think the problem may come from the the fasta file. The fasta description lines contain comments. To check this, first go to the "output/ATAC_Seq" directory. Look at one of the files with "_MACS.narrowPeak" suffix. And let me know if they have desired chr names, e.g.,

1 10037863 10038226 NA_peak_14 347 . 8.70037 38.75528 34.71162 190

If the chr names for the peak files are correct, go to the "output/TFBS" directory, look for files with "_tfbs.bed" suffix. Please also let me know the contents of these files.

EDIT: I ran the program on some public data set with ensembl GTF and FASTA today. I didn't see any error, at least for the latest version of Taiji. So the problem may not come from the ensembl GTF and FASTA. I suggest check the peak files and TFBS files first.

npklein commented 7 years ago

The *_MACS.narrowPeak file chr names contain chr (e.g. chr1). Looking through my input files, the BAM files use chr1 instead of 1. I will replace these in the BAM and try again.

Thanks for the help

edit: That worked, finished the complete pipeline now.