Closed smanne07 closed 7 years ago
Could you show me the first 100 lines of your genome annotation GTF file?
Hi, Find attached. gtf_top100.txt
Thanks! I found that the chromosome names in the GTF file are 1,2,3, ... So please make sure the genome fasta files also use the same naming convention. For example, the first line should be ">1" instead of ">chr1".
Thank you,i will change it and run again to see how it works.
@smanne07 Remember to delete all old genome indices, as well as the cache file -- "sciflow.db", because you need a fresh run.
If you prefer the "chr1" naming convention, you can change the names in your GTF file. In this case, you don't have to re-generate genome indices.
Hi Kai,
Sure will do that. Thanks
@kaizhang I got the same error and used ensembl gtf and 1000G ref genome and they should have same header
(e.g. genome.fa starts with >1 dna:chromosome chromosome:GRCh37:1:1:249250621:1, and the gtf file starts with
1 pseudogene gene 11869 14412 .
Is it possible to get more debug information, e.g. what commands/lines of code where used to run GREAT.hs?
Thanks!
@npklein Could you also show me the first 100 lines of your GTF file, as well as the error message? Since I found this error is quite common, I will definitely add some codes to identify incompatible GTF file in future. If you are interested, the code that throws this error is here (copied from https://github.com/kaizhang/bioinformatics-toolkit/blob/master/bioinformatics-toolkit/src/Bio/GO/GREAT.hs):
-- | Given a gene list and the rule, compute the rulatory domain for each gene
getRegulatoryDomains :: AssocRule -> [Gene a] -> [(BED3, a)]
getRegulatoryDomains _ [] = error "No gene available for domain assignment!"
getRegulatoryDomains (BasalPlusExtension up dw ext) genes = (extendTail r ext, a) : rs
where
(rs, Just (r,a)) = foldl' f ([], Nothing) $ sortBy (compareBed `on` fst) basal
f (acc, Nothing) (b,x) = (acc, Just (extendHead b ext, x))
f (acc, Just (b', x')) (b,x)
| chrom b' /= chrom b = ( (extendTail b' ext, x') : acc
, Just (extendHead b ext, x) )
| chromEnd b' >= chromStart b = ((b',x') : acc, Just (b,x))
| otherwise = let ext' = min ext $ (chromStart b - chromEnd b') `div` 2
in ((extendTail b' ext', x') : acc, Just (extendHead b ext', x))
extendHead (BED3 chr s e) l | s - l >= 0 = BED3 chr (s-l) e
| otherwise = BED3 chr 0 e
extendTail (BED3 chr s e) l = BED3 chr s (e+l)
basal = flip map genes $ \((chr, tss, str), x) ->
if str then (BED3 chr (tss - up) (tss + dw), x)
else (BED3 chr (tss - dw) (tss + up), x)
getRegulatoryDomains _ _ = undefined
{-# INLINE getRegulatoryDomains #-}
During this step (Link_TF_gene), the program needs to identify active promoters. This is done by first extracting the regions around transcripts' start sites (referred to as G1), and then overlapping these regions with open chromatin (referred to as G2). If the program identify 0 region, it will raise an error. When the chr names in G1 and G2 don't match, it will certainly result into 0 overlap. Another possibility is that the GTF do not contain "transcript" annotations. In this case, G1 has 0 regions, so the result is 0 as well. I do not use "gene" annotations to identify promoters because of the existence of alternative promoters.
@kaizhang Thanks for the explanation! So if there are genes which do not have open chromatin around their start site this would also throw an error?
The first 100 lines: https://gist.github.com/npklein/114939a364164b1ce6c465481576f5aa The headers in genome.fa: https://gist.github.com/npklein/f8fac098fb3d3566e7d1cb6475184aed
edit: I see now that my input.yml is missing Hi-C data, could that cause it?
@npklein It throws an error only if ALL genes do not have open chromatin around their start sites, which is impossible unless you get no ATAC-seq peaks from your experiment. Because every peak will be assigned to certain genes (promoters or considered as enhancers). Hi-C is optional, and the GTF file looks fine.
I think the problem may come from the the fasta file. The fasta description lines contain comments. To check this, first go to the "output/ATAC_Seq" directory. Look at one of the files with "_MACS.narrowPeak" suffix. And let me know if they have desired chr names, e.g.,
1 10037863 10038226 NA_peak_14 347 . 8.70037 38.75528 34.71162 190
If the chr names for the peak files are correct, go to the "output/TFBS" directory, look for files with "_tfbs.bed" suffix. Please also let me know the contents of these files.
EDIT: I ran the program on some public data set with ensembl GTF and FASTA today. I didn't see any error, at least for the latest version of Taiji. So the problem may not come from the ensembl GTF and FASTA. I suggest check the peak files and TFBS files first.
The *_MACS.narrowPeak file chr names contain chr (e.g. chr1). Looking through my input files, the BAM files use chr1 instead of 1. I will replace these in the BAM and try again.
Thanks for the help
edit: That worked, finished the complete pipeline now.
Hi Kai, Congratulations on putting together a wonderful integration pipeline,i was testing this on one of our ms datasets and i encountered the following error.
[WARN][04-29 22:18] Link_TF_gene: Failed! [ERROR][04-29 22:18] "Link_TF_gene" failed. The error was: src/Bio/GO/GREAT.hs:38:5-82: Irrefutable pattern failed for pattern (rs, Just (r, a)) .
Can you please help troubleshooting this?
Best regards Sasi