Fasta files - Githubissues

laranonell commented 6 years ago

Dear Vincent,

I am facing a problem with fasta files

bis1 <- getBamCounts(bed.frame = regions.df, bam.files = file.path("./BAMs_all","qG16018101bis.final.bam"), include.chr = FALSE, referenceFasta = "Z:/RNA_seq/Ref_Genomes_fa/hg19.fa") Reference fasta file provided so ExomeDepth will compute the GC content in each window Error in value[3L] : record 1 (1:1138878-1139350) failed file: Z:/RNA_seq/Ref_Genomes_fa/hg19.fa

I have tried also with hg38.fa (both fasta files are being used by other methods with no problems. If I ommit the fasta files it works but results are not what I expect so I am trying to make the method adjust by GC content as a first trial. Best,

Lara

sessionInfo()

R version 3.4.3 (2017-11-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale: [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] GenomicRanges_1.30.0 GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0 XLConnect_0.2-14
[7] XLConnectJars_0.2-14 ExomeDepth_1.1.10

loaded via a namespace (and not attached): [1] XVector_0.18.0 splines_3.4.3 zlibbioc_1.24.0 GenomicAlignments_1.14.1 BiocParallel_1.12.0
[6] lattice_0.20-35 tools_3.4.3 SummarizedExperiment_1.8.1 grid_3.4.3 Biobase_2.38.0
[11] matrixStats_0.52.2 yaml_2.1.16 Matrix_1.2-12 rJava_0.9-9 GenomeInfoDbData_1.0.0
[16] bitops_1.0-6 RCurl_1.95-4.10 aod_1.3 VGAM_1.0-5 DelayedArray_0.4.1
[21] compiler_3.4.3 Rsamtools_1.30.0 Biostrings_2.46.0

vplagnol commented 6 years ago

Any luck if you set include.chr = TRUE? I think the pbm is a confusion between "chr1" and "1".

laranonell commented 6 years ago

No,

my files do not contain chr (see below). Curious is that if no fasta file is specified it works, fasta contains (as standard fasta format) "chr", could this be the problem?

head(regions.df) chromosome start end name 1 1 1 28000000 1p36 2 6 70000001 164500000 6q13-q26 3 7 98000001 147900000 7q22.1-q35 5 9 19900001 33200000 9p21 6 11 97200001 130800000 11q22.1-q24 7 13 40100001 55300000 13q14

library(Rsamtools) test<-file.path("./BAMs_all","qG16018101bis.final.bam") res.test<-scanBam(test) res.test[[1]]$rname #chr [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [351] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [421] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [491] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [561] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [631] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [701] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [771] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [841] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [911] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [981] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [ reached getOption("max.print") -- omitted 11724840 entries ] 85 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT GL000207.1 GL000226.1 GL000229.1 GL000231.1 GL000210.1 ... NC_007605

vplagnol commented 6 years ago

I think the description of the argument is wrong in the current version of exomeDepth and it does the opposite of what is being stated. What happens if you change the argument? Do you get the same output?

laranonell commented 6 years ago

bis1 <- getBamCounts(bed.frame = regions.df, bam.files = file.path("./BAMs_all","qG16018101bis.final.bam"), include.chr = TRUE, referenceFasta = "Z:/RNA_seq/Ref_Genomes_fa/hg19.fa") Reference fasta file provided so ExomeDepth will compute the GC content in each window Parse 1 BAM files [1] "./BAMs_all/qG16018101bis.final.bam" Now parsing ./BAMs_all/qG16018101bis.final.bam [1] "Problematic sequences:" [1] "chr1" "chr6" "chr7" "chr9" "chr11" "chr13" "chr17" "chr2" "chr3" "chr8" "chr12" "chr18" Error in countBamInGRanges.exomeDepth(bam.file = bam, index = index, granges = target, : Some sequences in the target data frame cannot be found in the index of the BAM file

laranonell commented 6 years ago

I finally solved the issue by removing the chr from my fasta file using Biostrings so now everyting does not contain chr

AndreaG5 commented 2 years ago

I finally solved the issue by removing the chr from my fasta file using Biostrings so now everyting does not contain chr

Hi, I'm having the same problem. My bed and fasta are formatted with 'chrN', but the error 'Error in value cond[[3L]] : record 1 (chrchr1:8022746-8023062) failed' occurs. So I read this topic and tried the same. I removed 'chr' char in fasta using 'sed', but the error still occurs ('Error in value cond[[3L]] : record 1 (chr1:8022746-8023062) failed').

As log error says I correctly replace 'chr' with '' but the function continues to output me error. I want to know if you think I have to change bed files too or if I have to format fasta in another way (maybe as you did with biostrings).

Thank you so much!!

P.S. I tried both 'include.chr=T/F'

evy-beckers commented 2 years ago

Hi @AndreaG5 ,

I encountered the same problem and I managed to solve it (for my data at least). The problem for me wasn't in the size or naming of the chromosomes, but was located in my reference index file. I was using a subset of my data (one chromosome) and had changed the heading of my FASTA-file (in accordance to the chromosme naming in my BED file). However, the index-file of my reference hadn't been changed accordingly (I adjusted it manually, but you can also re-index your FASTA file). Maybe your problem lies here as well? Hope this helps!

vplagnol / ExomeDepth

Fasta files #14