tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0
94 stars 31 forks source link

[E::fai_retrieve] Failed to retrieve block: unexpected end of file #66

Closed matteodelucchi closed 5 years ago

matteodelucchi commented 5 years ago

Background info: I work with RNAseq data. After mapping the read data I filtered them with samtools view -q 10 -b test.bam > test.aligned_reads.q10.bam followed by sorting and indexing. Which then results in the file test2.1.my_sorted.bam (and test2.1.my_sorted.bam.bai) with the following header:

$ samtools view -H test2.1.my_sorted.bam | grep @RG
@RG ID:0    CN:BCM  DS:TCGA Colon   DT:2011-09-05T10:00:00+0000 LB:IWX_TCOL.A6-2685-01A-T_2pA   PL:Illumina PU:110823_SN881_0126_AD08JDACXX_8_ID07  SM:TCGA-A6-2685-01A-01D-1408-10
@PG ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa mem -t 8 -T 0 -R @RG\tCN:BCM\tDS:TCGA Colon\tDT:2011-09-05T10\tID:0\tLB:IWX_TCOL.A6-2685-01A-T_2pA\tPL:Illumina\tPU:110823_SN881_0126_AD08JDACXX_8_ID07\tSM:TCGA-A6-2685-01A-01D-1408-10 /alignment/reference/GRCh38.d1.vd1.fa /alignment/scratchkIHmnG/fastq/rmdup/0_1.fq.gz /alignment/scratchkIHmnG/fastq/rmdup/0_2.fq.gz

actual problem: If I input this file (just for testing purposes) in HipSTR, I get this error:

$ ./HipSTR/HipSTR          --bams       /some/path/test2.1.my_sorted.bam \
>                                    --fasta     /some/path/all_chroms.fa \
>                                    --regions  /some/path/hg19.hipstr_reference.bed \
>                                    --str-vcf   test2.vcf.gz \
>                                    --log       test2.log \
>                                    --viz-out   test2.viz.gz \
>                                    --min-reads 25 --def-stutter-model \

[E::fai_retrieve] Failed to retrieve block: unexpected end of file
HipSTR: src/fasta_reader.h:74: void FastaReader::get_sequence(const string&,td::string&): Assertion `result != __null' failed.
Aborted (core dumped)

First, I thought it might be an issue with the .bam file. However the FastaReader let me go away from that idea... The HipSTR-tutorial works perfectly and I use here the same files for all_chroms.fa and hg19.hipstr_reference.bed as in the tutorial.

tfwillems commented 5 years ago

Hi @matteodelucchi ,

From your BAM header, it looks like you mapped the reads to GRCh38. But the FASTA file and STR reference file you're feeding in are for hg19.

As a result, htslib, an internal library used by HipSTR, crashes while trying to access a region that's not defined.

To fix this issue, you should use the correct GRCh38 FASTA file and STR reference file, the latter of which is available here.

Best, Thomas

matteodelucchi commented 5 years ago

Oh, yes! My bad! Thanks for the hint!

However, with the reference file you linked (GRCh38.hipstr_reference.bed ) and this GRCh38 fasta it gave me this error (log-file):

Detected 1 BAM/CRAM files
BAMs/CRAMs contain unique read group IDs for 1 unique libraries and 1 unique samples
Reading region file /some/path/GRCh38.hipstr_reference.bed
Region file contains 1638945 regions

ERROR: No sequence for chromosome 1 found in the FASTA file
    Please ensure that the chromosome names in your region BED file match those in you FASTA file
    NOTE: Found chromosome chr1 in the FASTA, but not chromosome 1

Which i could resolve using the hg38.hipstr_reference.bed instead.

In case any one else will experience a similar issue in the future, the difference seems to be mostly in the labeling of the chromosomes:

$ head -n 3 hg38.hipstr_reference.bed 
chr1    10001   10468   6   78  Human_STR_1 AACCCT
chr1    16717   16744   3   9.33333 Human_STR_2 GGT
chr1    26454   26465   2   6   Human_STR_3 GT

$ head -n 3 GRCh38.hipstr_reference.bed 
1   10001   10468   6   78  Human_STR_1 AACCCT
1   16717   16744   3   9.33333 Human_STR_2 GGT
1   26454   26465   2   6   Human_STR_3 GT